All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-24 16:02 ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-24 16:02 UTC (permalink / raw)
  To: teigland; +Cc: cluster-devel, agruenba, aahringo, stable

This patch fixes a possible plock op collisions when using F_SETLKW lock
requests and fsid, number and owner are not enough to identify a result
for a pending request. The ltp testcases [0] and [1] are examples when
this is not enough in case of using classic posix locks with threads and
open filedescriptor posix locks.

The idea to fix the issue here is to place all lock request in order. In
case of non F_SETLKW lock request (indicated if wait is set or not) the
lock requests are ordered inside the recv_list. If a result comes back
the right plock op can be found by the first plock_op in recv_list which
has not info.wait set. This can be done only by non F_SETLKW plock ops as
dlm_controld always reads a specific plock op (list_move_tail() from
send_list to recv_mlist) and write the result immediately back.

This behaviour is for F_SETLKW not possible as multiple waiters can be
get a result back in an random order. To avoid a collisions in cases
like [0] or [1] this patch adds more fields to compare the plock
operations as the lock request is the same. This is also being made in
NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
still can't find the exact lock request for a specific result if the
lock request is the same, but if this is the case we don't care the
order how the identical lock requests get their result back to grant the
lock.

[0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
[1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
---
change since v2:
 - don't split recv_list into recv_setlkw_list

 fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
 1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
index 31bc601ee3d8..53d17dbbb716 100644
--- a/fs/dlm/plock.c
+++ b/fs/dlm/plock.c
@@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
 		if (op->info.flags & DLM_PLOCK_FL_CLOSE)
 			list_del(&op->list);
 		else
-			list_move(&op->list, &recv_list);
+			list_move_tail(&op->list, &recv_list);
 		memcpy(&info, &op->info, sizeof(info));
 	}
 	spin_unlock(&ops_lock);
@@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
 		return -EINVAL;
 
 	spin_lock(&ops_lock);
-	list_for_each_entry(iter, &recv_list, list) {
-		if (iter->info.fsid == info.fsid &&
-		    iter->info.number == info.number &&
-		    iter->info.owner == info.owner) {
-			list_del_init(&iter->list);
-			memcpy(&iter->info, &info, sizeof(info));
-			if (iter->data)
-				do_callback = 1;
-			else
-				iter->done = 1;
-			op = iter;
-			break;
+	if (info.wait) {
+		list_for_each_entry(iter, &recv_list, list) {
+			if (iter->info.fsid == info.fsid &&
+			    iter->info.number == info.number &&
+			    iter->info.owner == info.owner &&
+			    iter->info.pid == info.pid &&
+			    iter->info.start == info.start &&
+			    iter->info.end == info.end &&
+			    iter->info.ex == info.ex &&
+			    iter->info.wait) {
+				op = iter;
+				break;
+			}
 		}
+	} else {
+		list_for_each_entry(iter, &recv_list, list) {
+			if (!iter->info.wait) {
+				op = iter;
+				break;
+			}
+		}
+	}
+
+	if (op) {
+		list_del_init(&op->list);
+		memcpy(&op->info, &info, sizeof(info));
+		if (op->data)
+			do_callback = 1;
+		else
+			op->done = 1;
 	}
 	spin_unlock(&ops_lock);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-24 16:02 ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-24 16:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch fixes a possible plock op collisions when using F_SETLKW lock
requests and fsid, number and owner are not enough to identify a result
for a pending request. The ltp testcases [0] and [1] are examples when
this is not enough in case of using classic posix locks with threads and
open filedescriptor posix locks.

The idea to fix the issue here is to place all lock request in order. In
case of non F_SETLKW lock request (indicated if wait is set or not) the
lock requests are ordered inside the recv_list. If a result comes back
the right plock op can be found by the first plock_op in recv_list which
has not info.wait set. This can be done only by non F_SETLKW plock ops as
dlm_controld always reads a specific plock op (list_move_tail() from
send_list to recv_mlist) and write the result immediately back.

This behaviour is for F_SETLKW not possible as multiple waiters can be
get a result back in an random order. To avoid a collisions in cases
like [0] or [1] this patch adds more fields to compare the plock
operations as the lock request is the same. This is also being made in
NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
still can't find the exact lock request for a specific result if the
lock request is the same, but if this is the case we don't care the
order how the identical lock requests get their result back to grant the
lock.

[0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
[1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731

Cc: stable at vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
---
change since v2:
 - don't split recv_list into recv_setlkw_list

 fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
 1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
index 31bc601ee3d8..53d17dbbb716 100644
--- a/fs/dlm/plock.c
+++ b/fs/dlm/plock.c
@@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
 		if (op->info.flags & DLM_PLOCK_FL_CLOSE)
 			list_del(&op->list);
 		else
-			list_move(&op->list, &recv_list);
+			list_move_tail(&op->list, &recv_list);
 		memcpy(&info, &op->info, sizeof(info));
 	}
 	spin_unlock(&ops_lock);
@@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
 		return -EINVAL;
 
 	spin_lock(&ops_lock);
-	list_for_each_entry(iter, &recv_list, list) {
-		if (iter->info.fsid == info.fsid &&
-		    iter->info.number == info.number &&
-		    iter->info.owner == info.owner) {
-			list_del_init(&iter->list);
-			memcpy(&iter->info, &info, sizeof(info));
-			if (iter->data)
-				do_callback = 1;
-			else
-				iter->done = 1;
-			op = iter;
-			break;
+	if (info.wait) {
+		list_for_each_entry(iter, &recv_list, list) {
+			if (iter->info.fsid == info.fsid &&
+			    iter->info.number == info.number &&
+			    iter->info.owner == info.owner &&
+			    iter->info.pid == info.pid &&
+			    iter->info.start == info.start &&
+			    iter->info.end == info.end &&
+			    iter->info.ex == info.ex &&
+			    iter->info.wait) {
+				op = iter;
+				break;
+			}
 		}
+	} else {
+		list_for_each_entry(iter, &recv_list, list) {
+			if (!iter->info.wait) {
+				op = iter;
+				break;
+			}
+		}
+	}
+
+	if (op) {
+		list_del_init(&op->list);
+		memcpy(&op->info, &info, sizeof(info));
+		if (op->data)
+			do_callback = 1;
+		else
+			op->done = 1;
 	}
 	spin_unlock(&ops_lock);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-05-24 16:02 ` [Cluster-devel] " Alexander Aring
@ 2023-05-25 15:02   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-05-25 15:02 UTC (permalink / raw)
  To: Alexander Aring; +Cc: teigland, cluster-devel, stable

On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> This patch fixes a possible plock op collisions when using F_SETLKW lock
> requests and fsid, number and owner are not enough to identify a result
> for a pending request. The ltp testcases [0] and [1] are examples when
> this is not enough in case of using classic posix locks with threads and
> open filedescriptor posix locks.
>
> The idea to fix the issue here is to place all lock request in order. In
> case of non F_SETLKW lock request (indicated if wait is set or not) the
> lock requests are ordered inside the recv_list. If a result comes back
> the right plock op can be found by the first plock_op in recv_list which
> has not info.wait set. This can be done only by non F_SETLKW plock ops as
> dlm_controld always reads a specific plock op (list_move_tail() from
> send_list to recv_mlist) and write the result immediately back.
>
> This behaviour is for F_SETLKW not possible as multiple waiters can be
> get a result back in an random order. To avoid a collisions in cases
> like [0] or [1] this patch adds more fields to compare the plock
> operations as the lock request is the same. This is also being made in
> NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> still can't find the exact lock request for a specific result if the
> lock request is the same, but if this is the case we don't care the
> order how the identical lock requests get their result back to grant the
> lock.

When the recv_list contains multiple indistinguishable requests, this
can only be because they originated from multiple threads of the same
process. In that case, I agree that it doesn't matter which of those
requests we "complete" in dev_write() as long as we only complete one
request. We do need to compare the additional request fields in
dev_write() to find a suitable request, so that makes sense as well.
We need to compare all of the fields that identify a request (optype,
ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
"right" request (or in case there is more than one identical request,
a "suitable" request).

The above patch description doesn't match the code anymore, and the
code doesn't fully revert the recv_list splitting of the previous
version.

> [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Alexander Aring <aahringo@redhat.com>
> ---
> change since v2:
>  - don't split recv_list into recv_setlkw_list
>
>  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
>  1 file changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 31bc601ee3d8..53d17dbbb716 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
>                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
>                         list_del(&op->list);
>                 else
> -                       list_move(&op->list, &recv_list);
> +                       list_move_tail(&op->list, &recv_list);

^ This should be obsolete, but it won't hurt, either.

>                 memcpy(&info, &op->info, sizeof(info));
>         }
>         spin_unlock(&ops_lock);
> @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
>                 return -EINVAL;
>
>         spin_lock(&ops_lock);
> -       list_for_each_entry(iter, &recv_list, list) {
> -               if (iter->info.fsid == info.fsid &&
> -                   iter->info.number == info.number &&
> -                   iter->info.owner == info.owner) {
> -                       list_del_init(&iter->list);
> -                       memcpy(&iter->info, &info, sizeof(info));
> -                       if (iter->data)
> -                               do_callback = 1;
> -                       else
> -                               iter->done = 1;
> -                       op = iter;
> -                       break;
> +       if (info.wait) {

We should be able to use the same list_for_each_entry() loop for
F_SETLKW requests (which have info.wait set) as for all other requests
as far as I can see.

> +               list_for_each_entry(iter, &recv_list, list) {
> +                       if (iter->info.fsid == info.fsid &&
> +                           iter->info.number == info.number &&
> +                           iter->info.owner == info.owner &&
> +                           iter->info.pid == info.pid &&
> +                           iter->info.start == info.start &&
> +                           iter->info.end == info.end &&
> +                           iter->info.ex == info.ex &&
> +                           iter->info.wait) {
> +                               op = iter;
> +                               break;
> +                       }
>                 }
> +       } else {
> +               list_for_each_entry(iter, &recv_list, list) {
> +                       if (!iter->info.wait) {
> +                               op = iter;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       if (op) {
> +               list_del_init(&op->list);
> +               memcpy(&op->info, &info, sizeof(info));
> +               if (op->data)
> +                       do_callback = 1;
> +               else
> +                       op->done = 1;
>         }

Can't this code just remain in the list_for_each_entry() loop?

>         spin_unlock(&ops_lock);
>
> --
> 2.31.1
>

Andreas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-25 15:02   ` Andreas Gruenbacher
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-05-25 15:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> This patch fixes a possible plock op collisions when using F_SETLKW lock
> requests and fsid, number and owner are not enough to identify a result
> for a pending request. The ltp testcases [0] and [1] are examples when
> this is not enough in case of using classic posix locks with threads and
> open filedescriptor posix locks.
>
> The idea to fix the issue here is to place all lock request in order. In
> case of non F_SETLKW lock request (indicated if wait is set or not) the
> lock requests are ordered inside the recv_list. If a result comes back
> the right plock op can be found by the first plock_op in recv_list which
> has not info.wait set. This can be done only by non F_SETLKW plock ops as
> dlm_controld always reads a specific plock op (list_move_tail() from
> send_list to recv_mlist) and write the result immediately back.
>
> This behaviour is for F_SETLKW not possible as multiple waiters can be
> get a result back in an random order. To avoid a collisions in cases
> like [0] or [1] this patch adds more fields to compare the plock
> operations as the lock request is the same. This is also being made in
> NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> still can't find the exact lock request for a specific result if the
> lock request is the same, but if this is the case we don't care the
> order how the identical lock requests get their result back to grant the
> lock.

When the recv_list contains multiple indistinguishable requests, this
can only be because they originated from multiple threads of the same
process. In that case, I agree that it doesn't matter which of those
requests we "complete" in dev_write() as long as we only complete one
request. We do need to compare the additional request fields in
dev_write() to find a suitable request, so that makes sense as well.
We need to compare all of the fields that identify a request (optype,
ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
"right" request (or in case there is more than one identical request,
a "suitable" request).

The above patch description doesn't match the code anymore, and the
code doesn't fully revert the recv_list splitting of the previous
version.

> [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
>
> Cc: stable at vger.kernel.org
> Signed-off-by: Alexander Aring <aahringo@redhat.com>
> ---
> change since v2:
>  - don't split recv_list into recv_setlkw_list
>
>  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
>  1 file changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 31bc601ee3d8..53d17dbbb716 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
>                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
>                         list_del(&op->list);
>                 else
> -                       list_move(&op->list, &recv_list);
> +                       list_move_tail(&op->list, &recv_list);

^ This should be obsolete, but it won't hurt, either.

>                 memcpy(&info, &op->info, sizeof(info));
>         }
>         spin_unlock(&ops_lock);
> @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
>                 return -EINVAL;
>
>         spin_lock(&ops_lock);
> -       list_for_each_entry(iter, &recv_list, list) {
> -               if (iter->info.fsid == info.fsid &&
> -                   iter->info.number == info.number &&
> -                   iter->info.owner == info.owner) {
> -                       list_del_init(&iter->list);
> -                       memcpy(&iter->info, &info, sizeof(info));
> -                       if (iter->data)
> -                               do_callback = 1;
> -                       else
> -                               iter->done = 1;
> -                       op = iter;
> -                       break;
> +       if (info.wait) {

We should be able to use the same list_for_each_entry() loop for
F_SETLKW requests (which have info.wait set) as for all other requests
as far as I can see.

> +               list_for_each_entry(iter, &recv_list, list) {
> +                       if (iter->info.fsid == info.fsid &&
> +                           iter->info.number == info.number &&
> +                           iter->info.owner == info.owner &&
> +                           iter->info.pid == info.pid &&
> +                           iter->info.start == info.start &&
> +                           iter->info.end == info.end &&
> +                           iter->info.ex == info.ex &&
> +                           iter->info.wait) {
> +                               op = iter;
> +                               break;
> +                       }
>                 }
> +       } else {
> +               list_for_each_entry(iter, &recv_list, list) {
> +                       if (!iter->info.wait) {
> +                               op = iter;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       if (op) {
> +               list_del_init(&op->list);
> +               memcpy(&op->info, &info, sizeof(info));
> +               if (op->data)
> +                       do_callback = 1;
> +               else
> +                       op->done = 1;
>         }

Can't this code just remain in the list_for_each_entry() loop?

>         spin_unlock(&ops_lock);
>
> --
> 2.31.1
>

Andreas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-05-25 15:02   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-05-29 22:18     ` Alexander Aring
  -1 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-29 22:18 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: teigland, cluster-devel, stable

Hi,

On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > requests and fsid, number and owner are not enough to identify a result
> > for a pending request. The ltp testcases [0] and [1] are examples when
> > this is not enough in case of using classic posix locks with threads and
> > open filedescriptor posix locks.
> >
> > The idea to fix the issue here is to place all lock request in order. In
> > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > lock requests are ordered inside the recv_list. If a result comes back
> > the right plock op can be found by the first plock_op in recv_list which
> > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > dlm_controld always reads a specific plock op (list_move_tail() from
> > send_list to recv_mlist) and write the result immediately back.
> >
> > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > get a result back in an random order. To avoid a collisions in cases
> > like [0] or [1] this patch adds more fields to compare the plock
> > operations as the lock request is the same. This is also being made in
> > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > still can't find the exact lock request for a specific result if the
> > lock request is the same, but if this is the case we don't care the
> > order how the identical lock requests get their result back to grant the
> > lock.
>
> When the recv_list contains multiple indistinguishable requests, this
> can only be because they originated from multiple threads of the same
> process. In that case, I agree that it doesn't matter which of those
> requests we "complete" in dev_write() as long as we only complete one
> request. We do need to compare the additional request fields in
> dev_write() to find a suitable request, so that makes sense as well.
> We need to compare all of the fields that identify a request (optype,
> ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> "right" request (or in case there is more than one identical request,
> a "suitable" request).
>

In my "definition" why this works is as you said the "identical
request". There is a more deeper definition of "when is a request
identical" and in my opinion it is here as: "A request A is identical
to request B when they get granted under the same 'time'" which is all
the fields you mentioned.

Even with cancellation (F_SETLKW only) it does not matter which
"identical request" you cancel because the kernel and user
(dlm_controld) makes no relation between a lock request instance. You
need to have at least the same amount of "results" coming back from
user space as the amount you are waiting for a result for the same
"identical request".

> The above patch description doesn't match the code anymore, and the
> code doesn't fully revert the recv_list splitting of the previous
> version.
>

This isn't a revert. Is it a new patch version, I did drop the
recv_setlkw_list here, dropping in means of removing the
recv_setlkw_list and handling everything in the recv_list. Although
there might be a performance impact by splitting the requests in two
lists as we don't need to jump over all F_SETLKW requests.

> > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> >
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > ---
> > change since v2:
> >  - don't split recv_list into recv_setlkw_list
> >
> >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> >  1 file changed, 30 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > index 31bc601ee3d8..53d17dbbb716 100644
> > --- a/fs/dlm/plock.c
> > +++ b/fs/dlm/plock.c
> > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> >                         list_del(&op->list);
> >                 else
> > -                       list_move(&op->list, &recv_list);
> > +                       list_move_tail(&op->list, &recv_list);
>
> ^ This should be obsolete, but it won't hurt, either.
>

No it is necessary, I tested it and looked deeper into the reason.
dlm_controld handles the lock requests in an ordered way over a
select() mechanism, but it will not always write a result back when
it's read the request out. This is the case for F_SETLKW but also for
all other plock op requests, such as F_GETLK. Instead of writing the
result back it will send it to corosync and the corosync select()
mechanism will write the result back. Corosync will keep the order to
write the result back. Due the fact that it's going through corosync
multiple non F_SETLKW can be queued up in recv_list and need to be
appended on the tail to later find the first entry which is non
F_SETLKW to find the result.

This ordered lock request read and write the result back (for non
F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
always this way.

> >                 memcpy(&info, &op->info, sizeof(info));
> >         }
> >         spin_unlock(&ops_lock);
> > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> >                 return -EINVAL;
> >
> >         spin_lock(&ops_lock);
> > -       list_for_each_entry(iter, &recv_list, list) {
> > -               if (iter->info.fsid == info.fsid &&
> > -                   iter->info.number == info.number &&
> > -                   iter->info.owner == info.owner) {
> > -                       list_del_init(&iter->list);
> > -                       memcpy(&iter->info, &info, sizeof(info));
> > -                       if (iter->data)
> > -                               do_callback = 1;
> > -                       else
> > -                               iter->done = 1;
> > -                       op = iter;
> > -                       break;
> > +       if (info.wait) {
>
> We should be able to use the same list_for_each_entry() loop for
> F_SETLKW requests (which have info.wait set) as for all other requests
> as far as I can see.
>

We can't match non F_SETLKW operations on all fields because F_GETLK
will change some fields when it's handled in user space. This is the
whole reason why the ordered handling is done here.

However there can be matched more fields but because F_GETLK we
require that this mechanism works in the above mentioned ordered way.
Those fields are checked by WARN_ON() that we get aware about changes
and "things" doesn't work anymore as they should.

> > +               list_for_each_entry(iter, &recv_list, list) {
> > +                       if (iter->info.fsid == info.fsid &&
> > +                           iter->info.number == info.number &&
> > +                           iter->info.owner == info.owner &&
> > +                           iter->info.pid == info.pid &&
> > +                           iter->info.start == info.start &&
> > +                           iter->info.end == info.end &&
> > +                           iter->info.ex == info.ex &&
> > +                           iter->info.wait) {
> > +                               op = iter;
> > +                               break;
> > +                       }
> >                 }
> > +       } else {
> > +               list_for_each_entry(iter, &recv_list, list) {
> > +                       if (!iter->info.wait) {
> > +                               op = iter;
> > +                               break;
> > +                       }
> > +               }
> > +       }
> > +
> > +       if (op) {
> > +               list_del_init(&op->list);
> > +               memcpy(&op->info, &info, sizeof(info));
> > +               if (op->data)
> > +                       do_callback = 1;
> > +               else
> > +                       op->done = 1;
> >         }
>
> Can't this code just remain in the list_for_each_entry() loop?
>

It can, but we need two of them then in each loop because two loops
are necessary (see above).

- Alex


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-29 22:18     ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-29 22:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > requests and fsid, number and owner are not enough to identify a result
> > for a pending request. The ltp testcases [0] and [1] are examples when
> > this is not enough in case of using classic posix locks with threads and
> > open filedescriptor posix locks.
> >
> > The idea to fix the issue here is to place all lock request in order. In
> > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > lock requests are ordered inside the recv_list. If a result comes back
> > the right plock op can be found by the first plock_op in recv_list which
> > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > dlm_controld always reads a specific plock op (list_move_tail() from
> > send_list to recv_mlist) and write the result immediately back.
> >
> > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > get a result back in an random order. To avoid a collisions in cases
> > like [0] or [1] this patch adds more fields to compare the plock
> > operations as the lock request is the same. This is also being made in
> > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > still can't find the exact lock request for a specific result if the
> > lock request is the same, but if this is the case we don't care the
> > order how the identical lock requests get their result back to grant the
> > lock.
>
> When the recv_list contains multiple indistinguishable requests, this
> can only be because they originated from multiple threads of the same
> process. In that case, I agree that it doesn't matter which of those
> requests we "complete" in dev_write() as long as we only complete one
> request. We do need to compare the additional request fields in
> dev_write() to find a suitable request, so that makes sense as well.
> We need to compare all of the fields that identify a request (optype,
> ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> "right" request (or in case there is more than one identical request,
> a "suitable" request).
>

In my "definition" why this works is as you said the "identical
request". There is a more deeper definition of "when is a request
identical" and in my opinion it is here as: "A request A is identical
to request B when they get granted under the same 'time'" which is all
the fields you mentioned.

Even with cancellation (F_SETLKW only) it does not matter which
"identical request" you cancel because the kernel and user
(dlm_controld) makes no relation between a lock request instance. You
need to have at least the same amount of "results" coming back from
user space as the amount you are waiting for a result for the same
"identical request".

> The above patch description doesn't match the code anymore, and the
> code doesn't fully revert the recv_list splitting of the previous
> version.
>

This isn't a revert. Is it a new patch version, I did drop the
recv_setlkw_list here, dropping in means of removing the
recv_setlkw_list and handling everything in the recv_list. Although
there might be a performance impact by splitting the requests in two
lists as we don't need to jump over all F_SETLKW requests.

> > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> >
> > Cc: stable at vger.kernel.org
> > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > ---
> > change since v2:
> >  - don't split recv_list into recv_setlkw_list
> >
> >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> >  1 file changed, 30 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > index 31bc601ee3d8..53d17dbbb716 100644
> > --- a/fs/dlm/plock.c
> > +++ b/fs/dlm/plock.c
> > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> >                         list_del(&op->list);
> >                 else
> > -                       list_move(&op->list, &recv_list);
> > +                       list_move_tail(&op->list, &recv_list);
>
> ^ This should be obsolete, but it won't hurt, either.
>

No it is necessary, I tested it and looked deeper into the reason.
dlm_controld handles the lock requests in an ordered way over a
select() mechanism, but it will not always write a result back when
it's read the request out. This is the case for F_SETLKW but also for
all other plock op requests, such as F_GETLK. Instead of writing the
result back it will send it to corosync and the corosync select()
mechanism will write the result back. Corosync will keep the order to
write the result back. Due the fact that it's going through corosync
multiple non F_SETLKW can be queued up in recv_list and need to be
appended on the tail to later find the first entry which is non
F_SETLKW to find the result.

This ordered lock request read and write the result back (for non
F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
always this way.

> >                 memcpy(&info, &op->info, sizeof(info));
> >         }
> >         spin_unlock(&ops_lock);
> > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> >                 return -EINVAL;
> >
> >         spin_lock(&ops_lock);
> > -       list_for_each_entry(iter, &recv_list, list) {
> > -               if (iter->info.fsid == info.fsid &&
> > -                   iter->info.number == info.number &&
> > -                   iter->info.owner == info.owner) {
> > -                       list_del_init(&iter->list);
> > -                       memcpy(&iter->info, &info, sizeof(info));
> > -                       if (iter->data)
> > -                               do_callback = 1;
> > -                       else
> > -                               iter->done = 1;
> > -                       op = iter;
> > -                       break;
> > +       if (info.wait) {
>
> We should be able to use the same list_for_each_entry() loop for
> F_SETLKW requests (which have info.wait set) as for all other requests
> as far as I can see.
>

We can't match non F_SETLKW operations on all fields because F_GETLK
will change some fields when it's handled in user space. This is the
whole reason why the ordered handling is done here.

However there can be matched more fields but because F_GETLK we
require that this mechanism works in the above mentioned ordered way.
Those fields are checked by WARN_ON() that we get aware about changes
and "things" doesn't work anymore as they should.

> > +               list_for_each_entry(iter, &recv_list, list) {
> > +                       if (iter->info.fsid == info.fsid &&
> > +                           iter->info.number == info.number &&
> > +                           iter->info.owner == info.owner &&
> > +                           iter->info.pid == info.pid &&
> > +                           iter->info.start == info.start &&
> > +                           iter->info.end == info.end &&
> > +                           iter->info.ex == info.ex &&
> > +                           iter->info.wait) {
> > +                               op = iter;
> > +                               break;
> > +                       }
> >                 }
> > +       } else {
> > +               list_for_each_entry(iter, &recv_list, list) {
> > +                       if (!iter->info.wait) {
> > +                               op = iter;
> > +                               break;
> > +                       }
> > +               }
> > +       }
> > +
> > +       if (op) {
> > +               list_del_init(&op->list);
> > +               memcpy(&op->info, &info, sizeof(info));
> > +               if (op->data)
> > +                       do_callback = 1;
> > +               else
> > +                       op->done = 1;
> >         }
>
> Can't this code just remain in the list_for_each_entry() loop?
>

It can, but we need two of them then in each loop because two loops
are necessary (see above).

- Alex


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-05-29 22:18     ` [Cluster-devel] " Alexander Aring
@ 2023-05-29 22:20       ` Alexander Aring
  -1 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-29 22:20 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: teigland, cluster-devel, stable

Hi,

On Mon, May 29, 2023 at 6:18 PM Alexander Aring <aahringo@redhat.com> wrote:
>
> Hi,
>
> On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > requests and fsid, number and owner are not enough to identify a result
> > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > this is not enough in case of using classic posix locks with threads and
> > > open filedescriptor posix locks.
> > >
> > > The idea to fix the issue here is to place all lock request in order. In
> > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > lock requests are ordered inside the recv_list. If a result comes back
> > > the right plock op can be found by the first plock_op in recv_list which
> > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > send_list to recv_mlist) and write the result immediately back.
> > >
> > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > get a result back in an random order. To avoid a collisions in cases
> > > like [0] or [1] this patch adds more fields to compare the plock
> > > operations as the lock request is the same. This is also being made in
> > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > still can't find the exact lock request for a specific result if the
> > > lock request is the same, but if this is the case we don't care the
> > > order how the identical lock requests get their result back to grant the
> > > lock.
> >
> > When the recv_list contains multiple indistinguishable requests, this
> > can only be because they originated from multiple threads of the same
> > process. In that case, I agree that it doesn't matter which of those
> > requests we "complete" in dev_write() as long as we only complete one
> > request. We do need to compare the additional request fields in
> > dev_write() to find a suitable request, so that makes sense as well.
> > We need to compare all of the fields that identify a request (optype,
> > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > "right" request (or in case there is more than one identical request,
> > a "suitable" request).
> >
>
> In my "definition" why this works is as you said the "identical
> request". There is a more deeper definition of "when is a request
> identical" and in my opinion it is here as: "A request A is identical
> to request B when they get granted under the same 'time'" which is all
> the fields you mentioned.

s/under/at/

at the same 'time' or under the same conditions...

- Alex


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-29 22:20       ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-29 22:20 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Mon, May 29, 2023 at 6:18?PM Alexander Aring <aahringo@redhat.com> wrote:
>
> Hi,
>
> On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > requests and fsid, number and owner are not enough to identify a result
> > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > this is not enough in case of using classic posix locks with threads and
> > > open filedescriptor posix locks.
> > >
> > > The idea to fix the issue here is to place all lock request in order. In
> > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > lock requests are ordered inside the recv_list. If a result comes back
> > > the right plock op can be found by the first plock_op in recv_list which
> > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > send_list to recv_mlist) and write the result immediately back.
> > >
> > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > get a result back in an random order. To avoid a collisions in cases
> > > like [0] or [1] this patch adds more fields to compare the plock
> > > operations as the lock request is the same. This is also being made in
> > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > still can't find the exact lock request for a specific result if the
> > > lock request is the same, but if this is the case we don't care the
> > > order how the identical lock requests get their result back to grant the
> > > lock.
> >
> > When the recv_list contains multiple indistinguishable requests, this
> > can only be because they originated from multiple threads of the same
> > process. In that case, I agree that it doesn't matter which of those
> > requests we "complete" in dev_write() as long as we only complete one
> > request. We do need to compare the additional request fields in
> > dev_write() to find a suitable request, so that makes sense as well.
> > We need to compare all of the fields that identify a request (optype,
> > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > "right" request (or in case there is more than one identical request,
> > a "suitable" request).
> >
>
> In my "definition" why this works is as you said the "identical
> request". There is a more deeper definition of "when is a request
> identical" and in my opinion it is here as: "A request A is identical
> to request B when they get granted under the same 'time'" which is all
> the fields you mentioned.

s/under/at/

at the same 'time' or under the same conditions...

- Alex


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-05-29 22:18     ` [Cluster-devel] " Alexander Aring
@ 2023-05-30 11:01       ` Andreas Gruenbacher
  -1 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-05-30 11:01 UTC (permalink / raw)
  To: Alexander Aring; +Cc: teigland, cluster-devel, stable

On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > requests and fsid, number and owner are not enough to identify a result
> > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > this is not enough in case of using classic posix locks with threads and
> > > open filedescriptor posix locks.
> > >
> > > The idea to fix the issue here is to place all lock request in order. In
> > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > lock requests are ordered inside the recv_list. If a result comes back
> > > the right plock op can be found by the first plock_op in recv_list which
> > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > send_list to recv_mlist) and write the result immediately back.
> > >
> > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > get a result back in an random order. To avoid a collisions in cases
> > > like [0] or [1] this patch adds more fields to compare the plock
> > > operations as the lock request is the same. This is also being made in
> > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > still can't find the exact lock request for a specific result if the
> > > lock request is the same, but if this is the case we don't care the
> > > order how the identical lock requests get their result back to grant the
> > > lock.
> >
> > When the recv_list contains multiple indistinguishable requests, this
> > can only be because they originated from multiple threads of the same
> > process. In that case, I agree that it doesn't matter which of those
> > requests we "complete" in dev_write() as long as we only complete one
> > request. We do need to compare the additional request fields in
> > dev_write() to find a suitable request, so that makes sense as well.
> > We need to compare all of the fields that identify a request (optype,
> > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > "right" request (or in case there is more than one identical request,
> > a "suitable" request).
> >
>
> In my "definition" why this works is as you said the "identical
> request". There is a more deeper definition of "when is a request
> identical" and in my opinion it is here as: "A request A is identical
> to request B when they get granted under the same 'time'" which is all
> the fields you mentioned.
>
> Even with cancellation (F_SETLKW only) it does not matter which
> "identical request" you cancel because the kernel and user
> (dlm_controld) makes no relation between a lock request instance. You
> need to have at least the same amount of "results" coming back from
> user space as the amount you are waiting for a result for the same
> "identical request".

That's not incorrect per se, but cancellations create an additional
difficulty: they can either succeed or fail. To indicate that a
cancellation has succeeded, a new type of message can be introduced
(say, "CANCELLED"), and it's obvious that a CANCELLED message can only
belong to a locking request that is being cancelled. When cancelling a
locking request fails, the kernel will see a "locking request granted"
message though, and when multiple identical locking requests are
queued and only some of them have been cancelled, it won't be obvious
which locking request a "locking request granted" message should be
assigned to anymore. You really don't want to mix things up in that
case.

This complication makes it a whole lot more difficult to reason about
the correctness of the code. All that complexity is avoidable by
sticking with a fixed mapping of requests and replies (i.e., a unique
request identifier).

To put it differently, you can shoot yourself in the foot and still
hop along on the other leg, but it may not be the best of all possible
ideas.

> > The above patch description doesn't match the code anymore, and the
> > code doesn't fully revert the recv_list splitting of the previous
> > version.
> >
>
> This isn't a revert. Is it a new patch version, I did drop the
> recv_setlkw_list here, dropping in means of removing the
> recv_setlkw_list and handling everything in the recv_list. Although
> there might be a performance impact by splitting the requests in two
> lists as we don't need to jump over all F_SETLKW requests.
>
> > > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> > >
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > > ---
> > > change since v2:
> > >  - don't split recv_list into recv_setlkw_list
> > >
> > >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> > >  1 file changed, 30 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > index 31bc601ee3d8..53d17dbbb716 100644
> > > --- a/fs/dlm/plock.c
> > > +++ b/fs/dlm/plock.c
> > > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> > >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> > >                         list_del(&op->list);
> > >                 else
> > > -                       list_move(&op->list, &recv_list);
> > > +                       list_move_tail(&op->list, &recv_list);
> >
> > ^ This should be obsolete, but it won't hurt, either.
> >
>
> No it is necessary, I tested it and looked deeper into the reason.
> dlm_controld handles the lock requests in an ordered way over a
> select() mechanism, but it will not always write a result back when
> it's read the request out. This is the case for F_SETLKW but also for
> all other plock op requests, such as F_GETLK. Instead of writing the
> result back it will send it to corosync and the corosync select()
> mechanism will write the result back. Corosync will keep the order to
> write the result back. Due the fact that it's going through corosync
> multiple non F_SETLKW can be queued up in recv_list and need to be
> appended on the tail to later find the first entry which is non
> F_SETLKW to find the result.
>
> This ordered lock request read and write the result back (for non
> F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
> always this way.

This sounds pretty confused. Let's look at

> > >                 memcpy(&info, &op->info, sizeof(info));
> > >         }
> > >         spin_unlock(&ops_lock);
> > > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> > >                 return -EINVAL;
> > >
> > >         spin_lock(&ops_lock);
> > > -       list_for_each_entry(iter, &recv_list, list) {
> > > -               if (iter->info.fsid == info.fsid &&
> > > -                   iter->info.number == info.number &&
> > > -                   iter->info.owner == info.owner) {
> > > -                       list_del_init(&iter->list);
> > > -                       memcpy(&iter->info, &info, sizeof(info));
> > > -                       if (iter->data)
> > > -                               do_callback = 1;
> > > -                       else
> > > -                               iter->done = 1;
> > > -                       op = iter;
> > > -                       break;
> > > +       if (info.wait) {
> >
.> > We should be able to use the same list_for_each_entry() loop for
> > F_SETLKW requests (which have info.wait set) as for all other requests
> > as far as I can see.
> >
>
> We can't match non F_SETLKW operations on all fields because F_GETLK
> will change some fields when it's handled in user space. This is the
> whole reason why the ordered handling is done here.

I know that F_GETLK uses the l_type field to indicate the outcome of
the operation. But that happens in dlm_posix_get() when processing the
reply from dlm_controld; it doesn't affect info.optype or any other
fields in struct dlm_plock_info. So we actually can compare all of the
key fields in struct dlm_plock_info.

> However there can be matched more fields but because F_GETLK we
> require that this mechanism works in the above mentioned ordered way.
> Those fields are checked by WARN_ON() that we get aware about changes
> and "things" doesn't work anymore as they should.
>
> > > +               list_for_each_entry(iter, &recv_list, list) {
> > > +                       if (iter->info.fsid == info.fsid &&
> > > +                           iter->info.number == info.number &&
> > > +                           iter->info.owner == info.owner &&
> > > +                           iter->info.pid == info.pid &&
> > > +                           iter->info.start == info.start &&
> > > +                           iter->info.end == info.end &&
> > > +                           iter->info.ex == info.ex &&
> > > +                           iter->info.wait) {
> > > +                               op = iter;
> > > +                               break;
> > > +                       }
> > >                 }
> > > +       } else {
> > > +               list_for_each_entry(iter, &recv_list, list) {
> > > +                       if (!iter->info.wait) {
> > > +                               op = iter;
> > > +                               break;
> > > +                       }
> > > +               }
> > > +       }
> > > +
> > > +       if (op) {
> > > +               list_del_init(&op->list);
> > > +               memcpy(&op->info, &info, sizeof(info));
> > > +               if (op->data)
> > > +                       do_callback = 1;
> > > +               else
> > > +                       op->done = 1;
> > >         }
> >
> > Can't this code just remain in the list_for_each_entry() loop?
> >
>
> It can, but we need two of them then in each loop because two loops
> are necessary (see above).

Well yes, my comment was based on the fact that there actually
shouldn't be two loops.

Andreas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-30 11:01       ` Andreas Gruenbacher
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-05-30 11:01 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > requests and fsid, number and owner are not enough to identify a result
> > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > this is not enough in case of using classic posix locks with threads and
> > > open filedescriptor posix locks.
> > >
> > > The idea to fix the issue here is to place all lock request in order. In
> > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > lock requests are ordered inside the recv_list. If a result comes back
> > > the right plock op can be found by the first plock_op in recv_list which
> > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > send_list to recv_mlist) and write the result immediately back.
> > >
> > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > get a result back in an random order. To avoid a collisions in cases
> > > like [0] or [1] this patch adds more fields to compare the plock
> > > operations as the lock request is the same. This is also being made in
> > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > still can't find the exact lock request for a specific result if the
> > > lock request is the same, but if this is the case we don't care the
> > > order how the identical lock requests get their result back to grant the
> > > lock.
> >
> > When the recv_list contains multiple indistinguishable requests, this
> > can only be because they originated from multiple threads of the same
> > process. In that case, I agree that it doesn't matter which of those
> > requests we "complete" in dev_write() as long as we only complete one
> > request. We do need to compare the additional request fields in
> > dev_write() to find a suitable request, so that makes sense as well.
> > We need to compare all of the fields that identify a request (optype,
> > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > "right" request (or in case there is more than one identical request,
> > a "suitable" request).
> >
>
> In my "definition" why this works is as you said the "identical
> request". There is a more deeper definition of "when is a request
> identical" and in my opinion it is here as: "A request A is identical
> to request B when they get granted under the same 'time'" which is all
> the fields you mentioned.
>
> Even with cancellation (F_SETLKW only) it does not matter which
> "identical request" you cancel because the kernel and user
> (dlm_controld) makes no relation between a lock request instance. You
> need to have at least the same amount of "results" coming back from
> user space as the amount you are waiting for a result for the same
> "identical request".

That's not incorrect per se, but cancellations create an additional
difficulty: they can either succeed or fail. To indicate that a
cancellation has succeeded, a new type of message can be introduced
(say, "CANCELLED"), and it's obvious that a CANCELLED message can only
belong to a locking request that is being cancelled. When cancelling a
locking request fails, the kernel will see a "locking request granted"
message though, and when multiple identical locking requests are
queued and only some of them have been cancelled, it won't be obvious
which locking request a "locking request granted" message should be
assigned to anymore. You really don't want to mix things up in that
case.

This complication makes it a whole lot more difficult to reason about
the correctness of the code. All that complexity is avoidable by
sticking with a fixed mapping of requests and replies (i.e., a unique
request identifier).

To put it differently, you can shoot yourself in the foot and still
hop along on the other leg, but it may not be the best of all possible
ideas.

> > The above patch description doesn't match the code anymore, and the
> > code doesn't fully revert the recv_list splitting of the previous
> > version.
> >
>
> This isn't a revert. Is it a new patch version, I did drop the
> recv_setlkw_list here, dropping in means of removing the
> recv_setlkw_list and handling everything in the recv_list. Although
> there might be a performance impact by splitting the requests in two
> lists as we don't need to jump over all F_SETLKW requests.
>
> > > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> > >
> > > Cc: stable at vger.kernel.org
> > > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > > ---
> > > change since v2:
> > >  - don't split recv_list into recv_setlkw_list
> > >
> > >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> > >  1 file changed, 30 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > index 31bc601ee3d8..53d17dbbb716 100644
> > > --- a/fs/dlm/plock.c
> > > +++ b/fs/dlm/plock.c
> > > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> > >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> > >                         list_del(&op->list);
> > >                 else
> > > -                       list_move(&op->list, &recv_list);
> > > +                       list_move_tail(&op->list, &recv_list);
> >
> > ^ This should be obsolete, but it won't hurt, either.
> >
>
> No it is necessary, I tested it and looked deeper into the reason.
> dlm_controld handles the lock requests in an ordered way over a
> select() mechanism, but it will not always write a result back when
> it's read the request out. This is the case for F_SETLKW but also for
> all other plock op requests, such as F_GETLK. Instead of writing the
> result back it will send it to corosync and the corosync select()
> mechanism will write the result back. Corosync will keep the order to
> write the result back. Due the fact that it's going through corosync
> multiple non F_SETLKW can be queued up in recv_list and need to be
> appended on the tail to later find the first entry which is non
> F_SETLKW to find the result.
>
> This ordered lock request read and write the result back (for non
> F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
> always this way.

This sounds pretty confused. Let's look at

> > >                 memcpy(&info, &op->info, sizeof(info));
> > >         }
> > >         spin_unlock(&ops_lock);
> > > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> > >                 return -EINVAL;
> > >
> > >         spin_lock(&ops_lock);
> > > -       list_for_each_entry(iter, &recv_list, list) {
> > > -               if (iter->info.fsid == info.fsid &&
> > > -                   iter->info.number == info.number &&
> > > -                   iter->info.owner == info.owner) {
> > > -                       list_del_init(&iter->list);
> > > -                       memcpy(&iter->info, &info, sizeof(info));
> > > -                       if (iter->data)
> > > -                               do_callback = 1;
> > > -                       else
> > > -                               iter->done = 1;
> > > -                       op = iter;
> > > -                       break;
> > > +       if (info.wait) {
> >
.> > We should be able to use the same list_for_each_entry() loop for
> > F_SETLKW requests (which have info.wait set) as for all other requests
> > as far as I can see.
> >
>
> We can't match non F_SETLKW operations on all fields because F_GETLK
> will change some fields when it's handled in user space. This is the
> whole reason why the ordered handling is done here.

I know that F_GETLK uses the l_type field to indicate the outcome of
the operation. But that happens in dlm_posix_get() when processing the
reply from dlm_controld; it doesn't affect info.optype or any other
fields in struct dlm_plock_info. So we actually can compare all of the
key fields in struct dlm_plock_info.

> However there can be matched more fields but because F_GETLK we
> require that this mechanism works in the above mentioned ordered way.
> Those fields are checked by WARN_ON() that we get aware about changes
> and "things" doesn't work anymore as they should.
>
> > > +               list_for_each_entry(iter, &recv_list, list) {
> > > +                       if (iter->info.fsid == info.fsid &&
> > > +                           iter->info.number == info.number &&
> > > +                           iter->info.owner == info.owner &&
> > > +                           iter->info.pid == info.pid &&
> > > +                           iter->info.start == info.start &&
> > > +                           iter->info.end == info.end &&
> > > +                           iter->info.ex == info.ex &&
> > > +                           iter->info.wait) {
> > > +                               op = iter;
> > > +                               break;
> > > +                       }
> > >                 }
> > > +       } else {
> > > +               list_for_each_entry(iter, &recv_list, list) {
> > > +                       if (!iter->info.wait) {
> > > +                               op = iter;
> > > +                               break;
> > > +                       }
> > > +               }
> > > +       }
> > > +
> > > +       if (op) {
> > > +               list_del_init(&op->list);
> > > +               memcpy(&op->info, &info, sizeof(info));
> > > +               if (op->data)
> > > +                       do_callback = 1;
> > > +               else
> > > +                       op->done = 1;
> > >         }
> >
> > Can't this code just remain in the list_for_each_entry() loop?
> >
>
> It can, but we need two of them then in each loop because two loops
> are necessary (see above).

Well yes, my comment was based on the fact that there actually
shouldn't be two loops.

Andreas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-05-30 11:01       ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-05-30 14:06         ` Alexander Aring
  -1 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-30 14:06 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: teigland, cluster-devel, stable

Hi,

On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > <agruenba@redhat.com> wrote:
> > >
> > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > requests and fsid, number and owner are not enough to identify a result
> > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > this is not enough in case of using classic posix locks with threads and
> > > > open filedescriptor posix locks.
> > > >
> > > > The idea to fix the issue here is to place all lock request in order. In
> > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > the right plock op can be found by the first plock_op in recv_list which
> > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > send_list to recv_mlist) and write the result immediately back.
> > > >
> > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > get a result back in an random order. To avoid a collisions in cases
> > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > operations as the lock request is the same. This is also being made in
> > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > still can't find the exact lock request for a specific result if the
> > > > lock request is the same, but if this is the case we don't care the
> > > > order how the identical lock requests get their result back to grant the
> > > > lock.
> > >
> > > When the recv_list contains multiple indistinguishable requests, this
> > > can only be because they originated from multiple threads of the same
> > > process. In that case, I agree that it doesn't matter which of those
> > > requests we "complete" in dev_write() as long as we only complete one
> > > request. We do need to compare the additional request fields in
> > > dev_write() to find a suitable request, so that makes sense as well.
> > > We need to compare all of the fields that identify a request (optype,
> > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > "right" request (or in case there is more than one identical request,
> > > a "suitable" request).
> > >
> >
> > In my "definition" why this works is as you said the "identical
> > request". There is a more deeper definition of "when is a request
> > identical" and in my opinion it is here as: "A request A is identical
> > to request B when they get granted under the same 'time'" which is all
> > the fields you mentioned.
> >
> > Even with cancellation (F_SETLKW only) it does not matter which
> > "identical request" you cancel because the kernel and user
> > (dlm_controld) makes no relation between a lock request instance. You
> > need to have at least the same amount of "results" coming back from
> > user space as the amount you are waiting for a result for the same
> > "identical request".
>
> That's not incorrect per se, but cancellations create an additional
> difficulty: they can either succeed or fail. To indicate that a
> cancellation has succeeded, a new type of message can be introduced
> (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> belong to a locking request that is being cancelled. When cancelling a
> locking request fails, the kernel will see a "locking request granted"
> message though, and when multiple identical locking requests are
> queued and only some of them have been cancelled, it won't be obvious
> which locking request a "locking request granted" message should be
> assigned to anymore. You really don't want to mix things up in that
> case.
>
> This complication makes it a whole lot more difficult to reason about
> the correctness of the code. All that complexity is avoidable by
> sticking with a fixed mapping of requests and replies (i.e., a unique
> request identifier).
>
> To put it differently, you can shoot yourself in the foot and still
> hop along on the other leg, but it may not be the best of all possible
> ideas.
>

It makes things more complicated, I agree and the reason why this
works now is because there are a lot of "dependencies". I would love
to have an unique identifier to make it possible that we can follow an
instance handle of the original lock request.

* an unique identifier which also works with the async lock request of
lockd case.

> > > The above patch description doesn't match the code anymore, and the
> > > code doesn't fully revert the recv_list splitting of the previous
> > > version.
> > >
> >
> > This isn't a revert. Is it a new patch version, I did drop the
> > recv_setlkw_list here, dropping in means of removing the
> > recv_setlkw_list and handling everything in the recv_list. Although
> > there might be a performance impact by splitting the requests in two
> > lists as we don't need to jump over all F_SETLKW requests.
> >
> > > > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > > > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> > > >
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > > > ---
> > > > change since v2:
> > > >  - don't split recv_list into recv_setlkw_list
> > > >
> > > >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> > > >  1 file changed, 30 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > > index 31bc601ee3d8..53d17dbbb716 100644
> > > > --- a/fs/dlm/plock.c
> > > > +++ b/fs/dlm/plock.c
> > > > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> > > >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> > > >                         list_del(&op->list);
> > > >                 else
> > > > -                       list_move(&op->list, &recv_list);
> > > > +                       list_move_tail(&op->list, &recv_list);
> > >
> > > ^ This should be obsolete, but it won't hurt, either.
> > >
> >
> > No it is necessary, I tested it and looked deeper into the reason.
> > dlm_controld handles the lock requests in an ordered way over a
> > select() mechanism, but it will not always write a result back when
> > it's read the request out. This is the case for F_SETLKW but also for
> > all other plock op requests, such as F_GETLK. Instead of writing the
> > result back it will send it to corosync and the corosync select()
> > mechanism will write the result back. Corosync will keep the order to
> > write the result back. Due the fact that it's going through corosync
> > multiple non F_SETLKW can be queued up in recv_list and need to be
> > appended on the tail to later find the first entry which is non
> > F_SETLKW to find the result.
> >
> > This ordered lock request read and write the result back (for non
> > F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
> > always this way.
>
> This sounds pretty confused. Let's look at
>

As I said, yes it is a lot of specific handling of user space why this
is actually working.

> > > >                 memcpy(&info, &op->info, sizeof(info));
> > > >         }
> > > >         spin_unlock(&ops_lock);
> > > > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> > > >                 return -EINVAL;
> > > >
> > > >         spin_lock(&ops_lock);
> > > > -       list_for_each_entry(iter, &recv_list, list) {
> > > > -               if (iter->info.fsid == info.fsid &&
> > > > -                   iter->info.number == info.number &&
> > > > -                   iter->info.owner == info.owner) {
> > > > -                       list_del_init(&iter->list);
> > > > -                       memcpy(&iter->info, &info, sizeof(info));
> > > > -                       if (iter->data)
> > > > -                               do_callback = 1;
> > > > -                       else
> > > > -                               iter->done = 1;
> > > > -                       op = iter;
> > > > -                       break;
> > > > +       if (info.wait) {
> > >
> .> > We should be able to use the same list_for_each_entry() loop for
> > > F_SETLKW requests (which have info.wait set) as for all other requests
> > > as far as I can see.
> > >
> >
> > We can't match non F_SETLKW operations on all fields because F_GETLK
> > will change some fields when it's handled in user space. This is the
> > whole reason why the ordered handling is done here.
>
> I know that F_GETLK uses the l_type field to indicate the outcome of
> the operation. But that happens in dlm_posix_get() when processing the
> reply from dlm_controld; it doesn't affect info.optype or any other
> fields in struct dlm_plock_info. So we actually can compare all of the
> key fields in struct dlm_plock_info.
>

F_GETLK also uses start, end, etc. to tell the caller about the region
which is in conflict. The region which is in conflict is returned into
the result info struct. See [0] [1].
Is this more clear now that other fields are affected?

- Alex

[0] https://pagure.io/dlm/blob/main/f/dlm_controld/plock.c#_428
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/plock.c?h=v6.4-rc4#n359


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-30 14:06         ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-05-30 14:06 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > <agruenba@redhat.com> wrote:
> > >
> > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > requests and fsid, number and owner are not enough to identify a result
> > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > this is not enough in case of using classic posix locks with threads and
> > > > open filedescriptor posix locks.
> > > >
> > > > The idea to fix the issue here is to place all lock request in order. In
> > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > the right plock op can be found by the first plock_op in recv_list which
> > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > send_list to recv_mlist) and write the result immediately back.
> > > >
> > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > get a result back in an random order. To avoid a collisions in cases
> > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > operations as the lock request is the same. This is also being made in
> > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > still can't find the exact lock request for a specific result if the
> > > > lock request is the same, but if this is the case we don't care the
> > > > order how the identical lock requests get their result back to grant the
> > > > lock.
> > >
> > > When the recv_list contains multiple indistinguishable requests, this
> > > can only be because they originated from multiple threads of the same
> > > process. In that case, I agree that it doesn't matter which of those
> > > requests we "complete" in dev_write() as long as we only complete one
> > > request. We do need to compare the additional request fields in
> > > dev_write() to find a suitable request, so that makes sense as well.
> > > We need to compare all of the fields that identify a request (optype,
> > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > "right" request (or in case there is more than one identical request,
> > > a "suitable" request).
> > >
> >
> > In my "definition" why this works is as you said the "identical
> > request". There is a more deeper definition of "when is a request
> > identical" and in my opinion it is here as: "A request A is identical
> > to request B when they get granted under the same 'time'" which is all
> > the fields you mentioned.
> >
> > Even with cancellation (F_SETLKW only) it does not matter which
> > "identical request" you cancel because the kernel and user
> > (dlm_controld) makes no relation between a lock request instance. You
> > need to have at least the same amount of "results" coming back from
> > user space as the amount you are waiting for a result for the same
> > "identical request".
>
> That's not incorrect per se, but cancellations create an additional
> difficulty: they can either succeed or fail. To indicate that a
> cancellation has succeeded, a new type of message can be introduced
> (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> belong to a locking request that is being cancelled. When cancelling a
> locking request fails, the kernel will see a "locking request granted"
> message though, and when multiple identical locking requests are
> queued and only some of them have been cancelled, it won't be obvious
> which locking request a "locking request granted" message should be
> assigned to anymore. You really don't want to mix things up in that
> case.
>
> This complication makes it a whole lot more difficult to reason about
> the correctness of the code. All that complexity is avoidable by
> sticking with a fixed mapping of requests and replies (i.e., a unique
> request identifier).
>
> To put it differently, you can shoot yourself in the foot and still
> hop along on the other leg, but it may not be the best of all possible
> ideas.
>

It makes things more complicated, I agree and the reason why this
works now is because there are a lot of "dependencies". I would love
to have an unique identifier to make it possible that we can follow an
instance handle of the original lock request.

* an unique identifier which also works with the async lock request of
lockd case.

> > > The above patch description doesn't match the code anymore, and the
> > > code doesn't fully revert the recv_list splitting of the previous
> > > version.
> > >
> >
> > This isn't a revert. Is it a new patch version, I did drop the
> > recv_setlkw_list here, dropping in means of removing the
> > recv_setlkw_list and handling everything in the recv_list. Although
> > there might be a performance impact by splitting the requests in two
> > lists as we don't need to jump over all F_SETLKW requests.
> >
> > > > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > > > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> > > >
> > > > Cc: stable at vger.kernel.org
> > > > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > > > ---
> > > > change since v2:
> > > >  - don't split recv_list into recv_setlkw_list
> > > >
> > > >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> > > >  1 file changed, 30 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > > index 31bc601ee3d8..53d17dbbb716 100644
> > > > --- a/fs/dlm/plock.c
> > > > +++ b/fs/dlm/plock.c
> > > > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> > > >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> > > >                         list_del(&op->list);
> > > >                 else
> > > > -                       list_move(&op->list, &recv_list);
> > > > +                       list_move_tail(&op->list, &recv_list);
> > >
> > > ^ This should be obsolete, but it won't hurt, either.
> > >
> >
> > No it is necessary, I tested it and looked deeper into the reason.
> > dlm_controld handles the lock requests in an ordered way over a
> > select() mechanism, but it will not always write a result back when
> > it's read the request out. This is the case for F_SETLKW but also for
> > all other plock op requests, such as F_GETLK. Instead of writing the
> > result back it will send it to corosync and the corosync select()
> > mechanism will write the result back. Corosync will keep the order to
> > write the result back. Due the fact that it's going through corosync
> > multiple non F_SETLKW can be queued up in recv_list and need to be
> > appended on the tail to later find the first entry which is non
> > F_SETLKW to find the result.
> >
> > This ordered lock request read and write the result back (for non
> > F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
> > always this way.
>
> This sounds pretty confused. Let's look at
>

As I said, yes it is a lot of specific handling of user space why this
is actually working.

> > > >                 memcpy(&info, &op->info, sizeof(info));
> > > >         }
> > > >         spin_unlock(&ops_lock);
> > > > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> > > >                 return -EINVAL;
> > > >
> > > >         spin_lock(&ops_lock);
> > > > -       list_for_each_entry(iter, &recv_list, list) {
> > > > -               if (iter->info.fsid == info.fsid &&
> > > > -                   iter->info.number == info.number &&
> > > > -                   iter->info.owner == info.owner) {
> > > > -                       list_del_init(&iter->list);
> > > > -                       memcpy(&iter->info, &info, sizeof(info));
> > > > -                       if (iter->data)
> > > > -                               do_callback = 1;
> > > > -                       else
> > > > -                               iter->done = 1;
> > > > -                       op = iter;
> > > > -                       break;
> > > > +       if (info.wait) {
> > >
> .> > We should be able to use the same list_for_each_entry() loop for
> > > F_SETLKW requests (which have info.wait set) as for all other requests
> > > as far as I can see.
> > >
> >
> > We can't match non F_SETLKW operations on all fields because F_GETLK
> > will change some fields when it's handled in user space. This is the
> > whole reason why the ordered handling is done here.
>
> I know that F_GETLK uses the l_type field to indicate the outcome of
> the operation. But that happens in dlm_posix_get() when processing the
> reply from dlm_controld; it doesn't affect info.optype or any other
> fields in struct dlm_plock_info. So we actually can compare all of the
> key fields in struct dlm_plock_info.
>

F_GETLK also uses start, end, etc. to tell the caller about the region
which is in conflict. The region which is in conflict is returned into
the result info struct. See [0] [1].
Is this more clear now that other fields are affected?

- Alex

[0] https://pagure.io/dlm/blob/main/f/dlm_controld/plock.c#_428
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/plock.c?h=v6.4-rc4#n359


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-05-30 14:06         ` [Cluster-devel] " Alexander Aring
@ 2023-05-30 17:40           ` Andreas Gruenbacher
  -1 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-05-30 17:40 UTC (permalink / raw)
  To: Alexander Aring; +Cc: teigland, cluster-devel, stable

On Tue, May 30, 2023 at 4:08 PM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> > > Hi,
> > >
> > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > <agruenba@redhat.com> wrote:
> > > >
> > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > this is not enough in case of using classic posix locks with threads and
> > > > > open filedescriptor posix locks.
> > > > >
> > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > send_list to recv_mlist) and write the result immediately back.
> > > > >
> > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > operations as the lock request is the same. This is also being made in
> > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > still can't find the exact lock request for a specific result if the
> > > > > lock request is the same, but if this is the case we don't care the
> > > > > order how the identical lock requests get their result back to grant the
> > > > > lock.
> > > >
> > > > When the recv_list contains multiple indistinguishable requests, this
> > > > can only be because they originated from multiple threads of the same
> > > > process. In that case, I agree that it doesn't matter which of those
> > > > requests we "complete" in dev_write() as long as we only complete one
> > > > request. We do need to compare the additional request fields in
> > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > We need to compare all of the fields that identify a request (optype,
> > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > "right" request (or in case there is more than one identical request,
> > > > a "suitable" request).
> > > >
> > >
> > > In my "definition" why this works is as you said the "identical
> > > request". There is a more deeper definition of "when is a request
> > > identical" and in my opinion it is here as: "A request A is identical
> > > to request B when they get granted under the same 'time'" which is all
> > > the fields you mentioned.
> > >
> > > Even with cancellation (F_SETLKW only) it does not matter which
> > > "identical request" you cancel because the kernel and user
> > > (dlm_controld) makes no relation between a lock request instance. You
> > > need to have at least the same amount of "results" coming back from
> > > user space as the amount you are waiting for a result for the same
> > > "identical request".
> >
> > That's not incorrect per se, but cancellations create an additional
> > difficulty: they can either succeed or fail. To indicate that a
> > cancellation has succeeded, a new type of message can be introduced
> > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > belong to a locking request that is being cancelled. When cancelling a
> > locking request fails, the kernel will see a "locking request granted"
> > message though, and when multiple identical locking requests are
> > queued and only some of them have been cancelled, it won't be obvious
> > which locking request a "locking request granted" message should be
> > assigned to anymore. You really don't want to mix things up in that
> > case.
> >
> > This complication makes it a whole lot more difficult to reason about
> > the correctness of the code. All that complexity is avoidable by
> > sticking with a fixed mapping of requests and replies (i.e., a unique
> > request identifier).
> >
> > To put it differently, you can shoot yourself in the foot and still
> > hop along on the other leg, but it may not be the best of all possible
> > ideas.
> >
>
> It makes things more complicated, I agree and the reason why this
> works now is because there are a lot of "dependencies". I would love
> to have an unique identifier to make it possible that we can follow an
> instance handle of the original lock request.
>
> * an unique identifier which also works with the async lock request of
> lockd case.

What's the lockd case you're referring to here, and why is it relevant
for the problem at hand?

> > > > The above patch description doesn't match the code anymore, and the
> > > > code doesn't fully revert the recv_list splitting of the previous
> > > > version.
> > > >
> > >
> > > This isn't a revert. Is it a new patch version, I did drop the
> > > recv_setlkw_list here, dropping in means of removing the
> > > recv_setlkw_list and handling everything in the recv_list. Although
> > > there might be a performance impact by splitting the requests in two
> > > lists as we don't need to jump over all F_SETLKW requests.
> > >
> > > > > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > > > > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > > > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> > > > >
> > > > > Cc: stable@vger.kernel.org
> > > > > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > > > > ---
> > > > > change since v2:
> > > > >  - don't split recv_list into recv_setlkw_list
> > > > >
> > > > >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> > > > >  1 file changed, 30 insertions(+), 13 deletions(-)
> > > > >
> > > > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > > > index 31bc601ee3d8..53d17dbbb716 100644
> > > > > --- a/fs/dlm/plock.c
> > > > > +++ b/fs/dlm/plock.c
> > > > > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> > > > >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> > > > >                         list_del(&op->list);
> > > > >                 else
> > > > > -                       list_move(&op->list, &recv_list);
> > > > > +                       list_move_tail(&op->list, &recv_list);
> > > >
> > > > ^ This should be obsolete, but it won't hurt, either.
> > > >
> > >
> > > No it is necessary, I tested it and looked deeper into the reason.
> > > dlm_controld handles the lock requests in an ordered way over a
> > > select() mechanism, but it will not always write a result back when
> > > it's read the request out. This is the case for F_SETLKW but also for
> > > all other plock op requests, such as F_GETLK. Instead of writing the
> > > result back it will send it to corosync and the corosync select()
> > > mechanism will write the result back. Corosync will keep the order to
> > > write the result back. Due the fact that it's going through corosync
> > > multiple non F_SETLKW can be queued up in recv_list and need to be
> > > appended on the tail to later find the first entry which is non
> > > F_SETLKW to find the result.
> > >
> > > This ordered lock request read and write the result back (for non
> > > F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
> > > always this way.
> >
> > This sounds pretty confused. Let's look at
> >
>
> As I said, yes it is a lot of specific handling of user space why this
> is actually working.
>
> > > > >                 memcpy(&info, &op->info, sizeof(info));
> > > > >         }
> > > > >         spin_unlock(&ops_lock);
> > > > > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> > > > >                 return -EINVAL;
> > > > >
> > > > >         spin_lock(&ops_lock);
> > > > > -       list_for_each_entry(iter, &recv_list, list) {
> > > > > -               if (iter->info.fsid == info.fsid &&
> > > > > -                   iter->info.number == info.number &&
> > > > > -                   iter->info.owner == info.owner) {
> > > > > -                       list_del_init(&iter->list);
> > > > > -                       memcpy(&iter->info, &info, sizeof(info));
> > > > > -                       if (iter->data)
> > > > > -                               do_callback = 1;
> > > > > -                       else
> > > > > -                               iter->done = 1;
> > > > > -                       op = iter;
> > > > > -                       break;
> > > > > +       if (info.wait) {
> > > >
> > .> > We should be able to use the same list_for_each_entry() loop for
> > > > F_SETLKW requests (which have info.wait set) as for all other requests
> > > > as far as I can see.
> > > >
> > >
> > > We can't match non F_SETLKW operations on all fields because F_GETLK
> > > will change some fields when it's handled in user space. This is the
> > > whole reason why the ordered handling is done here.
> >
> > I know that F_GETLK uses the l_type field to indicate the outcome of
> > the operation. But that happens in dlm_posix_get() when processing the
> > reply from dlm_controld; it doesn't affect info.optype or any other
> > fields in struct dlm_plock_info. So we actually can compare all of the
> > key fields in struct dlm_plock_info.
> >
>
> F_GETLK also uses start, end, etc. to tell the caller about the region
> which is in conflict. The region which is in conflict is returned into
> the result info struct. See [0] [1].
> Is this more clear now that other fields are affected?

Ah, that sucks.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-05-30 17:40           ` Andreas Gruenbacher
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-05-30 17:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 30, 2023 at 4:08?PM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > > Hi,
> > >
> > > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > > <agruenba@redhat.com> wrote:
> > > >
> > > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > this is not enough in case of using classic posix locks with threads and
> > > > > open filedescriptor posix locks.
> > > > >
> > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > send_list to recv_mlist) and write the result immediately back.
> > > > >
> > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > operations as the lock request is the same. This is also being made in
> > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > still can't find the exact lock request for a specific result if the
> > > > > lock request is the same, but if this is the case we don't care the
> > > > > order how the identical lock requests get their result back to grant the
> > > > > lock.
> > > >
> > > > When the recv_list contains multiple indistinguishable requests, this
> > > > can only be because they originated from multiple threads of the same
> > > > process. In that case, I agree that it doesn't matter which of those
> > > > requests we "complete" in dev_write() as long as we only complete one
> > > > request. We do need to compare the additional request fields in
> > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > We need to compare all of the fields that identify a request (optype,
> > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > "right" request (or in case there is more than one identical request,
> > > > a "suitable" request).
> > > >
> > >
> > > In my "definition" why this works is as you said the "identical
> > > request". There is a more deeper definition of "when is a request
> > > identical" and in my opinion it is here as: "A request A is identical
> > > to request B when they get granted under the same 'time'" which is all
> > > the fields you mentioned.
> > >
> > > Even with cancellation (F_SETLKW only) it does not matter which
> > > "identical request" you cancel because the kernel and user
> > > (dlm_controld) makes no relation between a lock request instance. You
> > > need to have at least the same amount of "results" coming back from
> > > user space as the amount you are waiting for a result for the same
> > > "identical request".
> >
> > That's not incorrect per se, but cancellations create an additional
> > difficulty: they can either succeed or fail. To indicate that a
> > cancellation has succeeded, a new type of message can be introduced
> > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > belong to a locking request that is being cancelled. When cancelling a
> > locking request fails, the kernel will see a "locking request granted"
> > message though, and when multiple identical locking requests are
> > queued and only some of them have been cancelled, it won't be obvious
> > which locking request a "locking request granted" message should be
> > assigned to anymore. You really don't want to mix things up in that
> > case.
> >
> > This complication makes it a whole lot more difficult to reason about
> > the correctness of the code. All that complexity is avoidable by
> > sticking with a fixed mapping of requests and replies (i.e., a unique
> > request identifier).
> >
> > To put it differently, you can shoot yourself in the foot and still
> > hop along on the other leg, but it may not be the best of all possible
> > ideas.
> >
>
> It makes things more complicated, I agree and the reason why this
> works now is because there are a lot of "dependencies". I would love
> to have an unique identifier to make it possible that we can follow an
> instance handle of the original lock request.
>
> * an unique identifier which also works with the async lock request of
> lockd case.

What's the lockd case you're referring to here, and why is it relevant
for the problem at hand?

> > > > The above patch description doesn't match the code anymore, and the
> > > > code doesn't fully revert the recv_list splitting of the previous
> > > > version.
> > > >
> > >
> > > This isn't a revert. Is it a new patch version, I did drop the
> > > recv_setlkw_list here, dropping in means of removing the
> > > recv_setlkw_list and handling everything in the recv_list. Although
> > > there might be a performance impact by splitting the requests in two
> > > lists as we don't need to jump over all F_SETLKW requests.
> > >
> > > > > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > > > > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > > > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> > > > >
> > > > > Cc: stable at vger.kernel.org
> > > > > Signed-off-by: Alexander Aring <aahringo@redhat.com>
> > > > > ---
> > > > > change since v2:
> > > > >  - don't split recv_list into recv_setlkw_list
> > > > >
> > > > >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> > > > >  1 file changed, 30 insertions(+), 13 deletions(-)
> > > > >
> > > > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > > > index 31bc601ee3d8..53d17dbbb716 100644
> > > > > --- a/fs/dlm/plock.c
> > > > > +++ b/fs/dlm/plock.c
> > > > > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> > > > >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> > > > >                         list_del(&op->list);
> > > > >                 else
> > > > > -                       list_move(&op->list, &recv_list);
> > > > > +                       list_move_tail(&op->list, &recv_list);
> > > >
> > > > ^ This should be obsolete, but it won't hurt, either.
> > > >
> > >
> > > No it is necessary, I tested it and looked deeper into the reason.
> > > dlm_controld handles the lock requests in an ordered way over a
> > > select() mechanism, but it will not always write a result back when
> > > it's read the request out. This is the case for F_SETLKW but also for
> > > all other plock op requests, such as F_GETLK. Instead of writing the
> > > result back it will send it to corosync and the corosync select()
> > > mechanism will write the result back. Corosync will keep the order to
> > > write the result back. Due the fact that it's going through corosync
> > > multiple non F_SETLKW can be queued up in recv_list and need to be
> > > appended on the tail to later find the first entry which is non
> > > F_SETLKW to find the result.
> > >
> > > This ordered lock request read and write the result back (for non
> > > F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
> > > always this way.
> >
> > This sounds pretty confused. Let's look at
> >
>
> As I said, yes it is a lot of specific handling of user space why this
> is actually working.
>
> > > > >                 memcpy(&info, &op->info, sizeof(info));
> > > > >         }
> > > > >         spin_unlock(&ops_lock);
> > > > > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> > > > >                 return -EINVAL;
> > > > >
> > > > >         spin_lock(&ops_lock);
> > > > > -       list_for_each_entry(iter, &recv_list, list) {
> > > > > -               if (iter->info.fsid == info.fsid &&
> > > > > -                   iter->info.number == info.number &&
> > > > > -                   iter->info.owner == info.owner) {
> > > > > -                       list_del_init(&iter->list);
> > > > > -                       memcpy(&iter->info, &info, sizeof(info));
> > > > > -                       if (iter->data)
> > > > > -                               do_callback = 1;
> > > > > -                       else
> > > > > -                               iter->done = 1;
> > > > > -                       op = iter;
> > > > > -                       break;
> > > > > +       if (info.wait) {
> > > >
> > .> > We should be able to use the same list_for_each_entry() loop for
> > > > F_SETLKW requests (which have info.wait set) as for all other requests
> > > > as far as I can see.
> > > >
> > >
> > > We can't match non F_SETLKW operations on all fields because F_GETLK
> > > will change some fields when it's handled in user space. This is the
> > > whole reason why the ordered handling is done here.
> >
> > I know that F_GETLK uses the l_type field to indicate the outcome of
> > the operation. But that happens in dlm_posix_get() when processing the
> > reply from dlm_controld; it doesn't affect info.optype or any other
> > fields in struct dlm_plock_info. So we actually can compare all of the
> > key fields in struct dlm_plock_info.
> >
>
> F_GETLK also uses start, end, etc. to tell the caller about the region
> which is in conflict. The region which is in conflict is returned into
> the result info struct. See [0] [1].
> Is this more clear now that other fields are affected?

Ah, that sucks.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-05-30 17:40           ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-06-01 16:28             ` Alexander Aring
  -1 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-06-01 16:28 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: teigland, cluster-devel, stable

Hi,


On Tue, May 30, 2023 at 1:40 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Tue, May 30, 2023 at 4:08 PM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > > <agruenba@redhat.com> wrote:
> > > > >
> > > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > open filedescriptor posix locks.
> > > > > >
> > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > >
> > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > operations as the lock request is the same. This is also being made in
> > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > still can't find the exact lock request for a specific result if the
> > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > order how the identical lock requests get their result back to grant the
> > > > > > lock.
> > > > >
> > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > can only be because they originated from multiple threads of the same
> > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > request. We do need to compare the additional request fields in
> > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > We need to compare all of the fields that identify a request (optype,
> > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > "right" request (or in case there is more than one identical request,
> > > > > a "suitable" request).
> > > > >
> > > >
> > > > In my "definition" why this works is as you said the "identical
> > > > request". There is a more deeper definition of "when is a request
> > > > identical" and in my opinion it is here as: "A request A is identical
> > > > to request B when they get granted under the same 'time'" which is all
> > > > the fields you mentioned.
> > > >
> > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > "identical request" you cancel because the kernel and user
> > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > need to have at least the same amount of "results" coming back from
> > > > user space as the amount you are waiting for a result for the same
> > > > "identical request".
> > >
> > > That's not incorrect per se, but cancellations create an additional
> > > difficulty: they can either succeed or fail. To indicate that a
> > > cancellation has succeeded, a new type of message can be introduced
> > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > belong to a locking request that is being cancelled. When cancelling a
> > > locking request fails, the kernel will see a "locking request granted"
> > > message though, and when multiple identical locking requests are
> > > queued and only some of them have been cancelled, it won't be obvious
> > > which locking request a "locking request granted" message should be
> > > assigned to anymore. You really don't want to mix things up in that
> > > case.
> > >
> > > This complication makes it a whole lot more difficult to reason about
> > > the correctness of the code. All that complexity is avoidable by
> > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > request identifier).
> > >
> > > To put it differently, you can shoot yourself in the foot and still
> > > hop along on the other leg, but it may not be the best of all possible
> > > ideas.
> > >
> >
> > It makes things more complicated, I agree and the reason why this
> > works now is because there are a lot of "dependencies". I would love
> > to have an unique identifier to make it possible that we can follow an
> > instance handle of the original lock request.
> >
> > * an unique identifier which also works with the async lock request of
> > lockd case.
>
> What's the lockd case you're referring to here, and why is it relevant
> for the problem at hand?

just mentioned that we need a solution which also works for the
asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
user lockd. [0] DLM plock handling implements the behaviour mentioned
at [0] but lm_grant() callback can also return negative values and
signals that the lock request was cancelled (on nfs layer) and then
need to tell it DLM. This however is not supported as we have a lack
of cancellation.

So far I don't see any issues with the current solution which this
patch is showing and the async lock request case.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-06-01 16:28             ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-06-01 16:28 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On Tue, May 30, 2023 at 1:40?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Tue, May 30, 2023 at 4:08?PM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > > > <agruenba@redhat.com> wrote:
> > > > >
> > > > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > open filedescriptor posix locks.
> > > > > >
> > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > >
> > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > operations as the lock request is the same. This is also being made in
> > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > still can't find the exact lock request for a specific result if the
> > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > order how the identical lock requests get their result back to grant the
> > > > > > lock.
> > > > >
> > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > can only be because they originated from multiple threads of the same
> > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > request. We do need to compare the additional request fields in
> > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > We need to compare all of the fields that identify a request (optype,
> > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > "right" request (or in case there is more than one identical request,
> > > > > a "suitable" request).
> > > > >
> > > >
> > > > In my "definition" why this works is as you said the "identical
> > > > request". There is a more deeper definition of "when is a request
> > > > identical" and in my opinion it is here as: "A request A is identical
> > > > to request B when they get granted under the same 'time'" which is all
> > > > the fields you mentioned.
> > > >
> > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > "identical request" you cancel because the kernel and user
> > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > need to have at least the same amount of "results" coming back from
> > > > user space as the amount you are waiting for a result for the same
> > > > "identical request".
> > >
> > > That's not incorrect per se, but cancellations create an additional
> > > difficulty: they can either succeed or fail. To indicate that a
> > > cancellation has succeeded, a new type of message can be introduced
> > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > belong to a locking request that is being cancelled. When cancelling a
> > > locking request fails, the kernel will see a "locking request granted"
> > > message though, and when multiple identical locking requests are
> > > queued and only some of them have been cancelled, it won't be obvious
> > > which locking request a "locking request granted" message should be
> > > assigned to anymore. You really don't want to mix things up in that
> > > case.
> > >
> > > This complication makes it a whole lot more difficult to reason about
> > > the correctness of the code. All that complexity is avoidable by
> > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > request identifier).
> > >
> > > To put it differently, you can shoot yourself in the foot and still
> > > hop along on the other leg, but it may not be the best of all possible
> > > ideas.
> > >
> >
> > It makes things more complicated, I agree and the reason why this
> > works now is because there are a lot of "dependencies". I would love
> > to have an unique identifier to make it possible that we can follow an
> > instance handle of the original lock request.
> >
> > * an unique identifier which also works with the async lock request of
> > lockd case.
>
> What's the lockd case you're referring to here, and why is it relevant
> for the problem at hand?

just mentioned that we need a solution which also works for the
asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
user lockd. [0] DLM plock handling implements the behaviour mentioned
at [0] but lm_grant() callback can also return negative values and
signals that the lock request was cancelled (on nfs layer) and then
need to tell it DLM. This however is not supported as we have a lack
of cancellation.

So far I don't see any issues with the current solution which this
patch is showing and the async lock request case.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-06-01 16:28             ` [Cluster-devel] " Alexander Aring
@ 2023-06-01 17:10               ` Andreas Gruenbacher
  -1 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-06-01 17:10 UTC (permalink / raw)
  To: Alexander Aring; +Cc: teigland, cluster-devel, stable

On Thu, Jun 1, 2023 at 6:28 PM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Tue, May 30, 2023 at 1:40 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > On Tue, May 30, 2023 at 4:08 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > Hi,
> > >
> > > On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > >
> > > > On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > > > <agruenba@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > open filedescriptor posix locks.
> > > > > > >
> > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > >
> > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > lock.
> > > > > >
> > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > can only be because they originated from multiple threads of the same
> > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > request. We do need to compare the additional request fields in
> > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > "right" request (or in case there is more than one identical request,
> > > > > > a "suitable" request).
> > > > > >
> > > > >
> > > > > In my "definition" why this works is as you said the "identical
> > > > > request". There is a more deeper definition of "when is a request
> > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > to request B when they get granted under the same 'time'" which is all
> > > > > the fields you mentioned.
> > > > >
> > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > "identical request" you cancel because the kernel and user
> > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > need to have at least the same amount of "results" coming back from
> > > > > user space as the amount you are waiting for a result for the same
> > > > > "identical request".
> > > >
> > > > That's not incorrect per se, but cancellations create an additional
> > > > difficulty: they can either succeed or fail. To indicate that a
> > > > cancellation has succeeded, a new type of message can be introduced
> > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > belong to a locking request that is being cancelled. When cancelling a
> > > > locking request fails, the kernel will see a "locking request granted"
> > > > message though, and when multiple identical locking requests are
> > > > queued and only some of them have been cancelled, it won't be obvious
> > > > which locking request a "locking request granted" message should be
> > > > assigned to anymore. You really don't want to mix things up in that
> > > > case.
> > > >
> > > > This complication makes it a whole lot more difficult to reason about
> > > > the correctness of the code. All that complexity is avoidable by
> > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > request identifier).
> > > >
> > > > To put it differently, you can shoot yourself in the foot and still
> > > > hop along on the other leg, but it may not be the best of all possible
> > > > ideas.
> > > >
> > >
> > > It makes things more complicated, I agree and the reason why this
> > > works now is because there are a lot of "dependencies". I would love
> > > to have an unique identifier to make it possible that we can follow an
> > > instance handle of the original lock request.
> > >
> > > * an unique identifier which also works with the async lock request of
> > > lockd case.
> >
> > What's the lockd case you're referring to here, and why is it relevant
> > for the problem at hand?
>
> just mentioned that we need a solution which also works for the
> asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> user lockd. [0] DLM plock handling implements the behaviour mentioned
> at [0] but lm_grant() callback can also return negative values and
> signals that the lock request was cancelled (on nfs layer) and then
> need to tell it DLM. This however is not supported as we have a lack
> of cancellation.

Ouch, that's a bit messy. But if the vfs_lock_file() description is
accurate, then only F_SETLK requests arriving via lockd can be
asynchronous, and F_SETLKW requests never are asynchronous. And we
only need to cancel F_SETLKW requests. It follows that we only ever
need to cancel synchronous requests.

Andreas

> So far I don't see any issues with the current solution which this
> patch is showing and the async lock request case.
>
> - Alex
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-06-01 17:10               ` Andreas Gruenbacher
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-06-01 17:10 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Jun 1, 2023 at 6:28?PM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Tue, May 30, 2023 at 1:40?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > On Tue, May 30, 2023 at 4:08?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > Hi,
> > >
> > > On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > >
> > > > On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > > > > <agruenba@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > open filedescriptor posix locks.
> > > > > > >
> > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > >
> > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > lock.
> > > > > >
> > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > can only be because they originated from multiple threads of the same
> > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > request. We do need to compare the additional request fields in
> > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > "right" request (or in case there is more than one identical request,
> > > > > > a "suitable" request).
> > > > > >
> > > > >
> > > > > In my "definition" why this works is as you said the "identical
> > > > > request". There is a more deeper definition of "when is a request
> > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > to request B when they get granted under the same 'time'" which is all
> > > > > the fields you mentioned.
> > > > >
> > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > "identical request" you cancel because the kernel and user
> > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > need to have at least the same amount of "results" coming back from
> > > > > user space as the amount you are waiting for a result for the same
> > > > > "identical request".
> > > >
> > > > That's not incorrect per se, but cancellations create an additional
> > > > difficulty: they can either succeed or fail. To indicate that a
> > > > cancellation has succeeded, a new type of message can be introduced
> > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > belong to a locking request that is being cancelled. When cancelling a
> > > > locking request fails, the kernel will see a "locking request granted"
> > > > message though, and when multiple identical locking requests are
> > > > queued and only some of them have been cancelled, it won't be obvious
> > > > which locking request a "locking request granted" message should be
> > > > assigned to anymore. You really don't want to mix things up in that
> > > > case.
> > > >
> > > > This complication makes it a whole lot more difficult to reason about
> > > > the correctness of the code. All that complexity is avoidable by
> > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > request identifier).
> > > >
> > > > To put it differently, you can shoot yourself in the foot and still
> > > > hop along on the other leg, but it may not be the best of all possible
> > > > ideas.
> > > >
> > >
> > > It makes things more complicated, I agree and the reason why this
> > > works now is because there are a lot of "dependencies". I would love
> > > to have an unique identifier to make it possible that we can follow an
> > > instance handle of the original lock request.
> > >
> > > * an unique identifier which also works with the async lock request of
> > > lockd case.
> >
> > What's the lockd case you're referring to here, and why is it relevant
> > for the problem at hand?
>
> just mentioned that we need a solution which also works for the
> asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> user lockd. [0] DLM plock handling implements the behaviour mentioned
> at [0] but lm_grant() callback can also return negative values and
> signals that the lock request was cancelled (on nfs layer) and then
> need to tell it DLM. This however is not supported as we have a lack
> of cancellation.

Ouch, that's a bit messy. But if the vfs_lock_file() description is
accurate, then only F_SETLK requests arriving via lockd can be
asynchronous, and F_SETLKW requests never are asynchronous. And we
only need to cancel F_SETLKW requests. It follows that we only ever
need to cancel synchronous requests.

Andreas

> So far I don't see any issues with the current solution which this
> patch is showing and the async lock request case.
>
> - Alex
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-06-01 17:10               ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-06-01 19:09                 ` Alexander Aring
  -1 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-06-01 19:09 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: teigland, cluster-devel, stable

Hi,

On Thu, Jun 1, 2023 at 1:11 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Thu, Jun 1, 2023 at 6:28 PM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Tue, May 30, 2023 at 1:40 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > On Tue, May 30, 2023 at 4:08 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > > > > <agruenba@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > > open filedescriptor posix locks.
> > > > > > > >
> > > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > > >
> > > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > > lock.
> > > > > > >
> > > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > > can only be because they originated from multiple threads of the same
> > > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > > request. We do need to compare the additional request fields in
> > > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > > "right" request (or in case there is more than one identical request,
> > > > > > > a "suitable" request).
> > > > > > >
> > > > > >
> > > > > > In my "definition" why this works is as you said the "identical
> > > > > > request". There is a more deeper definition of "when is a request
> > > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > > to request B when they get granted under the same 'time'" which is all
> > > > > > the fields you mentioned.
> > > > > >
> > > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > > "identical request" you cancel because the kernel and user
> > > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > > need to have at least the same amount of "results" coming back from
> > > > > > user space as the amount you are waiting for a result for the same
> > > > > > "identical request".
> > > > >
> > > > > That's not incorrect per se, but cancellations create an additional
> > > > > difficulty: they can either succeed or fail. To indicate that a
> > > > > cancellation has succeeded, a new type of message can be introduced
> > > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > > belong to a locking request that is being cancelled. When cancelling a
> > > > > locking request fails, the kernel will see a "locking request granted"
> > > > > message though, and when multiple identical locking requests are
> > > > > queued and only some of them have been cancelled, it won't be obvious
> > > > > which locking request a "locking request granted" message should be
> > > > > assigned to anymore. You really don't want to mix things up in that
> > > > > case.
> > > > >
> > > > > This complication makes it a whole lot more difficult to reason about
> > > > > the correctness of the code. All that complexity is avoidable by
> > > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > > request identifier).
> > > > >
> > > > > To put it differently, you can shoot yourself in the foot and still
> > > > > hop along on the other leg, but it may not be the best of all possible
> > > > > ideas.
> > > > >
> > > >
> > > > It makes things more complicated, I agree and the reason why this
> > > > works now is because there are a lot of "dependencies". I would love
> > > > to have an unique identifier to make it possible that we can follow an
> > > > instance handle of the original lock request.
> > > >
> > > > * an unique identifier which also works with the async lock request of
> > > > lockd case.
> > >
> > > What's the lockd case you're referring to here, and why is it relevant
> > > for the problem at hand?
> >
> > just mentioned that we need a solution which also works for the
> > asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> > user lockd. [0] DLM plock handling implements the behaviour mentioned
> > at [0] but lm_grant() callback can also return negative values and
> > signals that the lock request was cancelled (on nfs layer) and then
> > need to tell it DLM. This however is not supported as we have a lack
> > of cancellation.
>
> Ouch, that's a bit messy. But if the vfs_lock_file() description is
> accurate, then only F_SETLK requests arriving via lockd can be
> asynchronous, and F_SETLKW requests never are asynchronous. And we
> only need to cancel F_SETLKW requests. It follows that we only ever
> need to cancel synchronous requests.
>

I looked into the code and I think the second sentence of [0] is
important regarding to F_SLEEP "Callers expecting ->lock() to return
asynchronously will only use F_SETLK, not F_SETLKW; they will set
FL_SLEEP if (and only if) the request is for a blocking lock.".
If lockd does a lock request, it will then set args.wait to 1 if it
was F_SETLKW [1]. The receiver will then create a blocking request [2]
and set FL_SLEEP [3]; it does unset FL_SLEEP when args.wait (now as
wait parameter) wasn't set. There is a case of [5] which unset wait
again, but it seems we can end with FL_SLEEP there anyway.

I think we have currently an issue here that we handle F_SETLK when
F_SLEEP (in case of async lockd request) is handled as trylock which
isn't right. Don't ask me why they are going over F_SLEEP and F_SETLK
and not just using F_SETLKW to signal that it is a blocking request.
So we actually have the F_SETLKW case but it's just signaled with
F_SETLK + FL_SLEEP?

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/clntproc.c?h=v6.4-rc4#n186
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n501
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n240
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n535
[5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n489


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-06-01 19:09                 ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-06-01 19:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Thu, Jun 1, 2023 at 1:11?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Thu, Jun 1, 2023 at 6:28?PM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Tue, May 30, 2023 at 1:40?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > On Tue, May 30, 2023 at 4:08?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > > > > > <agruenba@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > > open filedescriptor posix locks.
> > > > > > > >
> > > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > > >
> > > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > > lock.
> > > > > > >
> > > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > > can only be because they originated from multiple threads of the same
> > > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > > request. We do need to compare the additional request fields in
> > > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > > "right" request (or in case there is more than one identical request,
> > > > > > > a "suitable" request).
> > > > > > >
> > > > > >
> > > > > > In my "definition" why this works is as you said the "identical
> > > > > > request". There is a more deeper definition of "when is a request
> > > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > > to request B when they get granted under the same 'time'" which is all
> > > > > > the fields you mentioned.
> > > > > >
> > > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > > "identical request" you cancel because the kernel and user
> > > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > > need to have at least the same amount of "results" coming back from
> > > > > > user space as the amount you are waiting for a result for the same
> > > > > > "identical request".
> > > > >
> > > > > That's not incorrect per se, but cancellations create an additional
> > > > > difficulty: they can either succeed or fail. To indicate that a
> > > > > cancellation has succeeded, a new type of message can be introduced
> > > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > > belong to a locking request that is being cancelled. When cancelling a
> > > > > locking request fails, the kernel will see a "locking request granted"
> > > > > message though, and when multiple identical locking requests are
> > > > > queued and only some of them have been cancelled, it won't be obvious
> > > > > which locking request a "locking request granted" message should be
> > > > > assigned to anymore. You really don't want to mix things up in that
> > > > > case.
> > > > >
> > > > > This complication makes it a whole lot more difficult to reason about
> > > > > the correctness of the code. All that complexity is avoidable by
> > > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > > request identifier).
> > > > >
> > > > > To put it differently, you can shoot yourself in the foot and still
> > > > > hop along on the other leg, but it may not be the best of all possible
> > > > > ideas.
> > > > >
> > > >
> > > > It makes things more complicated, I agree and the reason why this
> > > > works now is because there are a lot of "dependencies". I would love
> > > > to have an unique identifier to make it possible that we can follow an
> > > > instance handle of the original lock request.
> > > >
> > > > * an unique identifier which also works with the async lock request of
> > > > lockd case.
> > >
> > > What's the lockd case you're referring to here, and why is it relevant
> > > for the problem at hand?
> >
> > just mentioned that we need a solution which also works for the
> > asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> > user lockd. [0] DLM plock handling implements the behaviour mentioned
> > at [0] but lm_grant() callback can also return negative values and
> > signals that the lock request was cancelled (on nfs layer) and then
> > need to tell it DLM. This however is not supported as we have a lack
> > of cancellation.
>
> Ouch, that's a bit messy. But if the vfs_lock_file() description is
> accurate, then only F_SETLK requests arriving via lockd can be
> asynchronous, and F_SETLKW requests never are asynchronous. And we
> only need to cancel F_SETLKW requests. It follows that we only ever
> need to cancel synchronous requests.
>

I looked into the code and I think the second sentence of [0] is
important regarding to F_SLEEP "Callers expecting ->lock() to return
asynchronously will only use F_SETLK, not F_SETLKW; they will set
FL_SLEEP if (and only if) the request is for a blocking lock.".
If lockd does a lock request, it will then set args.wait to 1 if it
was F_SETLKW [1]. The receiver will then create a blocking request [2]
and set FL_SLEEP [3]; it does unset FL_SLEEP when args.wait (now as
wait parameter) wasn't set. There is a case of [5] which unset wait
again, but it seems we can end with FL_SLEEP there anyway.

I think we have currently an issue here that we handle F_SETLK when
F_SLEEP (in case of async lockd request) is handled as trylock which
isn't right. Don't ask me why they are going over F_SLEEP and F_SETLK
and not just using F_SETLKW to signal that it is a blocking request.
So we actually have the F_SETLKW case but it's just signaled with
F_SETLK + FL_SLEEP?

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/clntproc.c?h=v6.4-rc4#n186
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n501
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n240
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n535
[5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n489


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-06-01 19:09                 ` [Cluster-devel] " Alexander Aring
@ 2023-06-03 10:02                   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-06-03 10:02 UTC (permalink / raw)
  To: Alexander Aring; +Cc: teigland, cluster-devel, stable

On Thu, Jun 1, 2023 at 9:10 PM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Thu, Jun 1, 2023 at 1:11 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > On Thu, Jun 1, 2023 at 6:28 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > Hi,
> > >
> > > On Tue, May 30, 2023 at 1:40 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > >
> > > > On Tue, May 30, 2023 at 4:08 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > Hi,
> > > > >
> > > > > On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > > > > > <agruenba@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > > > open filedescriptor posix locks.
> > > > > > > > >
> > > > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > > > >
> > > > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > > > lock.
> > > > > > > >
> > > > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > > > can only be because they originated from multiple threads of the same
> > > > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > > > request. We do need to compare the additional request fields in
> > > > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > > > "right" request (or in case there is more than one identical request,
> > > > > > > > a "suitable" request).
> > > > > > > >
> > > > > > >
> > > > > > > In my "definition" why this works is as you said the "identical
> > > > > > > request". There is a more deeper definition of "when is a request
> > > > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > > > to request B when they get granted under the same 'time'" which is all
> > > > > > > the fields you mentioned.
> > > > > > >
> > > > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > > > "identical request" you cancel because the kernel and user
> > > > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > > > need to have at least the same amount of "results" coming back from
> > > > > > > user space as the amount you are waiting for a result for the same
> > > > > > > "identical request".
> > > > > >
> > > > > > That's not incorrect per se, but cancellations create an additional
> > > > > > difficulty: they can either succeed or fail. To indicate that a
> > > > > > cancellation has succeeded, a new type of message can be introduced
> > > > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > > > belong to a locking request that is being cancelled. When cancelling a
> > > > > > locking request fails, the kernel will see a "locking request granted"
> > > > > > message though, and when multiple identical locking requests are
> > > > > > queued and only some of them have been cancelled, it won't be obvious
> > > > > > which locking request a "locking request granted" message should be
> > > > > > assigned to anymore. You really don't want to mix things up in that
> > > > > > case.
> > > > > >
> > > > > > This complication makes it a whole lot more difficult to reason about
> > > > > > the correctness of the code. All that complexity is avoidable by
> > > > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > > > request identifier).
> > > > > >
> > > > > > To put it differently, you can shoot yourself in the foot and still
> > > > > > hop along on the other leg, but it may not be the best of all possible
> > > > > > ideas.
> > > > > >
> > > > >
> > > > > It makes things more complicated, I agree and the reason why this
> > > > > works now is because there are a lot of "dependencies". I would love
> > > > > to have an unique identifier to make it possible that we can follow an
> > > > > instance handle of the original lock request.
> > > > >
> > > > > * an unique identifier which also works with the async lock request of
> > > > > lockd case.
> > > >
> > > > What's the lockd case you're referring to here, and why is it relevant
> > > > for the problem at hand?
> > >
> > > just mentioned that we need a solution which also works for the
> > > asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> > > user lockd. [0] DLM plock handling implements the behaviour mentioned
> > > at [0] but lm_grant() callback can also return negative values and
> > > signals that the lock request was cancelled (on nfs layer) and then
> > > need to tell it DLM. This however is not supported as we have a lack
> > > of cancellation.
> >
> > Ouch, that's a bit messy. But if the vfs_lock_file() description is
> > accurate, then only F_SETLK requests arriving via lockd can be
> > asynchronous, and F_SETLKW requests never are asynchronous. And we
> > only need to cancel F_SETLKW requests. It follows that we only ever
> > need to cancel synchronous requests.
> >
>
> I looked into the code and I think the second sentence of [0] is
> important regarding to F_SLEEP "Callers expecting ->lock() to return
> asynchronously will only use F_SETLK, not F_SETLKW; they will set
> FL_SLEEP if (and only if) the request is for a blocking lock.".
> If lockd does a lock request, it will then set args.wait to 1 if it
> was F_SETLKW [1].

Hmm, this sets args.block?

> The receiver will then create a blocking request [2]
> and set FL_SLEEP [3]; it does unset FL_SLEEP when args.wait (now as
> wait parameter) wasn't set. There is a case of [5] which unset wait
> again, but it seems we can end with FL_SLEEP there anyway.
>
> I think we have currently an issue here that we handle F_SETLK when
> F_SLEEP (in case of async lockd request) is handled as trylock which
> isn't right. Don't ask me why they are going over F_SLEEP and F_SETLK
> and not just using F_SETLKW to signal that it is a blocking request.
> So we actually have the F_SETLKW case but it's just signaled with
> F_SETLK + FL_SLEEP?

It seems to me that the vfs_lock_file() description and the
distinction it makes between F_SETLK and F_SETLKW makes sense in a
scenario when locking is implemented locally in the kernel, but not in
the context of dlm_controld, where all the locking decisions are made
in user-space: in the former scenario, F_SETLK requests can be decided
immediately without sleeping, but F_SETLKW requests may sleep. In the
dlm_controld scenario, locking requests can never be decided
immediately, and the requesting task will always sleep. So from the
point of view of lockd, any request to dlm_controld should probably be
treated as asynchronous.

This makes things a bit more ugly than I was hoping for.

Thanks,
Andreas

> - Alex
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/clntproc.c?h=v6.4-rc4#n186
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n501
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n240
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n535
> [5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n489
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-06-03 10:02                   ` Andreas Gruenbacher
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2023-06-03 10:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Jun 1, 2023 at 9:10?PM Alexander Aring <aahringo@redhat.com> wrote:
> Hi,
>
> On Thu, Jun 1, 2023 at 1:11?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > On Thu, Jun 1, 2023 at 6:28?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > Hi,
> > >
> > > On Tue, May 30, 2023 at 1:40?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > >
> > > > On Tue, May 30, 2023 at 4:08?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > Hi,
> > > > >
> > > > > On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > > > > > > <agruenba@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > > > open filedescriptor posix locks.
> > > > > > > > >
> > > > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > > > >
> > > > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > > > lock.
> > > > > > > >
> > > > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > > > can only be because they originated from multiple threads of the same
> > > > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > > > request. We do need to compare the additional request fields in
> > > > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > > > "right" request (or in case there is more than one identical request,
> > > > > > > > a "suitable" request).
> > > > > > > >
> > > > > > >
> > > > > > > In my "definition" why this works is as you said the "identical
> > > > > > > request". There is a more deeper definition of "when is a request
> > > > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > > > to request B when they get granted under the same 'time'" which is all
> > > > > > > the fields you mentioned.
> > > > > > >
> > > > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > > > "identical request" you cancel because the kernel and user
> > > > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > > > need to have at least the same amount of "results" coming back from
> > > > > > > user space as the amount you are waiting for a result for the same
> > > > > > > "identical request".
> > > > > >
> > > > > > That's not incorrect per se, but cancellations create an additional
> > > > > > difficulty: they can either succeed or fail. To indicate that a
> > > > > > cancellation has succeeded, a new type of message can be introduced
> > > > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > > > belong to a locking request that is being cancelled. When cancelling a
> > > > > > locking request fails, the kernel will see a "locking request granted"
> > > > > > message though, and when multiple identical locking requests are
> > > > > > queued and only some of them have been cancelled, it won't be obvious
> > > > > > which locking request a "locking request granted" message should be
> > > > > > assigned to anymore. You really don't want to mix things up in that
> > > > > > case.
> > > > > >
> > > > > > This complication makes it a whole lot more difficult to reason about
> > > > > > the correctness of the code. All that complexity is avoidable by
> > > > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > > > request identifier).
> > > > > >
> > > > > > To put it differently, you can shoot yourself in the foot and still
> > > > > > hop along on the other leg, but it may not be the best of all possible
> > > > > > ideas.
> > > > > >
> > > > >
> > > > > It makes things more complicated, I agree and the reason why this
> > > > > works now is because there are a lot of "dependencies". I would love
> > > > > to have an unique identifier to make it possible that we can follow an
> > > > > instance handle of the original lock request.
> > > > >
> > > > > * an unique identifier which also works with the async lock request of
> > > > > lockd case.
> > > >
> > > > What's the lockd case you're referring to here, and why is it relevant
> > > > for the problem at hand?
> > >
> > > just mentioned that we need a solution which also works for the
> > > asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> > > user lockd. [0] DLM plock handling implements the behaviour mentioned
> > > at [0] but lm_grant() callback can also return negative values and
> > > signals that the lock request was cancelled (on nfs layer) and then
> > > need to tell it DLM. This however is not supported as we have a lack
> > > of cancellation.
> >
> > Ouch, that's a bit messy. But if the vfs_lock_file() description is
> > accurate, then only F_SETLK requests arriving via lockd can be
> > asynchronous, and F_SETLKW requests never are asynchronous. And we
> > only need to cancel F_SETLKW requests. It follows that we only ever
> > need to cancel synchronous requests.
> >
>
> I looked into the code and I think the second sentence of [0] is
> important regarding to F_SLEEP "Callers expecting ->lock() to return
> asynchronously will only use F_SETLK, not F_SETLKW; they will set
> FL_SLEEP if (and only if) the request is for a blocking lock.".
> If lockd does a lock request, it will then set args.wait to 1 if it
> was F_SETLKW [1].

Hmm, this sets args.block?

> The receiver will then create a blocking request [2]
> and set FL_SLEEP [3]; it does unset FL_SLEEP when args.wait (now as
> wait parameter) wasn't set. There is a case of [5] which unset wait
> again, but it seems we can end with FL_SLEEP there anyway.
>
> I think we have currently an issue here that we handle F_SETLK when
> F_SLEEP (in case of async lockd request) is handled as trylock which
> isn't right. Don't ask me why they are going over F_SLEEP and F_SETLK
> and not just using F_SETLKW to signal that it is a blocking request.
> So we actually have the F_SETLKW case but it's just signaled with
> F_SETLK + FL_SLEEP?

It seems to me that the vfs_lock_file() description and the
distinction it makes between F_SETLK and F_SETLKW makes sense in a
scenario when locking is implemented locally in the kernel, but not in
the context of dlm_controld, where all the locking decisions are made
in user-space: in the former scenario, F_SETLK requests can be decided
immediately without sleeping, but F_SETLKW requests may sleep. In the
dlm_controld scenario, locking requests can never be decided
immediately, and the requesting task will always sleep. So from the
point of view of lockd, any request to dlm_controld should probably be
treated as asynchronous.

This makes things a bit more ugly than I was hoping for.

Thanks,
Andreas

> - Alex
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/clntproc.c?h=v6.4-rc4#n186
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n501
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n240
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n535
> [5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc4#n489
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
  2023-06-03 10:02                   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-06-03 12:02                     ` Alexander Aring
  -1 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-06-03 12:02 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: teigland, cluster-devel, stable

Hi,

On Sat, Jun 3, 2023 at 6:03 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Thu, Jun 1, 2023 at 9:10 PM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Thu, Jun 1, 2023 at 1:11 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > On Thu, Jun 1, 2023 at 6:28 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > On Tue, May 30, 2023 at 1:40 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 30, 2023 at 4:08 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > > > > > > <agruenba@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > > > > open filedescriptor posix locks.
> > > > > > > > > >
> > > > > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > > > > >
> > > > > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > > > > lock.
> > > > > > > > >
> > > > > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > > > > can only be because they originated from multiple threads of the same
> > > > > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > > > > request. We do need to compare the additional request fields in
> > > > > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > > > > "right" request (or in case there is more than one identical request,
> > > > > > > > > a "suitable" request).
> > > > > > > > >
> > > > > > > >
> > > > > > > > In my "definition" why this works is as you said the "identical
> > > > > > > > request". There is a more deeper definition of "when is a request
> > > > > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > > > > to request B when they get granted under the same 'time'" which is all
> > > > > > > > the fields you mentioned.
> > > > > > > >
> > > > > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > > > > "identical request" you cancel because the kernel and user
> > > > > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > > > > need to have at least the same amount of "results" coming back from
> > > > > > > > user space as the amount you are waiting for a result for the same
> > > > > > > > "identical request".
> > > > > > >
> > > > > > > That's not incorrect per se, but cancellations create an additional
> > > > > > > difficulty: they can either succeed or fail. To indicate that a
> > > > > > > cancellation has succeeded, a new type of message can be introduced
> > > > > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > > > > belong to a locking request that is being cancelled. When cancelling a
> > > > > > > locking request fails, the kernel will see a "locking request granted"
> > > > > > > message though, and when multiple identical locking requests are
> > > > > > > queued and only some of them have been cancelled, it won't be obvious
> > > > > > > which locking request a "locking request granted" message should be
> > > > > > > assigned to anymore. You really don't want to mix things up in that
> > > > > > > case.
> > > > > > >
> > > > > > > This complication makes it a whole lot more difficult to reason about
> > > > > > > the correctness of the code. All that complexity is avoidable by
> > > > > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > > > > request identifier).
> > > > > > >
> > > > > > > To put it differently, you can shoot yourself in the foot and still
> > > > > > > hop along on the other leg, but it may not be the best of all possible
> > > > > > > ideas.
> > > > > > >
> > > > > >
> > > > > > It makes things more complicated, I agree and the reason why this
> > > > > > works now is because there are a lot of "dependencies". I would love
> > > > > > to have an unique identifier to make it possible that we can follow an
> > > > > > instance handle of the original lock request.
> > > > > >
> > > > > > * an unique identifier which also works with the async lock request of
> > > > > > lockd case.
> > > > >
> > > > > What's the lockd case you're referring to here, and why is it relevant
> > > > > for the problem at hand?
> > > >
> > > > just mentioned that we need a solution which also works for the
> > > > asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> > > > user lockd. [0] DLM plock handling implements the behaviour mentioned
> > > > at [0] but lm_grant() callback can also return negative values and
> > > > signals that the lock request was cancelled (on nfs layer) and then
> > > > need to tell it DLM. This however is not supported as we have a lack
> > > > of cancellation.
> > >
> > > Ouch, that's a bit messy. But if the vfs_lock_file() description is
> > > accurate, then only F_SETLK requests arriving via lockd can be
> > > asynchronous, and F_SETLKW requests never are asynchronous. And we
> > > only need to cancel F_SETLKW requests. It follows that we only ever
> > > need to cancel synchronous requests.
> > >
> >
> > I looked into the code and I think the second sentence of [0] is
> > important regarding to F_SLEEP "Callers expecting ->lock() to return
> > asynchronously will only use F_SETLK, not F_SETLKW; they will set
> > FL_SLEEP if (and only if) the request is for a blocking lock.".
> > If lockd does a lock request, it will then set args.wait to 1 if it
> > was F_SETLKW [1].
>
> Hmm, this sets args.block?
>

You are right but args.block gets passed as the wait parameter to nlmsvc_lock().

An example user:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svc4proc.c#n158

> > The receiver will then create a blocking request [2]
> > and set FL_SLEEP [3]; it does unset FL_SLEEP when args.wait (now as
> > wait parameter) wasn't set. There is a case of [5] which unset wait
> > again, but it seems we can end with FL_SLEEP there anyway.
> >
> > I think we have currently an issue here that we handle F_SETLK when
> > F_SLEEP (in case of async lockd request) is handled as trylock which
> > isn't right. Don't ask me why they are going over F_SLEEP and F_SETLK
> > and not just using F_SETLKW to signal that it is a blocking request.
> > So we actually have the F_SETLKW case but it's just signaled with
> > F_SETLK + FL_SLEEP?
>
> It seems to me that the vfs_lock_file() description and the
> distinction it makes between F_SETLK and F_SETLKW makes sense in a
> scenario when locking is implemented locally in the kernel, but not in
> the context of dlm_controld, where all the locking decisions are made
> in user-space: in the former scenario, F_SETLK requests can be decided
> immediately without sleeping, but F_SETLKW requests may sleep. In the
> dlm_controld scenario, locking requests can never be decided
> immediately, and the requesting task will always sleep. So from the
> point of view of lockd, any request to dlm_controld should probably be
> treated as asynchronous.
>
> This makes things a bit more ugly than I was hoping for.
>

I think they describe a "blocking" request as what you describe with
F_SETLKW. Everything is asynchronous, but F_SETLKW is the "blocking
request" by not actually meaning the lock() callback blocks in the
sense of wait_event(). There is however a semantic difference between
F_SETLK and F_SETLKW in sense when the asynchronous result comes back
and what the result is.

I looked more into this and a "somewhat" recent commit confuses me
more. It's 40595cdc93ed ("nfs: block notification on fs with its own
->lock") [0] and from what I understood, they signaled before F_SETLKW
with FL_SLEEP set, but they realized that mostly everybody does not
follow this API, so they switch to only F_SETLK and assume that the
most nfs clients do polling for a look. If you can't use F_SETLKW,
then the alternative could be polling the lock with F_SETLK, which
isn't a good design to use on DLM. In the commit message it's also
mentioned that they may bring back the old behaviour.

I think the comment in [1] is not up to date anymore.

There are also only two users of "fl_lmops" evaluation, which is fcntl
core and DLM, fuse will reject any lock request when "fl_lmops" is
set.

To be honest I have no idea how to proceed here, probably just look in
fcntl core and do what they do. I also tested dlm/next with a nfsv3
server on gfs2 by running "stress-ng --fcntl 16", I see some worried
error messages on the kernel log coming up.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/lockd/svclock.c?h=v6.4-rc4&id=40595cdc93edf4110c0f0c0b06f8d82008f23929
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
@ 2023-06-03 12:02                     ` Alexander Aring
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Aring @ 2023-06-03 12:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Sat, Jun 3, 2023 at 6:03?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Thu, Jun 1, 2023 at 9:10?PM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Thu, Jun 1, 2023 at 1:11?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > On Thu, Jun 1, 2023 at 6:28?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > On Tue, May 30, 2023 at 1:40?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 30, 2023 at 4:08?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > > > > > > > <agruenba@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > > > > open filedescriptor posix locks.
> > > > > > > > > >
> > > > > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > > > > >
> > > > > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > > > > lock.
> > > > > > > > >
> > > > > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > > > > can only be because they originated from multiple threads of the same
> > > > > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > > > > request. We do need to compare the additional request fields in
> > > > > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > > > > "right" request (or in case there is more than one identical request,
> > > > > > > > > a "suitable" request).
> > > > > > > > >
> > > > > > > >
> > > > > > > > In my "definition" why this works is as you said the "identical
> > > > > > > > request". There is a more deeper definition of "when is a request
> > > > > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > > > > to request B when they get granted under the same 'time'" which is all
> > > > > > > > the fields you mentioned.
> > > > > > > >
> > > > > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > > > > "identical request" you cancel because the kernel and user
> > > > > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > > > > need to have at least the same amount of "results" coming back from
> > > > > > > > user space as the amount you are waiting for a result for the same
> > > > > > > > "identical request".
> > > > > > >
> > > > > > > That's not incorrect per se, but cancellations create an additional
> > > > > > > difficulty: they can either succeed or fail. To indicate that a
> > > > > > > cancellation has succeeded, a new type of message can be introduced
> > > > > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > > > > belong to a locking request that is being cancelled. When cancelling a
> > > > > > > locking request fails, the kernel will see a "locking request granted"
> > > > > > > message though, and when multiple identical locking requests are
> > > > > > > queued and only some of them have been cancelled, it won't be obvious
> > > > > > > which locking request a "locking request granted" message should be
> > > > > > > assigned to anymore. You really don't want to mix things up in that
> > > > > > > case.
> > > > > > >
> > > > > > > This complication makes it a whole lot more difficult to reason about
> > > > > > > the correctness of the code. All that complexity is avoidable by
> > > > > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > > > > request identifier).
> > > > > > >
> > > > > > > To put it differently, you can shoot yourself in the foot and still
> > > > > > > hop along on the other leg, but it may not be the best of all possible
> > > > > > > ideas.
> > > > > > >
> > > > > >
> > > > > > It makes things more complicated, I agree and the reason why this
> > > > > > works now is because there are a lot of "dependencies". I would love
> > > > > > to have an unique identifier to make it possible that we can follow an
> > > > > > instance handle of the original lock request.
> > > > > >
> > > > > > * an unique identifier which also works with the async lock request of
> > > > > > lockd case.
> > > > >
> > > > > What's the lockd case you're referring to here, and why is it relevant
> > > > > for the problem at hand?
> > > >
> > > > just mentioned that we need a solution which also works for the
> > > > asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> > > > user lockd. [0] DLM plock handling implements the behaviour mentioned
> > > > at [0] but lm_grant() callback can also return negative values and
> > > > signals that the lock request was cancelled (on nfs layer) and then
> > > > need to tell it DLM. This however is not supported as we have a lack
> > > > of cancellation.
> > >
> > > Ouch, that's a bit messy. But if the vfs_lock_file() description is
> > > accurate, then only F_SETLK requests arriving via lockd can be
> > > asynchronous, and F_SETLKW requests never are asynchronous. And we
> > > only need to cancel F_SETLKW requests. It follows that we only ever
> > > need to cancel synchronous requests.
> > >
> >
> > I looked into the code and I think the second sentence of [0] is
> > important regarding to F_SLEEP "Callers expecting ->lock() to return
> > asynchronously will only use F_SETLK, not F_SETLKW; they will set
> > FL_SLEEP if (and only if) the request is for a blocking lock.".
> > If lockd does a lock request, it will then set args.wait to 1 if it
> > was F_SETLKW [1].
>
> Hmm, this sets args.block?
>

You are right but args.block gets passed as the wait parameter to nlmsvc_lock().

An example user:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svc4proc.c#n158

> > The receiver will then create a blocking request [2]
> > and set FL_SLEEP [3]; it does unset FL_SLEEP when args.wait (now as
> > wait parameter) wasn't set. There is a case of [5] which unset wait
> > again, but it seems we can end with FL_SLEEP there anyway.
> >
> > I think we have currently an issue here that we handle F_SETLK when
> > F_SLEEP (in case of async lockd request) is handled as trylock which
> > isn't right. Don't ask me why they are going over F_SLEEP and F_SETLK
> > and not just using F_SETLKW to signal that it is a blocking request.
> > So we actually have the F_SETLKW case but it's just signaled with
> > F_SETLK + FL_SLEEP?
>
> It seems to me that the vfs_lock_file() description and the
> distinction it makes between F_SETLK and F_SETLKW makes sense in a
> scenario when locking is implemented locally in the kernel, but not in
> the context of dlm_controld, where all the locking decisions are made
> in user-space: in the former scenario, F_SETLK requests can be decided
> immediately without sleeping, but F_SETLKW requests may sleep. In the
> dlm_controld scenario, locking requests can never be decided
> immediately, and the requesting task will always sleep. So from the
> point of view of lockd, any request to dlm_controld should probably be
> treated as asynchronous.
>
> This makes things a bit more ugly than I was hoping for.
>

I think they describe a "blocking" request as what you describe with
F_SETLKW. Everything is asynchronous, but F_SETLKW is the "blocking
request" by not actually meaning the lock() callback blocks in the
sense of wait_event(). There is however a semantic difference between
F_SETLK and F_SETLKW in sense when the asynchronous result comes back
and what the result is.

I looked more into this and a "somewhat" recent commit confuses me
more. It's 40595cdc93ed ("nfs: block notification on fs with its own
->lock") [0] and from what I understood, they signaled before F_SETLKW
with FL_SLEEP set, but they realized that mostly everybody does not
follow this API, so they switch to only F_SETLK and assume that the
most nfs clients do polling for a look. If you can't use F_SETLKW,
then the alternative could be polling the lock with F_SETLK, which
isn't a good design to use on DLM. In the commit message it's also
mentioned that they may bring back the old behaviour.

I think the comment in [1] is not up to date anymore.

There are also only two users of "fl_lmops" evaluation, which is fcntl
core and DLM, fuse will reject any lock request when "fl_lmops" is
set.

To be honest I have no idea how to proceed here, probably just look in
fcntl core and do what they do. I also tested dlm/next with a nfsv3
server on gfs2 by running "stress-ng --fcntl 16", I see some worried
error messages on the kernel log coming up.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/lockd/svclock.c?h=v6.4-rc4&id=40595cdc93edf4110c0f0c0b06f8d82008f23929
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-06-03 12:03 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-24 16:02 [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions Alexander Aring
2023-05-24 16:02 ` [Cluster-devel] " Alexander Aring
2023-05-25 15:02 ` Andreas Gruenbacher
2023-05-25 15:02   ` [Cluster-devel] " Andreas Gruenbacher
2023-05-29 22:18   ` Alexander Aring
2023-05-29 22:18     ` [Cluster-devel] " Alexander Aring
2023-05-29 22:20     ` Alexander Aring
2023-05-29 22:20       ` [Cluster-devel] " Alexander Aring
2023-05-30 11:01     ` Andreas Gruenbacher
2023-05-30 11:01       ` [Cluster-devel] " Andreas Gruenbacher
2023-05-30 14:06       ` Alexander Aring
2023-05-30 14:06         ` [Cluster-devel] " Alexander Aring
2023-05-30 17:40         ` Andreas Gruenbacher
2023-05-30 17:40           ` [Cluster-devel] " Andreas Gruenbacher
2023-06-01 16:28           ` Alexander Aring
2023-06-01 16:28             ` [Cluster-devel] " Alexander Aring
2023-06-01 17:10             ` Andreas Gruenbacher
2023-06-01 17:10               ` [Cluster-devel] " Andreas Gruenbacher
2023-06-01 19:09               ` Alexander Aring
2023-06-01 19:09                 ` [Cluster-devel] " Alexander Aring
2023-06-03 10:02                 ` Andreas Gruenbacher
2023-06-03 10:02                   ` [Cluster-devel] " Andreas Gruenbacher
2023-06-03 12:02                   ` Alexander Aring
2023-06-03 12:02                     ` [Cluster-devel] " Alexander Aring

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.