netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 net-next 00/14] af_unix: Rework GC.
@ 2024-02-23 21:39 Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 01/14] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

When we pass a file descriptor to an AF_UNIX socket via SCM_RIGTHS,
the underlying struct file of the inflight fd gets its refcount bumped.
If the fd is of an AF_UNIX socket, we need to track it in case it forms
cyclic references.

Let's say we send a fd of AF_UNIX socket A to B and vice versa and
close() both sockets.

When created, each socket's struct file initially has one reference.
After the fd exchange, both refcounts are bumped up to 2.  Then, close()
decreases both to 1.  From this point on, no one can touch the file/socket.

However, the struct file has one refcount and thus never calls the
release() function of the AF_UNIX socket.

That's why we need to track all inflight AF_UNIX sockets and run garbage
collection.

This series replaces the current GC implementation that locks each inflight
socket's receive queue and requires trickiness in other places.

The new GC does not lock each socket's queue to minimise its effect and
tries to be lightweight if there is no cyclic reference or no update in
the shape of the inflight fd graph.

The new implementation is based on Tarjan's Strongly Connected Components
algorithm, and we will consider each inflight AF_UNIX socket as a vertex
and its file descriptor as an edge in a directed graph.

For the details, please see each patch.

  patch 1  -  3 : Add struct to express inflight socket graphs
  patch       4 : Optimse inflight fd counting
  patch       5 : Group SCC possibly forming a cycle
  patch 6  -  7 : Support embryo socket
  patch 8  - 10 : Make GC lightweight
  patch 11 - 12 : Detect dead cycle references
  patch      13 : Replace GC algorithm
  patch      14 : selftest

After this series is applied, we can remove the two ugly tricks for race,
scm_fp_dup() in unix_attach_fds() and spin_lock dance in unix_peek_fds()
as done in patch 14/15 of v1.


Changes:
  v3:
    * Patch 1
      * Allocate struct unix_vertex dynamically only for inflight socket
    * Patch 2
      * Rename unix_edge.entry to unix_edge.vertex_entry
      * Change edge->successor/predecessor to struct unix_sock
    * Patch 7
      * Moved after SCC detection patch
      * Fix up embryo successor during GC instead of overwriting edge
        in unix_add_edge()
        * To not allcoate unix_vertex to listener for embryo socket
        * Kept the name unix_update_edges() unchanged as it affect
          successor tracking during GC
    * Patch 12
      * Drop self_degree and check all edges
        * To not allcoate unix_vertex to listener for embryo socket

  v2: https://lore.kernel.org/netdev/20240216210556.65913-1-kuniyu@amazon.com/
    * Drop 2 patches as follow-up that removes trickiness in
      unix_attach_fds() and unix_peek_fds().

    * Patch 2
      * Fix build error when CONFIG_UNIX=n
    * Patch 3
      * Remove unnecessary INIT_LIST_HEAD()
    * Patch 7
      * Fix build warning for using goto label at the end of the loop
    * Patch 13
      * Call kfree_skb() for oob skb
    * Patch 14
      * Add test case for MSG_OOB

  v1: https://lore.kernel.org/netdev/20240203030058.60750-1-kuniyu@amazon.com/


Kuniyuki Iwashima (14):
  af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd.
  af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.
  af_unix: Link struct unix_edge when queuing skb.
  af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.
  af_unix: Detect Strongly Connected Components.
  af_unix: Save listener for embryo socket.
  af_unix: Fix up unix_edge.successor for embryo socket.
  af_unix: Save O(n) setup of Tarjan's algo.
  af_unix: Skip GC if no cycle exists.
  af_unix: Avoid Tarjan's algorithm if unnecessary.
  af_unix: Assign a unique index to SCC.
  af_unix: Detect dead SCC.
  af_unix: Replace garbage collection algorithm.
  selftest: af_unix: Test GC for SCM_RIGHTS.

 include/net/af_unix.h                         |  31 +-
 include/net/scm.h                             |   9 +
 net/core/scm.c                                |  11 +
 net/unix/af_unix.c                            |  27 +-
 net/unix/garbage.c                            | 526 +++++++++++-------
 tools/testing/selftests/net/.gitignore        |   1 +
 tools/testing/selftests/net/af_unix/Makefile  |   2 +-
 .../selftests/net/af_unix/scm_rights.c        | 286 ++++++++++
 8 files changed, 685 insertions(+), 208 deletions(-)
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 01/14] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 02/14] af_unix: Allocate struct unix_edge " Kuniyuki Iwashima
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will replace the garbage collection algorithm for AF_UNIX, where
we will consider each inflight AF_UNIX socket as a vertex and its file
descriptor as an edge in a directed graph.

This patch introduces a new struct unix_vertex representing a vertex
in the graph and adds its pointer to struct unix_sock.

When we send a fd using the SCM_RIGHTS message, we allocate struct
scm_fp_list to struct scm_cookie in scm_fp_copy().  Then, we bump
each refcount of the inflight fds' struct file and save them in
scm_fp_list.fp.

After that, unix_attach_fds() inexplicably clones scm_fp_list of
scm_cookie and sets it to skb.  (We will remove this part after
replacing GC.)

Here, we add a new function call in unix_attach_fds() to preallocate
struct unix_vertex per inflight AF_UNIX fd and link each vertex to
skb's scm_fp_list.vertices.

When sendmsg() succeeds later, if the socket of the inflight fd is
still not inflight yet, we will set the preallocated vertex to struct
unix_sock.vertex and link it to a global list unix_unvisited_vertices
under spin_lock(&unix_gc_lock).

If the socket is already inflight, we free the preallocated vertex.
This is to avoid taking the lock unnecessarily when sendmsg() could
fail later.

In the following patch, we will similarly allocate another struct
per edge, which will finally be linked to the inflight socket's
unix_vertex.edges.

And then, we will count the number of edges as unix_vertex.out_degree.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  9 +++++++++
 include/net/scm.h     |  3 +++
 net/core/scm.c        |  7 +++++++
 net/unix/af_unix.c    |  6 ++++++
 net/unix/garbage.c    | 38 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 63 insertions(+)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 627ea8e2d915..c270877a5256 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -22,9 +22,17 @@ extern unsigned int unix_tot_inflight;
 
 void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
+int unix_prepare_fpl(struct scm_fp_list *fpl);
+void unix_destroy_fpl(struct scm_fp_list *fpl);
 void unix_gc(void);
 void wait_for_unix_gc(struct scm_fp_list *fpl);
 
+struct unix_vertex {
+	struct list_head edges;
+	struct list_head entry;
+	unsigned long out_degree;
+};
+
 struct sock *unix_peer_get(struct sock *sk);
 
 #define UNIX_HASH_MOD	(256 - 1)
@@ -62,6 +70,7 @@ struct unix_sock {
 	struct path		path;
 	struct mutex		iolock, bindlock;
 	struct sock		*peer;
+	struct unix_vertex	*vertex;
 	struct list_head	link;
 	unsigned long		inflight;
 	spinlock_t		lock;
diff --git a/include/net/scm.h b/include/net/scm.h
index 92276a2c5543..e34321b6e204 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -27,6 +27,9 @@ struct scm_fp_list {
 	short			count;
 	short			count_unix;
 	short			max;
+#ifdef CONFIG_UNIX
+	struct list_head	vertices;
+#endif
 	struct user_struct	*user;
 	struct file		*fp[SCM_MAX_FD];
 };
diff --git a/net/core/scm.c b/net/core/scm.c
index 9cd4b0a01cd6..87dfec1c3378 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -89,6 +89,9 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 		fpl->count_unix = 0;
 		fpl->max = SCM_MAX_FD;
 		fpl->user = NULL;
+#if IS_ENABLED(CONFIG_UNIX)
+		INIT_LIST_HEAD(&fpl->vertices);
+#endif
 	}
 	fpp = &fpl->fp[fpl->count];
 
@@ -376,8 +379,12 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
+
 		new_fpl->max = new_fpl->count;
 		new_fpl->user = get_uid(fpl->user);
+#if IS_ENABLED(CONFIG_UNIX)
+		INIT_LIST_HEAD(&new_fpl->vertices);
+#endif
 	}
 	return new_fpl;
 }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5b41e2321209..a3b25d311560 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -980,6 +980,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_destruct		= unix_sock_destructor;
 	u = unix_sk(sk);
 	u->inflight = 0;
+	u->vertex = NULL;
 	u->path.dentry = NULL;
 	u->path.mnt = NULL;
 	spin_lock_init(&u->lock);
@@ -1805,6 +1806,9 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	for (i = scm->fp->count - 1; i >= 0; i--)
 		unix_inflight(scm->fp->user, scm->fp->fp[i]);
 
+	if (unix_prepare_fpl(UNIXCB(skb).fp))
+		return -ENOMEM;
+
 	return 0;
 }
 
@@ -1815,6 +1819,8 @@ static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	scm->fp = UNIXCB(skb).fp;
 	UNIXCB(skb).fp = NULL;
 
+	unix_destroy_fpl(scm->fp);
+
 	for (i = scm->fp->count - 1; i >= 0; i--)
 		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
 }
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index fa39b6265238..75bdf66b81df 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -101,6 +101,44 @@ struct unix_sock *unix_get_socket(struct file *filp)
 	return NULL;
 }
 
+static void unix_free_vertices(struct scm_fp_list *fpl)
+{
+	struct unix_vertex *vertex, *next_vertex;
+
+	list_for_each_entry_safe(vertex, next_vertex, &fpl->vertices, entry) {
+		list_del(&vertex->entry);
+		kfree(vertex);
+	}
+}
+
+int unix_prepare_fpl(struct scm_fp_list *fpl)
+{
+	struct unix_vertex *vertex;
+	int i;
+
+	if (!fpl->count_unix)
+		return 0;
+
+	for (i = 0; i < fpl->count_unix; i++) {
+		vertex = kmalloc(sizeof(*vertex), GFP_KERNEL);
+		if (!vertex)
+			goto err;
+
+		list_add(&vertex->entry, &fpl->vertices);
+	}
+
+	return 0;
+
+err:
+	unix_free_vertices(fpl);
+	return -ENOMEM;
+}
+
+void unix_destroy_fpl(struct scm_fp_list *fpl)
+{
+	unix_free_vertices(fpl);
+}
+
 DEFINE_SPINLOCK(unix_gc_lock);
 unsigned int unix_tot_inflight;
 static LIST_HEAD(gc_candidates);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 02/14] af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 01/14] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 03/14] af_unix: Link struct unix_edge when queuing skb Kuniyuki Iwashima
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

As with the previous patch, we preallocate to skb's scm_fp_list an
array of struct unix_edge in the number of inflight AF_UNIX fds.

There we just preallocate memory and do not use immediately because
sendmsg() could fail after this point.  The actual use will be in
the next patch.

When we queue skb with inflight edges, we will set the inflight
socket's unix_sock as unix_edge->predecessor and the receiver's
unix_sock as successor, and then we will link the edge to the
inflight socket's unix_vertex.edges.

Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup()
so that MSG_PEEK does not change the shape of the directed graph.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h | 6 ++++++
 include/net/scm.h     | 5 +++++
 net/core/scm.c        | 2 ++
 net/unix/garbage.c    | 6 ++++++
 4 files changed, 19 insertions(+)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index c270877a5256..55c4abc26a71 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -33,6 +33,12 @@ struct unix_vertex {
 	unsigned long out_degree;
 };
 
+struct unix_edge {
+	struct unix_sock *predecessor;
+	struct unix_sock *successor;
+	struct list_head vertex_entry;
+};
+
 struct sock *unix_peer_get(struct sock *sk);
 
 #define UNIX_HASH_MOD	(256 - 1)
diff --git a/include/net/scm.h b/include/net/scm.h
index e34321b6e204..5f5154e5096d 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -23,12 +23,17 @@ struct scm_creds {
 	kgid_t	gid;
 };
 
+#ifdef CONFIG_UNIX
+struct unix_edge;
+#endif
+
 struct scm_fp_list {
 	short			count;
 	short			count_unix;
 	short			max;
 #ifdef CONFIG_UNIX
 	struct list_head	vertices;
+	struct unix_edge	*edges;
 #endif
 	struct user_struct	*user;
 	struct file		*fp[SCM_MAX_FD];
diff --git a/net/core/scm.c b/net/core/scm.c
index 87dfec1c3378..1bcc8a2d65e3 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -90,6 +90,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 		fpl->max = SCM_MAX_FD;
 		fpl->user = NULL;
 #if IS_ENABLED(CONFIG_UNIX)
+		fpl->edges = NULL;
 		INIT_LIST_HEAD(&fpl->vertices);
 #endif
 	}
@@ -383,6 +384,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		new_fpl->max = new_fpl->count;
 		new_fpl->user = get_uid(fpl->user);
 #if IS_ENABLED(CONFIG_UNIX)
+		new_fpl->edges = NULL;
 		INIT_LIST_HEAD(&new_fpl->vertices);
 #endif
 	}
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 75bdf66b81df..f31917683288 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -127,6 +127,11 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
 		list_add(&vertex->entry, &fpl->vertices);
 	}
 
+	fpl->edges = kvmalloc_array(fpl->count_unix, sizeof(*fpl->edges),
+				    GFP_KERNEL_ACCOUNT);
+	if (!fpl->edges)
+		goto err;
+
 	return 0;
 
 err:
@@ -136,6 +141,7 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
 
 void unix_destroy_fpl(struct scm_fp_list *fpl)
 {
+	kvfree(fpl->edges);
 	unix_free_vertices(fpl);
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 03/14] af_unix: Link struct unix_edge when queuing skb.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 01/14] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 02/14] af_unix: Allocate struct unix_edge " Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight " Kuniyuki Iwashima
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Just before queuing skb with inflight fds, we call scm_stat_add(),
which is a good place to set up the preallocated struct unix_vertex
and struct unix_edge in UNIXCB(skb).fp.

Then, we call unix_add_edges() and construct the directed graph
as follows:

  1. Set the inflight socket's unix_sock to unix_edge.predecessor.
  2. Set the receiver's unix_sock to unix_edge.successor.
  3. Set the preallocated vertex to inflight socket's unix_sock.vertex.
  4. Link inflight socket's unix_vertex.entry to unix_unvisited_vertices.
  5. Link unix_edge.vertex_entry to the inflight socket's unix_vertex.edges.

Let's say we pass the fd of AF_UNIX socket A to B and the fd of B
to C.  The graph looks like this:

  +-------------------------+
  | unix_unvisited_vertices | <-------------------------.
  +-------------------------+                           |
  +                                                     |
  |     +--------------+             +--------------+   |         +--------------+
  |     |  unix_sock A | <---. .---> |  unix_sock B | <-|-. .---> |  unix_sock C |
  |     +--------------+     | |     +--------------+   | | |     +--------------+
  | .-+ |    vertex    |     | | .-+ |    vertex    |   | | |     |    vertex    |
  | |   +--------------+     | | |   +--------------+   | | |     +--------------+
  | |                        | | |                      | | |
  | |   +--------------+     | | |   +--------------+   | | |
  | '-> |  unix_vertex |     | | '-> |  unix_vertex |   | | |
  |     +--------------+     | |     +--------------+   | | |
  `---> |    entry     | +---------> |    entry     | +-' | |
        |--------------|     | |     |--------------|     | |
        |    edges     | <-. | |     |    edges     | <-. | |
        +--------------+   | | |     +--------------+   | | |
                           | | |                        | | |
    .----------------------' | | .----------------------' | |
    |                        | | |                        | |
    |   +--------------+     | | |   +--------------+     | |
    |   |   unix_edge  |     | | |   |   unix_edge  |     | |
    |   +--------------+     | | |   +--------------+     | |
    `-> | vertex_entry |     | | `-> | vertex_entry |     | |
        |--------------|     | |     |--------------|     | |
        |  predecessor | +---' |     |  predecessor | +---' |
        |--------------|       |     |--------------|       |
        |   successor  | +-----'     |   successor  | +-----'
        +--------------+             +--------------+

Henceforth, we denote such a graph as A -> B (-> C).

Now, we can express all inflight fd graphs that do not contain
embryo sockets.  We will support the particular case later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  2 +
 include/net/scm.h     |  1 +
 net/core/scm.c        |  2 +
 net/unix/af_unix.c    |  8 +++-
 net/unix/garbage.c    | 94 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 55c4abc26a71..f31ad1166346 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -22,6 +22,8 @@ extern unsigned int unix_tot_inflight;
 
 void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
+void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
+void unix_del_edges(struct scm_fp_list *fpl);
 int unix_prepare_fpl(struct scm_fp_list *fpl);
 void unix_destroy_fpl(struct scm_fp_list *fpl);
 void unix_gc(void);
diff --git a/include/net/scm.h b/include/net/scm.h
index 5f5154e5096d..bbc5527809d1 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -32,6 +32,7 @@ struct scm_fp_list {
 	short			count_unix;
 	short			max;
 #ifdef CONFIG_UNIX
+	bool			inflight;
 	struct list_head	vertices;
 	struct unix_edge	*edges;
 #endif
diff --git a/net/core/scm.c b/net/core/scm.c
index 1bcc8a2d65e3..5763f3320358 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -90,6 +90,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 		fpl->max = SCM_MAX_FD;
 		fpl->user = NULL;
 #if IS_ENABLED(CONFIG_UNIX)
+		fpl->inflight = false;
 		fpl->edges = NULL;
 		INIT_LIST_HEAD(&fpl->vertices);
 #endif
@@ -384,6 +385,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		new_fpl->max = new_fpl->count;
 		new_fpl->user = get_uid(fpl->user);
 #if IS_ENABLED(CONFIG_UNIX)
+		new_fpl->inflight = false;
 		new_fpl->edges = NULL;
 		INIT_LIST_HEAD(&new_fpl->vertices);
 #endif
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a3b25d311560..24adbc4d5188 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1943,8 +1943,10 @@ static void scm_stat_add(struct sock *sk, struct sk_buff *skb)
 	struct scm_fp_list *fp = UNIXCB(skb).fp;
 	struct unix_sock *u = unix_sk(sk);
 
-	if (unlikely(fp && fp->count))
+	if (unlikely(fp && fp->count)) {
 		atomic_add(fp->count, &u->scm_stat.nr_fds);
+		unix_add_edges(fp, u);
+	}
 }
 
 static void scm_stat_del(struct sock *sk, struct sk_buff *skb)
@@ -1952,8 +1954,10 @@ static void scm_stat_del(struct sock *sk, struct sk_buff *skb)
 	struct scm_fp_list *fp = UNIXCB(skb).fp;
 	struct unix_sock *u = unix_sk(sk);
 
-	if (unlikely(fp && fp->count))
+	if (unlikely(fp && fp->count)) {
 		atomic_sub(fp->count, &u->scm_stat.nr_fds);
+		unix_del_edges(fp);
+	}
 }
 
 /*
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index f31917683288..96d0b1db3638 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -101,6 +101,42 @@ struct unix_sock *unix_get_socket(struct file *filp)
 	return NULL;
 }
 
+static LIST_HEAD(unix_unvisited_vertices);
+
+static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
+{
+	struct unix_vertex *vertex = edge->predecessor->vertex;
+
+	if (!vertex) {
+		vertex = list_first_entry(&fpl->vertices, typeof(*vertex), entry);
+		vertex->out_degree = 0;
+		INIT_LIST_HEAD(&vertex->edges);
+
+		list_move_tail(&vertex->entry, &unix_unvisited_vertices);
+
+		edge->predecessor->vertex = vertex;
+	}
+
+	vertex->out_degree++;
+
+	list_add_tail(&edge->vertex_entry, &vertex->edges);
+}
+
+static void unix_del_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
+{
+	struct unix_vertex *vertex = edge->predecessor->vertex;
+
+	list_del(&edge->vertex_entry);
+
+	vertex->out_degree--;
+
+	if (!vertex->out_degree) {
+		edge->predecessor->vertex = NULL;
+
+		list_move_tail(&vertex->entry, &fpl->vertices);
+	}
+}
+
 static void unix_free_vertices(struct scm_fp_list *fpl)
 {
 	struct unix_vertex *vertex, *next_vertex;
@@ -111,6 +147,60 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
 	}
 }
 
+DEFINE_SPINLOCK(unix_gc_lock);
+
+void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
+{
+	int i = 0, j = 0;
+
+	spin_lock(&unix_gc_lock);
+
+	if (!fpl->count_unix)
+		goto out;
+
+	do {
+		struct unix_sock *inflight = unix_get_socket(fpl->fp[j++]);
+		struct unix_edge *edge;
+
+		if (!inflight)
+			continue;
+
+		edge = fpl->edges + i++;
+		edge->predecessor = inflight;
+		edge->successor = receiver;
+
+		unix_add_edge(fpl, edge);
+	} while (i < fpl->count_unix);
+
+out:
+	spin_unlock(&unix_gc_lock);
+
+	fpl->inflight = true;
+
+	unix_free_vertices(fpl);
+}
+
+void unix_del_edges(struct scm_fp_list *fpl)
+{
+	int i = 0;
+
+	spin_lock(&unix_gc_lock);
+
+	if (!fpl->count_unix)
+		goto out;
+
+	do {
+		struct unix_edge *edge = fpl->edges + i++;
+
+		unix_del_edge(fpl, edge);
+	} while (i < fpl->count_unix);
+
+out:
+	spin_unlock(&unix_gc_lock);
+
+	fpl->inflight = false;
+}
+
 int unix_prepare_fpl(struct scm_fp_list *fpl)
 {
 	struct unix_vertex *vertex;
@@ -141,11 +231,13 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
 
 void unix_destroy_fpl(struct scm_fp_list *fpl)
 {
+	if (fpl->inflight)
+		unix_del_edges(fpl);
+
 	kvfree(fpl->edges);
 	unix_free_vertices(fpl);
 }
 
-DEFINE_SPINLOCK(unix_gc_lock);
 unsigned int unix_tot_inflight;
 static LIST_HEAD(gc_candidates);
 static LIST_HEAD(gc_inflight_list);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 03/14] af_unix: Link struct unix_edge when queuing skb Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-27 10:47   ` Paolo Abeni
  2024-02-23 21:39 ` [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Currently, we track the number of inflight sockets in two variables.
unix_tot_inflight is the total number of inflight AF_UNIX sockets on
the host, and user->unix_inflight is the number of inflight fds per
user.

We update them one by one in unix_inflight(), which can be done once
in batch.  Also, sendmsg() could fail even after unix_inflight(), then
we need to acquire unix_gc_lock only to decrement the counters.

Let's bulk update the counters in unix_add_edges() and unix_del_edges(),
which is called only for successfully passed fds.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 96d0b1db3638..e8fe08796d02 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -148,6 +148,7 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
 }
 
 DEFINE_SPINLOCK(unix_gc_lock);
+unsigned int unix_tot_inflight;
 
 void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
 {
@@ -172,7 +173,10 @@ void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
 		unix_add_edge(fpl, edge);
 	} while (i < fpl->count_unix);
 
+	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + fpl->count_unix);
 out:
+	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight + fpl->count);
+
 	spin_unlock(&unix_gc_lock);
 
 	fpl->inflight = true;
@@ -195,7 +199,10 @@ void unix_del_edges(struct scm_fp_list *fpl)
 		unix_del_edge(fpl, edge);
 	} while (i < fpl->count_unix);
 
+	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - fpl->count_unix);
 out:
+	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight - fpl->count);
+
 	spin_unlock(&unix_gc_lock);
 
 	fpl->inflight = false;
@@ -238,7 +245,6 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
-unsigned int unix_tot_inflight;
 static LIST_HEAD(gc_candidates);
 static LIST_HEAD(gc_inflight_list);
 
@@ -259,13 +265,8 @@ void unix_inflight(struct user_struct *user, struct file *filp)
 			WARN_ON_ONCE(list_empty(&u->link));
 		}
 		u->inflight++;
-
-		/* Paired with READ_ONCE() in wait_for_unix_gc() */
-		WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + 1);
 	}
 
-	WRITE_ONCE(user->unix_inflight, user->unix_inflight + 1);
-
 	spin_unlock(&unix_gc_lock);
 }
 
@@ -282,13 +283,8 @@ void unix_notinflight(struct user_struct *user, struct file *filp)
 		u->inflight--;
 		if (!u->inflight)
 			list_del_init(&u->link);
-
-		/* Paired with READ_ONCE() in wait_for_unix_gc() */
-		WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - 1);
 	}
 
-	WRITE_ONCE(user->unix_inflight, user->unix_inflight - 1);
-
 	spin_unlock(&unix_gc_lock);
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight " Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-25  0:34   ` Jakub Kicinski
  2024-02-27 11:02   ` Paolo Abeni
  2024-02-23 21:39 ` [PATCH v3 net-next 06/14] af_unix: Save listener for embryo socket Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

In the new GC, we use a simple graph algorithm, Tarjan's Strongly
Connected Components (SCC) algorithm, to find cyclic references.

The algorithm visits every vertex exactly once using depth-first
search (DFS).  We implement it without recursion so that no one
can abuse it.

There could be multiple graphs, so we iterate unix_unvisited_vertices
in unix_walk_scc() and do DFS in __unix_walk_scc(), where we move
visited vertices to another list, unix_visited_vertices, not to
restart DFS twice on a visited vertex later in unix_walk_scc().

DFS starts by pushing an input vertex to a stack and assigning it
a unique number.  Two fields, index and lowlink, are initialised
with the number, but lowlink could be updated later during DFS.

If a vertex has an edge to an unvisited inflight vertex, we visit
it and do the same processing.  So, we will have vertices in the
stack in the order they appear and number them consecutively in
the same order.

If a vertex has a back-edge to a visited vertex in the stack,
we update the predecessor's lowlink with the successor's index.

After iterating edges from the vertex, we check if its index
equals its lowlink.

If the lowlink is different from the index, it shows there was a
back-edge.  Then, we propagate the lowlink to its predecessor and
go back to the predecessor to resume checking from the next edge
of the back-edge.

If the lowlink is the same as the index, we pop vertices before
and including the vertex from the stack.  Then, the set of vertices
is SCC, possibly forming a cycle.  At the same time, we move the
vertices to unix_visited_vertices.

When we finish the algorithm, all vertices in each SCC will be
linked via unix_vertex.scc_entry.

Let's take an example.  We have a graph including five inflight
vertices (F is not inflight):

  A -> B -> C -> D -> E (-> F)
       ^         |
       `---------'

Suppose that we start DFS from C.  We will visit C, D, and B first
and initialise their index and lowlink.  Then, the stack looks like
this:

  > B = (3, 3)  (index, lowlink)
    D = (2, 2)
    C = (1, 1)

When checking B's edge to C, we update B's lowlink with C's index
and propagate it to D.

    B = (3, 1)  (index, lowlink)
  > D = (2, 1)
    C = (1, 1)

Next, we visit E, which has no edge to an inflight vertex.

  > E = (4, 4)  (index, lowlink)
    B = (3, 1)
    D = (2, 1)
    C = (1, 1)

When we leave from E, its index and lowlink are the same, so we
pop E from the stack as single-vertex SCC.  Next, we leave from
D but do nothing because its lowlink is different from its index.

    B = (3, 1)  (index, lowlink)
    D = (2, 1)
  > C = (1, 1)

Then, we leave from C, whose index and lowlink are the same, so
we pop B, D and C as SCC.

Last, we do DFS for the rest of vertices, A, which is also a
single-vertex SCC.

Finally, each unix_vertex.scc_entry is linked as follows:

  A -.  B -> C -> D  E -.
  ^  |  ^         |  ^  |
  `--'  `---------'  `--'

We use SCC later to decide whether we can garbage-collect the
sockets.

Note that we still cannot detect SCC properly if an edge points
to an embryo socket.  The following two patches will sort it out.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  5 +++
 net/unix/garbage.c    | 82 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index f31ad1166346..67736767b616 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -32,13 +32,18 @@ void wait_for_unix_gc(struct scm_fp_list *fpl);
 struct unix_vertex {
 	struct list_head edges;
 	struct list_head entry;
+	struct list_head scc_entry;
 	unsigned long out_degree;
+	unsigned long index;
+	unsigned long lowlink;
+	bool on_stack;
 };
 
 struct unix_edge {
 	struct unix_sock *predecessor;
 	struct unix_sock *successor;
 	struct list_head vertex_entry;
+	struct list_head stack_entry;
 };
 
 struct sock *unix_peer_get(struct sock *sk);
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index e8fe08796d02..7e90663513f9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -103,6 +103,11 @@ struct unix_sock *unix_get_socket(struct file *filp)
 
 static LIST_HEAD(unix_unvisited_vertices);
 
+enum unix_vertex_index {
+	UNIX_VERTEX_INDEX_UNVISITED,
+	UNIX_VERTEX_INDEX_START,
+};
+
 static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 {
 	struct unix_vertex *vertex = edge->predecessor->vertex;
@@ -245,6 +250,81 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
+static LIST_HEAD(unix_visited_vertices);
+
+static void __unix_walk_scc(struct unix_vertex *vertex)
+{
+	unsigned long index = UNIX_VERTEX_INDEX_START;
+	LIST_HEAD(vertex_stack);
+	struct unix_edge *edge;
+	LIST_HEAD(edge_stack);
+
+next_vertex:
+	vertex->index = index;
+	vertex->lowlink = index;
+	index++;
+
+	vertex->on_stack = true;
+	list_add(&vertex->scc_entry, &vertex_stack);
+
+	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
+		struct unix_vertex *next_vertex = edge->successor->vertex;
+
+		if (!next_vertex)
+			continue;
+
+		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
+			list_add(&edge->stack_entry, &edge_stack);
+
+			vertex = next_vertex;
+			goto next_vertex;
+prev_vertex:
+			next_vertex = vertex;
+
+			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
+			list_del_init(&edge->stack_entry);
+
+			vertex = edge->predecessor->vertex;
+
+			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
+		} else if (edge->successor->vertex->on_stack) {
+			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
+		}
+	}
+
+	if (vertex->index == vertex->lowlink) {
+		struct list_head scc;
+
+		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
+
+		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
+			list_move_tail(&vertex->entry, &unix_visited_vertices);
+
+			vertex->on_stack = false;
+		}
+
+		list_del(&scc);
+	}
+
+	if (!list_empty(&edge_stack))
+		goto prev_vertex;
+}
+
+static void unix_walk_scc(void)
+{
+	struct unix_vertex *vertex;
+
+	list_for_each_entry(vertex, &unix_unvisited_vertices, entry)
+		vertex->index = UNIX_VERTEX_INDEX_UNVISITED;
+
+	while (!list_empty(&unix_unvisited_vertices)) {
+		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
+		__unix_walk_scc(vertex);
+	}
+
+	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
+}
+
 static LIST_HEAD(gc_candidates);
 static LIST_HEAD(gc_inflight_list);
 
@@ -392,6 +472,8 @@ static void __unix_gc(struct work_struct *work)
 
 	spin_lock(&unix_gc_lock);
 
+	unix_walk_scc();
+
 	/* First, select candidates for garbage collection.  Only
 	 * in-flight sockets are considered, and from those only ones
 	 * which don't have any external reference.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 06/14] af_unix: Save listener for embryo socket.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 07/14] af_unix: Fix up unix_edge.successor " Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

This is a prep patch for the following change, where we need to
fetch the listening socket from the successor embryo socket
during GC.

We add a new field to struct unix_sock to save a pointer to a
listening socket.

We set it when connect() creates a new socket, and clear it when
accept() is called.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h | 1 +
 net/unix/af_unix.c    | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 67736767b616..dc7469191195 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -83,6 +83,7 @@ struct unix_sock {
 	struct path		path;
 	struct mutex		iolock, bindlock;
 	struct sock		*peer;
+	struct sock		*listener;
 	struct unix_vertex	*vertex;
 	struct list_head	link;
 	unsigned long		inflight;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 24adbc4d5188..af74e7ebc35a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -979,6 +979,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_max_ack_backlog	= net->unx.sysctl_max_dgram_qlen;
 	sk->sk_destruct		= unix_sock_destructor;
 	u = unix_sk(sk);
+	u->listener = NULL;
 	u->inflight = 0;
 	u->vertex = NULL;
 	u->path.dentry = NULL;
@@ -1598,6 +1599,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	newsk->sk_type		= sk->sk_type;
 	init_peercred(newsk);
 	newu = unix_sk(newsk);
+	newu->listener = other;
 	RCU_INIT_POINTER(newsk->sk_wq, &newu->peer_wq);
 	otheru = unix_sk(other);
 
@@ -1693,8 +1695,8 @@ static int unix_accept(struct socket *sock, struct socket *newsock, int flags,
 		       bool kern)
 {
 	struct sock *sk = sock->sk;
-	struct sock *tsk;
 	struct sk_buff *skb;
+	struct sock *tsk;
 	int err;
 
 	err = -EOPNOTSUPP;
@@ -1719,6 +1721,7 @@ static int unix_accept(struct socket *sock, struct socket *newsock, int flags,
 	}
 
 	tsk = skb->sk;
+	unix_sk(tsk)->listener = NULL;
 	skb_free_datagram(sk, skb);
 	wake_up_interruptible(&unix_sk(sk)->peer_wait);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 07/14] af_unix: Fix up unix_edge.successor for embryo socket.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 06/14] af_unix: Save listener for embryo socket Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 08/14] af_unix: Save O(n) setup of Tarjan's algo Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

To garbage collect inflight AF_UNIX sockets, we must define the
cyclic reference appropriately.  This is a bit tricky if the loop
consists of embryo sockets.

Suppose that the fd of AF_UNIX socket A is passed to D and the fd B
to C and that C and D are embryo sockets of A and B, respectively.
It may appear that there are two separate graphs, A (-> D) and
B (-> C), but this is not correct.

     A --. .-- B
          X
     C <-' `-> D

Now, D holds A's refcount, and C has B's refcount, so unix_release()
will never be called for A and B when we close() them.  However, no
one can call close() for D and C to free skbs holding refcounts of A
and B because C/D is in A/B's receive queue, which should have been
purged by unix_release() for A and B.

So, here's another type of cyclic reference.  When a fd of an AF_UNIX
socket is passed to an embryo socket, the reference is indirectly held
by its parent listening socket.

  .-> A                            .-> B
  |   `- sk_receive_queue          |   `- sk_receive_queue
  |      `- skb                    |      `- skb
  |         `- sk == C             |         `- sk == D
  |            `- sk_receive_queue |           `- sk_receive_queue
  |               `- skb +---------'               `- skb +--.
  |                                                          |
  `----------------------------------------------------------'

Technically, the graph must be denoted as A <-> B instead of A (-> D)
and B (-> C) to find such a cyclic reference without touching each
socket's receive queue.

  .-> A --. .-- B <-.
  |        X        |  ==  A <-> B
  `-- C <-' `-> D --'

We apply this fixup during GC by fetching the real successor by
unix_edge_successor().

When we call accept(), we clear unix_sock.listener under unix_gc_lock
not to confuse GC.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    |  2 +-
 net/unix/garbage.c    | 17 ++++++++++++++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index dc7469191195..414463803b7e 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -24,6 +24,7 @@ void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
 void unix_del_edges(struct scm_fp_list *fpl);
+void unix_update_edges(struct unix_sock *receiver);
 int unix_prepare_fpl(struct scm_fp_list *fpl);
 void unix_destroy_fpl(struct scm_fp_list *fpl);
 void unix_gc(void);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index af74e7ebc35a..ae77e2dc0dae 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1721,7 +1721,7 @@ static int unix_accept(struct socket *sock, struct socket *newsock, int flags,
 	}
 
 	tsk = skb->sk;
-	unix_sk(tsk)->listener = NULL;
+	unix_update_edges(unix_sk(tsk));
 	skb_free_datagram(sk, skb);
 	wake_up_interruptible(&unix_sk(sk)->peer_wait);
 
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 7e90663513f9..a21dbbbd9bc7 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -101,6 +101,14 @@ struct unix_sock *unix_get_socket(struct file *filp)
 	return NULL;
 }
 
+static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
+{
+	if (edge->successor->listener)
+		return unix_sk(edge->successor->listener)->vertex;
+
+	return edge->successor->vertex;
+}
+
 static LIST_HEAD(unix_unvisited_vertices);
 
 enum unix_vertex_index {
@@ -213,6 +221,13 @@ void unix_del_edges(struct scm_fp_list *fpl)
 	fpl->inflight = false;
 }
 
+void unix_update_edges(struct unix_sock *receiver)
+{
+	spin_lock(&unix_gc_lock);
+	receiver->listener = NULL;
+	spin_unlock(&unix_gc_lock);
+}
+
 int unix_prepare_fpl(struct scm_fp_list *fpl)
 {
 	struct unix_vertex *vertex;
@@ -268,7 +283,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 	list_add(&vertex->scc_entry, &vertex_stack);
 
 	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
-		struct unix_vertex *next_vertex = edge->successor->vertex;
+		struct unix_vertex *next_vertex = unix_edge_successor(edge);
 
 		if (!next_vertex)
 			continue;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 08/14] af_unix: Save O(n) setup of Tarjan's algo.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 07/14] af_unix: Fix up unix_edge.successor " Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 09/14] af_unix: Skip GC if no cycle exists Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Before starting Tarjan's algorithm, we need to mark all vertices
as unvisited.  We can save this O(n) setup by reserving two special
indices (0, 1) and using two variables.

The first time we link a vertex to unix_unvisited_vertices, we set
unix_vertex_unvisited_index to index.

During DFS, we can see that the index of unvisited vertices is the
same as unix_vertex_unvisited_index.

When we finalise SCC later, we set unix_vertex_grouped_index to each
vertex's index.

Then, we can know (i) that the vertex is on the stack if the index
of a visited vertex is >= 2 and (ii) that it is not on the stack and
belongs to a different SCC if the index is unix_vertex_grouped_index.

After the whole algorithm, all indices of vertices are set as
unix_vertex_grouped_index.

Next time we start DFS, we know that all unvisited vertices have
unix_vertex_grouped_index, and we can use unix_vertex_unvisited_index
as the not-on-stack marker.

To use the same variable in __unix_walk_scc(), we can swap
unix_vertex_(grouped|unvisited)_index at the end of Tarjan's
algorithm.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  1 -
 net/unix/garbage.c    | 22 ++++++++++++----------
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 414463803b7e..ec040caaa4b5 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -37,7 +37,6 @@ struct unix_vertex {
 	unsigned long out_degree;
 	unsigned long index;
 	unsigned long lowlink;
-	bool on_stack;
 };
 
 struct unix_edge {
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a21dbbbd9bc7..a2cd24ee953b 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -112,16 +112,20 @@ static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
 static LIST_HEAD(unix_unvisited_vertices);
 
 enum unix_vertex_index {
-	UNIX_VERTEX_INDEX_UNVISITED,
+	UNIX_VERTEX_INDEX_MARK1,
+	UNIX_VERTEX_INDEX_MARK2,
 	UNIX_VERTEX_INDEX_START,
 };
 
+static unsigned long unix_vertex_unvisited_index = UNIX_VERTEX_INDEX_MARK1;
+
 static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 {
 	struct unix_vertex *vertex = edge->predecessor->vertex;
 
 	if (!vertex) {
 		vertex = list_first_entry(&fpl->vertices, typeof(*vertex), entry);
+		vertex->index = unix_vertex_unvisited_index;
 		vertex->out_degree = 0;
 		INIT_LIST_HEAD(&vertex->edges);
 
@@ -266,6 +270,7 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 }
 
 static LIST_HEAD(unix_visited_vertices);
+static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
 
 static void __unix_walk_scc(struct unix_vertex *vertex)
 {
@@ -279,7 +284,6 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 	vertex->lowlink = index;
 	index++;
 
-	vertex->on_stack = true;
 	list_add(&vertex->scc_entry, &vertex_stack);
 
 	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
@@ -288,7 +292,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 		if (!next_vertex)
 			continue;
 
-		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
+		if (next_vertex->index == unix_vertex_unvisited_index) {
 			list_add(&edge->stack_entry, &edge_stack);
 
 			vertex = next_vertex;
@@ -302,7 +306,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			vertex = edge->predecessor->vertex;
 
 			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
-		} else if (edge->successor->vertex->on_stack) {
+		} else if (next_vertex->index != unix_vertex_grouped_index) {
 			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
 		}
 	}
@@ -315,7 +319,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
 			list_move_tail(&vertex->entry, &unix_visited_vertices);
 
-			vertex->on_stack = false;
+			vertex->index = unix_vertex_grouped_index;
 		}
 
 		list_del(&scc);
@@ -327,17 +331,15 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 static void unix_walk_scc(void)
 {
-	struct unix_vertex *vertex;
-
-	list_for_each_entry(vertex, &unix_unvisited_vertices, entry)
-		vertex->index = UNIX_VERTEX_INDEX_UNVISITED;
-
 	while (!list_empty(&unix_unvisited_vertices)) {
+		struct unix_vertex *vertex;
+
 		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
 		__unix_walk_scc(vertex);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
+	swap(unix_vertex_unvisited_index, unix_vertex_grouped_index);
 }
 
 static LIST_HEAD(gc_candidates);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 09/14] af_unix: Skip GC if no cycle exists.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 08/14] af_unix: Save O(n) setup of Tarjan's algo Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:39 ` [PATCH v3 net-next 10/14] af_unix: Avoid Tarjan's algorithm if unnecessary Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We do not need to run GC if there is no possible cyclic reference.
We use unix_graph_maybe_cyclic to decide if we should run GC.

If a fd of an AF_UNIX socket is passed to an already inflight AF_UNIX
socket, they could form a cyclic reference.  Then, we set true to
unix_graph_maybe_cyclic and later run Tarjan's algorithm to group
them into SCC.

Once we run Tarjan's algorithm, we are 100% sure whether cyclic
references exist or not.  If there is no cycle, we set false to
unix_graph_maybe_cyclic and can skip the entire garbage collection
next time.

When finalising SCC, we set true to unix_graph_maybe_cyclic if SCC
consists of multiple vertices.

Even if SCC is a single vertex, a cycle might exist as self-fd passing.
Given the corner case is rare, we detect it by checking all edges of
the vertex and set true to unix_graph_maybe_cyclic.

With this change, __unix_gc() is just a spin_lock() dance in the normal
usage.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a2cd24ee953b..28506b32dcb2 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -109,6 +109,19 @@ static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
 	return edge->successor->vertex;
 }
 
+static bool unix_graph_maybe_cyclic;
+
+static void unix_graph_update(struct unix_vertex *vertex)
+{
+	if (unix_graph_maybe_cyclic)
+		return;
+
+	if (!vertex)
+		return;
+
+	unix_graph_maybe_cyclic = true;
+}
+
 static LIST_HEAD(unix_unvisited_vertices);
 
 enum unix_vertex_index {
@@ -137,12 +150,14 @@ static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 	vertex->out_degree++;
 
 	list_add_tail(&edge->vertex_entry, &vertex->edges);
+	unix_graph_update(unix_edge_successor(edge));
 }
 
 static void unix_del_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 {
 	struct unix_vertex *vertex = edge->predecessor->vertex;
 
+	unix_graph_update(unix_edge_successor(edge));
 	list_del(&edge->vertex_entry);
 
 	vertex->out_degree--;
@@ -228,6 +243,7 @@ void unix_del_edges(struct scm_fp_list *fpl)
 void unix_update_edges(struct unix_sock *receiver)
 {
 	spin_lock(&unix_gc_lock);
+	unix_graph_update(unix_sk(receiver->listener)->vertex);
 	receiver->listener = NULL;
 	spin_unlock(&unix_gc_lock);
 }
@@ -269,6 +285,24 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
+static bool unix_scc_cyclic(struct list_head *scc)
+{
+	struct unix_vertex *vertex;
+	struct unix_edge *edge;
+
+	if (!list_is_singular(scc))
+		return true;
+
+	vertex = list_first_entry(scc, typeof(*vertex), scc_entry);
+
+	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
+		if (unix_edge_successor(edge) == vertex)
+			return true;
+	}
+
+	return false;
+}
+
 static LIST_HEAD(unix_visited_vertices);
 static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
 
@@ -322,6 +356,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			vertex->index = unix_vertex_grouped_index;
 		}
 
+		if (!unix_graph_maybe_cyclic)
+			unix_graph_maybe_cyclic = unix_scc_cyclic(&scc);
+
 		list_del(&scc);
 	}
 
@@ -331,6 +368,8 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 static void unix_walk_scc(void)
 {
+	unix_graph_maybe_cyclic = false;
+
 	while (!list_empty(&unix_unvisited_vertices)) {
 		struct unix_vertex *vertex;
 
@@ -489,6 +528,9 @@ static void __unix_gc(struct work_struct *work)
 
 	spin_lock(&unix_gc_lock);
 
+	if (!unix_graph_maybe_cyclic)
+		goto skip_gc;
+
 	unix_walk_scc();
 
 	/* First, select candidates for garbage collection.  Only
@@ -582,7 +624,7 @@ static void __unix_gc(struct work_struct *work)
 
 	/* All candidates should have been detached by now. */
 	WARN_ON_ONCE(!list_empty(&gc_candidates));
-
+skip_gc:
 	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
 	WRITE_ONCE(gc_in_progress, false);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 10/14] af_unix: Avoid Tarjan's algorithm if unnecessary.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 09/14] af_unix: Skip GC if no cycle exists Kuniyuki Iwashima
@ 2024-02-23 21:39 ` Kuniyuki Iwashima
  2024-02-23 21:40 ` [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Once a cyclic reference is formed, we need to run GC to check if
there is dead SCC.

However, we do not need to run Tarjan's algorithm if we know that
the shape of the inflight graph has not been changed.

If an edge is added/updated/deleted and the edge's successor is
inflight, we set false to unix_graph_grouped, which means we need
to re-classify SCC.

Once we finalise SCC, we set false to unix_graph_grouped.

While unix_graph_grouped is false, we can iterate the grouped
SCC using vertex->scc_entry in unix_walk_scc_fast().

list_add() and list_for_each_entry_reverse() uses seem weird, but
they are to keep the vertex order consistent and make writing test
easier.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 28506b32dcb2..1d9a0498dec5 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -110,6 +110,7 @@ static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
 }
 
 static bool unix_graph_maybe_cyclic;
+static bool unix_graph_grouped;
 
 static void unix_graph_update(struct unix_vertex *vertex)
 {
@@ -120,6 +121,7 @@ static void unix_graph_update(struct unix_vertex *vertex)
 		return;
 
 	unix_graph_maybe_cyclic = true;
+	unix_graph_grouped = false;
 }
 
 static LIST_HEAD(unix_unvisited_vertices);
@@ -141,6 +143,7 @@ static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 		vertex->index = unix_vertex_unvisited_index;
 		vertex->out_degree = 0;
 		INIT_LIST_HEAD(&vertex->edges);
+		INIT_LIST_HEAD(&vertex->scc_entry);
 
 		list_move_tail(&vertex->entry, &unix_unvisited_vertices);
 
@@ -379,6 +382,26 @@ static void unix_walk_scc(void)
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
 	swap(unix_vertex_unvisited_index, unix_vertex_grouped_index);
+
+	unix_graph_grouped = true;
+}
+
+static void unix_walk_scc_fast(void)
+{
+	while (!list_empty(&unix_unvisited_vertices)) {
+		struct unix_vertex *vertex;
+		struct list_head scc;
+
+		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
+		list_add(&scc, &vertex->scc_entry);
+
+		list_for_each_entry_reverse(vertex, &scc, scc_entry)
+			list_move_tail(&vertex->entry, &unix_visited_vertices);
+
+		list_del(&scc);
+	}
+
+	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
 }
 
 static LIST_HEAD(gc_candidates);
@@ -531,7 +554,10 @@ static void __unix_gc(struct work_struct *work)
 	if (!unix_graph_maybe_cyclic)
 		goto skip_gc;
 
-	unix_walk_scc();
+	if (unix_graph_grouped)
+		unix_walk_scc_fast();
+	else
+		unix_walk_scc();
 
 	/* First, select candidates for garbage collection.  Only
 	 * in-flight sockets are considered, and from those only ones
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2024-02-23 21:39 ` [PATCH v3 net-next 10/14] af_unix: Avoid Tarjan's algorithm if unnecessary Kuniyuki Iwashima
@ 2024-02-23 21:40 ` Kuniyuki Iwashima
  2024-02-27 11:19   ` Paolo Abeni
  2024-02-23 21:40 ` [PATCH v3 net-next 12/14] af_unix: Detect dead SCC Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:40 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The definition of the lowlink in Tarjan's algorithm is the
smallest index of a vertex that is reachable with at most one
back-edge in SCC.  This is not useful for a cross-edge.

If we start traversing from A in the following graph, the final
lowlink of D is 3.  The cross-edge here is one between D and C.

  A -> B -> D   D = (4, 3)  (index, lowlink)
  ^    |    |   C = (3, 1)
  |    V    |   B = (2, 1)
  `--- C <--'   A = (1, 1)

This is because the lowlink of D is updated with the index of C.

In the following patch, we detect a dead SCC by checking two
conditions for each vertex.

  1) vertex has no edge directed to another SCC (no bridge)
  2) vertex's out_degree is the same as the refcount of its file

If 1) is false, there is a receiver of all fds of the SCC and
its ancestor SCC.

To evaluate 1), we need to assign a unique index to each SCC and
assign it to all vertices in the SCC.

This patch changes the lowlink update logic for cross-edge so
that in the example above, the lowlink of D is updated with the
lowlink of C.

  A -> B -> D   D = (4, 1)  (index, lowlink)
  ^    |    |   C = (3, 1)
  |    V    |   B = (2, 1)
  `--- C <--'   A = (1, 1)

Then, all vertices in the same SCC have the same lowlink, and we
can quickly find the bridge connecting to different SCC if exists.

However, it is no longer called lowlink, so we rename it to
scc_index.  (It's sometimes called lowpoint.)

Also, we add a global variable to hold the last index used in DFS
so that we do not reset the initial index in each DFS.

This patch can be squashed to the SCC detection patch but is
split deliberately for anyone wondering why lowlink is not used
as used in the original Tarjan's algorithm and many reference
implementations.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c    | 15 ++++++++-------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index ec040caaa4b5..696d997a5ac9 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -36,7 +36,7 @@ struct unix_vertex {
 	struct list_head scc_entry;
 	unsigned long out_degree;
 	unsigned long index;
-	unsigned long lowlink;
+	unsigned long scc_index;
 };
 
 struct unix_edge {
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 1d9a0498dec5..0eb1610c96d7 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -308,18 +308,18 @@ static bool unix_scc_cyclic(struct list_head *scc)
 
 static LIST_HEAD(unix_visited_vertices);
 static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
+static unsigned long unix_vertex_last_index = UNIX_VERTEX_INDEX_START;
 
 static void __unix_walk_scc(struct unix_vertex *vertex)
 {
-	unsigned long index = UNIX_VERTEX_INDEX_START;
 	LIST_HEAD(vertex_stack);
 	struct unix_edge *edge;
 	LIST_HEAD(edge_stack);
 
 next_vertex:
-	vertex->index = index;
-	vertex->lowlink = index;
-	index++;
+	vertex->index = unix_vertex_last_index;
+	vertex->scc_index = unix_vertex_last_index;
+	unix_vertex_last_index++;
 
 	list_add(&vertex->scc_entry, &vertex_stack);
 
@@ -342,13 +342,13 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 			vertex = edge->predecessor->vertex;
 
-			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
+			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
 		} else if (next_vertex->index != unix_vertex_grouped_index) {
-			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
+			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
 		}
 	}
 
-	if (vertex->index == vertex->lowlink) {
+	if (vertex->index == vertex->scc_index) {
 		struct list_head scc;
 
 		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
@@ -371,6 +371,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 static void unix_walk_scc(void)
 {
+	unix_vertex_last_index = UNIX_VERTEX_INDEX_START;
 	unix_graph_maybe_cyclic = false;
 
 	while (!list_empty(&unix_unvisited_vertices)) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 12/14] af_unix: Detect dead SCC.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2024-02-23 21:40 ` [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
@ 2024-02-23 21:40 ` Kuniyuki Iwashima
  2024-02-27 11:25   ` Paolo Abeni
  2024-02-23 21:40 ` [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm Kuniyuki Iwashima
  2024-02-23 21:40 ` [PATCH v3 net-next 14/14] selftest: af_unix: Test GC for SCM_RIGHTS Kuniyuki Iwashima
  13 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:40 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

When iterating SCC, we call unix_vertex_dead() for each vertex
to check if the vertex is close()d and has no bridge to another
SCC.

If both conditions are true for every vertex in SCC, we can
execute garbage collection for all skb in the SCC.

The actual garbage collection is done in the following patch,
replacing the old implementation.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 0eb1610c96d7..060e81be3614 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -288,6 +288,32 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
+static bool unix_vertex_dead(struct unix_vertex *vertex)
+{
+	struct unix_edge *edge;
+	struct unix_sock *u;
+	long total_ref;
+
+	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
+		struct unix_vertex *next_vertex = unix_edge_successor(edge);
+
+		if (!next_vertex)
+			return false;
+
+		if (next_vertex->scc_index != vertex->scc_index)
+			return false;
+	}
+
+	edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
+	u = edge->predecessor;
+	total_ref = file_count(u->sk.sk_socket->file);
+
+	if (total_ref != vertex->out_degree)
+		return false;
+
+	return true;
+}
+
 static bool unix_scc_cyclic(struct list_head *scc)
 {
 	struct unix_vertex *vertex;
@@ -350,6 +376,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 	if (vertex->index == vertex->scc_index) {
 		struct list_head scc;
+		bool dead = true;
 
 		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
 
@@ -357,6 +384,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			list_move_tail(&vertex->entry, &unix_visited_vertices);
 
 			vertex->index = unix_vertex_grouped_index;
+
+			if (dead)
+				dead = unix_vertex_dead(vertex);
 		}
 
 		if (!unix_graph_maybe_cyclic)
@@ -392,13 +422,18 @@ static void unix_walk_scc_fast(void)
 	while (!list_empty(&unix_unvisited_vertices)) {
 		struct unix_vertex *vertex;
 		struct list_head scc;
+		bool dead = true;
 
 		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
 		list_add(&scc, &vertex->scc_entry);
 
-		list_for_each_entry_reverse(vertex, &scc, scc_entry)
+		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
 			list_move_tail(&vertex->entry, &unix_visited_vertices);
 
+			if (dead)
+				dead = unix_vertex_dead(vertex);
+		}
+
 		list_del(&scc);
 	}
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (11 preceding siblings ...)
  2024-02-23 21:40 ` [PATCH v3 net-next 12/14] af_unix: Detect dead SCC Kuniyuki Iwashima
@ 2024-02-23 21:40 ` Kuniyuki Iwashima
  2024-02-27 11:36   ` Paolo Abeni
  2024-02-23 21:40 ` [PATCH v3 net-next 14/14] selftest: af_unix: Test GC for SCM_RIGHTS Kuniyuki Iwashima
  13 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:40 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

If we find a dead SCC during iteration, we call unix_collect_skb()
to splice all skb in the SCC to the global sk_buff_head, hitlist.

After iterating all SCC, we unlock unix_gc_lock and purge the queue.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |   8 --
 net/unix/af_unix.c    |  12 --
 net/unix/garbage.c    | 287 ++++++++----------------------------------
 3 files changed, 53 insertions(+), 254 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 696d997a5ac9..226a8da2cbe3 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -19,9 +19,6 @@ static inline struct unix_sock *unix_get_socket(struct file *filp)
 
 extern spinlock_t unix_gc_lock;
 extern unsigned int unix_tot_inflight;
-
-void unix_inflight(struct user_struct *user, struct file *fp);
-void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
 void unix_del_edges(struct scm_fp_list *fpl);
 void unix_update_edges(struct unix_sock *receiver);
@@ -85,12 +82,7 @@ struct unix_sock {
 	struct sock		*peer;
 	struct sock		*listener;
 	struct unix_vertex	*vertex;
-	struct list_head	link;
-	unsigned long		inflight;
 	spinlock_t		lock;
-	unsigned long		gc_flags;
-#define UNIX_GC_CANDIDATE	0
-#define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
 	wait_queue_entry_t	peer_wake;
 	struct scm_stat		scm_stat;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index ae77e2dc0dae..27ca50ab1cd1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -980,12 +980,10 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_destruct		= unix_sock_destructor;
 	u = unix_sk(sk);
 	u->listener = NULL;
-	u->inflight = 0;
 	u->vertex = NULL;
 	u->path.dentry = NULL;
 	u->path.mnt = NULL;
 	spin_lock_init(&u->lock);
-	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->iolock); /* single task reading lock */
 	mutex_init(&u->bindlock); /* single task binding lock */
 	init_waitqueue_head(&u->peer_wait);
@@ -1793,8 +1791,6 @@ static inline bool too_many_unix_fds(struct task_struct *p)
 
 static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 {
-	int i;
-
 	if (too_many_unix_fds(current))
 		return -ETOOMANYREFS;
 
@@ -1806,9 +1802,6 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	if (!UNIXCB(skb).fp)
 		return -ENOMEM;
 
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_inflight(scm->fp->user, scm->fp->fp[i]);
-
 	if (unix_prepare_fpl(UNIXCB(skb).fp))
 		return -ENOMEM;
 
@@ -1817,15 +1810,10 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 
 static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 {
-	int i;
-
 	scm->fp = UNIXCB(skb).fp;
 	UNIXCB(skb).fp = NULL;
 
 	unix_destroy_fpl(scm->fp);
-
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
 }
 
 static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 060e81be3614..59a87a997a4d 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -314,6 +314,48 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
 	return true;
 }
 
+static struct sk_buff_head hitlist;
+
+static void unix_collect_skb(struct list_head *scc)
+{
+	struct unix_vertex *vertex;
+
+	list_for_each_entry_reverse(vertex, scc, scc_entry) {
+		struct sk_buff_head *queue;
+		struct unix_edge *edge;
+		struct unix_sock *u;
+
+		edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
+		u = edge->predecessor;
+		queue = &u->sk.sk_receive_queue;
+
+		spin_lock(&queue->lock);
+
+		if (u->sk.sk_state == TCP_LISTEN) {
+			struct sk_buff *skb;
+
+			skb_queue_walk(queue, skb) {
+				struct sk_buff_head *embryo_queue = &skb->sk->sk_receive_queue;
+
+				spin_lock(&embryo_queue->lock);
+				skb_queue_splice_init(embryo_queue, &hitlist);
+				spin_unlock(&embryo_queue->lock);
+			}
+		} else {
+			skb_queue_splice_init(queue, &hitlist);
+
+#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+			if (u->oob_skb) {
+				kfree_skb(u->oob_skb);
+				u->oob_skb = NULL;
+			}
+#endif
+		}
+
+		spin_unlock(&queue->lock);
+	}
+}
+
 static bool unix_scc_cyclic(struct list_head *scc)
 {
 	struct unix_vertex *vertex;
@@ -389,7 +431,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 				dead = unix_vertex_dead(vertex);
 		}
 
-		if (!unix_graph_maybe_cyclic)
+		if (dead)
+			unix_collect_skb(&scc);
+		else if (!unix_graph_maybe_cyclic)
 			unix_graph_maybe_cyclic = unix_scc_cyclic(&scc);
 
 		list_del(&scc);
@@ -434,263 +478,38 @@ static void unix_walk_scc_fast(void)
 				dead = unix_vertex_dead(vertex);
 		}
 
+		if (dead)
+			unix_collect_skb(&scc);
+
 		list_del(&scc);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
 }
 
-static LIST_HEAD(gc_candidates);
-static LIST_HEAD(gc_inflight_list);
-
-/* Keep the number of times in flight count for the file
- * descriptor if it is for an AF_UNIX socket.
- */
-void unix_inflight(struct user_struct *user, struct file *filp)
-{
-	struct unix_sock *u = unix_get_socket(filp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (u) {
-		if (!u->inflight) {
-			WARN_ON_ONCE(!list_empty(&u->link));
-			list_add_tail(&u->link, &gc_inflight_list);
-		} else {
-			WARN_ON_ONCE(list_empty(&u->link));
-		}
-		u->inflight++;
-	}
-
-	spin_unlock(&unix_gc_lock);
-}
-
-void unix_notinflight(struct user_struct *user, struct file *filp)
-{
-	struct unix_sock *u = unix_get_socket(filp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (u) {
-		WARN_ON_ONCE(!u->inflight);
-		WARN_ON_ONCE(list_empty(&u->link));
-
-		u->inflight--;
-		if (!u->inflight)
-			list_del_init(&u->link);
-	}
-
-	spin_unlock(&unix_gc_lock);
-}
-
-static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
-			  struct sk_buff_head *hitlist)
-{
-	struct sk_buff *skb;
-	struct sk_buff *next;
-
-	spin_lock(&x->sk_receive_queue.lock);
-	skb_queue_walk_safe(&x->sk_receive_queue, skb, next) {
-		/* Do we have file descriptors ? */
-		if (UNIXCB(skb).fp) {
-			bool hit = false;
-			/* Process the descriptors of this socket */
-			int nfd = UNIXCB(skb).fp->count;
-			struct file **fp = UNIXCB(skb).fp->fp;
-
-			while (nfd--) {
-				/* Get the socket the fd matches if it indeed does so */
-				struct unix_sock *u = unix_get_socket(*fp++);
-
-				/* Ignore non-candidates, they could have been added
-				 * to the queues after starting the garbage collection
-				 */
-				if (u && test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
-					hit = true;
-
-					func(u);
-				}
-			}
-			if (hit && hitlist != NULL) {
-				__skb_unlink(skb, &x->sk_receive_queue);
-				__skb_queue_tail(hitlist, skb);
-			}
-		}
-	}
-	spin_unlock(&x->sk_receive_queue.lock);
-}
-
-static void scan_children(struct sock *x, void (*func)(struct unix_sock *),
-			  struct sk_buff_head *hitlist)
-{
-	if (x->sk_state != TCP_LISTEN) {
-		scan_inflight(x, func, hitlist);
-	} else {
-		struct sk_buff *skb;
-		struct sk_buff *next;
-		struct unix_sock *u;
-		LIST_HEAD(embryos);
-
-		/* For a listening socket collect the queued embryos
-		 * and perform a scan on them as well.
-		 */
-		spin_lock(&x->sk_receive_queue.lock);
-		skb_queue_walk_safe(&x->sk_receive_queue, skb, next) {
-			u = unix_sk(skb->sk);
-
-			/* An embryo cannot be in-flight, so it's safe
-			 * to use the list link.
-			 */
-			WARN_ON_ONCE(!list_empty(&u->link));
-			list_add_tail(&u->link, &embryos);
-		}
-		spin_unlock(&x->sk_receive_queue.lock);
-
-		while (!list_empty(&embryos)) {
-			u = list_entry(embryos.next, struct unix_sock, link);
-			scan_inflight(&u->sk, func, hitlist);
-			list_del_init(&u->link);
-		}
-	}
-}
-
-static void dec_inflight(struct unix_sock *usk)
-{
-	usk->inflight--;
-}
-
-static void inc_inflight(struct unix_sock *usk)
-{
-	usk->inflight++;
-}
-
-static void inc_inflight_move_tail(struct unix_sock *u)
-{
-	u->inflight++;
-
-	/* If this still might be part of a cycle, move it to the end
-	 * of the list, so that it's checked even if it was already
-	 * passed over
-	 */
-	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
-		list_move_tail(&u->link, &gc_candidates);
-}
-
 static bool gc_in_progress;
 
 static void __unix_gc(struct work_struct *work)
 {
-	struct sk_buff_head hitlist;
-	struct unix_sock *u, *next;
-	LIST_HEAD(not_cycle_list);
-	struct list_head cursor;
-
 	spin_lock(&unix_gc_lock);
 
-	if (!unix_graph_maybe_cyclic)
+	if (!unix_graph_maybe_cyclic) {
+		spin_unlock(&unix_gc_lock);
 		goto skip_gc;
+	}
+
+	__skb_queue_head_init(&hitlist);
 
 	if (unix_graph_grouped)
 		unix_walk_scc_fast();
 	else
 		unix_walk_scc();
 
-	/* First, select candidates for garbage collection.  Only
-	 * in-flight sockets are considered, and from those only ones
-	 * which don't have any external reference.
-	 *
-	 * Holding unix_gc_lock will protect these candidates from
-	 * being detached, and hence from gaining an external
-	 * reference.  Since there are no possible receivers, all
-	 * buffers currently on the candidates' queues stay there
-	 * during the garbage collection.
-	 *
-	 * We also know that no new candidate can be added onto the
-	 * receive queues.  Other, non candidate sockets _can_ be
-	 * added to queue, so we must make sure only to touch
-	 * candidates.
-	 */
-	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
-		long total_refs;
-
-		total_refs = file_count(u->sk.sk_socket->file);
-
-		WARN_ON_ONCE(!u->inflight);
-		WARN_ON_ONCE(total_refs < u->inflight);
-		if (total_refs == u->inflight) {
-			list_move_tail(&u->link, &gc_candidates);
-			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
-		}
-	}
-
-	/* Now remove all internal in-flight reference to children of
-	 * the candidates.
-	 */
-	list_for_each_entry(u, &gc_candidates, link)
-		scan_children(&u->sk, dec_inflight, NULL);
-
-	/* Restore the references for children of all candidates,
-	 * which have remaining references.  Do this recursively, so
-	 * only those remain, which form cyclic references.
-	 *
-	 * Use a "cursor" link, to make the list traversal safe, even
-	 * though elements might be moved about.
-	 */
-	list_add(&cursor, &gc_candidates);
-	while (cursor.next != &gc_candidates) {
-		u = list_entry(cursor.next, struct unix_sock, link);
-
-		/* Move cursor to after the current position. */
-		list_move(&cursor, &u->link);
-
-		if (u->inflight) {
-			list_move_tail(&u->link, &not_cycle_list);
-			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
-			scan_children(&u->sk, inc_inflight_move_tail, NULL);
-		}
-	}
-	list_del(&cursor);
-
-	/* Now gc_candidates contains only garbage.  Restore original
-	 * inflight counters for these as well, and remove the skbuffs
-	 * which are creating the cycle(s).
-	 */
-	skb_queue_head_init(&hitlist);
-	list_for_each_entry(u, &gc_candidates, link) {
-		scan_children(&u->sk, inc_inflight, &hitlist);
-
-#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
-		if (u->oob_skb) {
-			kfree_skb(u->oob_skb);
-			u->oob_skb = NULL;
-		}
-#endif
-	}
-
-	/* not_cycle_list contains those sockets which do not make up a
-	 * cycle.  Restore these to the inflight list.
-	 */
-	while (!list_empty(&not_cycle_list)) {
-		u = list_entry(not_cycle_list.next, struct unix_sock, link);
-		__clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-		list_move_tail(&u->link, &gc_inflight_list);
-	}
-
 	spin_unlock(&unix_gc_lock);
 
-	/* Here we are. Hitlist is filled. Die. */
 	__skb_queue_purge(&hitlist);
-
-	spin_lock(&unix_gc_lock);
-
-	/* All candidates should have been detached by now. */
-	WARN_ON_ONCE(!list_empty(&gc_candidates));
 skip_gc:
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
 	WRITE_ONCE(gc_in_progress, false);
-
-	spin_unlock(&unix_gc_lock);
 }
 
 static DECLARE_WORK(unix_gc_work, __unix_gc);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 net-next 14/14] selftest: af_unix: Test GC for SCM_RIGHTS.
  2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
                   ` (12 preceding siblings ...)
  2024-02-23 21:40 ` [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm Kuniyuki Iwashima
@ 2024-02-23 21:40 ` Kuniyuki Iwashima
  13 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-23 21:40 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

This patch adds test cases to verify the new GC.

We run each test for the following cases:

  * SOCK_DGRAM
  * SOCK_STREAM without embryo socket
  * SOCK_STREAM without embryo socket + MSG_OOB
  * SOCK_STREAM with embryo sockets
  * SOCK_STREAM with embryo sockets + MSG_OOB

Before and after running each test case, we ensure that there is
no AF_UNIX socket left in the netns by reading /proc/net/protocols.

We cannot use /proc/net/unix and UNIX_DIAG because the embryo socket
does not show up there.

Each test creates multiple sockets in an array.  We pass sockets in
the even index using the peer sockets in the odd index.

So, send_fd(0, 1) actually sends fd[0] to fd[2] via fd[0 + 1].

  Test 1 : A <-> A
  Test 2 : A <-> B
  Test 3 : A -> B -> C <- D
           ^.___|___.'    ^
                `---------'

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 tools/testing/selftests/net/.gitignore        |   1 +
 tools/testing/selftests/net/af_unix/Makefile  |   2 +-
 .../selftests/net/af_unix/scm_rights.c        | 286 ++++++++++++++++++
 3 files changed, 288 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 2f9d378edec3..d996a0ab0765 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -31,6 +31,7 @@ reuseport_dualstack
 rxtimestamp
 sctp_hello
 scm_pidfd
+scm_rights
 sk_bind_sendto_listen
 sk_connect_zero_addr
 socket
diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile
index 221c387a7d7f..3b83c797650d 100644
--- a/tools/testing/selftests/net/af_unix/Makefile
+++ b/tools/testing/selftests/net/af_unix/Makefile
@@ -1,4 +1,4 @@
 CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd
+TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd scm_rights
 
 include ../../lib.mk
diff --git a/tools/testing/selftests/net/af_unix/scm_rights.c b/tools/testing/selftests/net/af_unix/scm_rights.c
new file mode 100644
index 000000000000..bab606c9f1eb
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+
+#include "../../kselftest_harness.h"
+
+FIXTURE(scm_rights)
+{
+	int fd[16];
+};
+
+FIXTURE_VARIANT(scm_rights)
+{
+	char name[16];
+	int type;
+	int flags;
+	bool test_listener;
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, dgram)
+{
+	.name = "UNIX ",
+	.type = SOCK_DGRAM,
+	.flags = 0,
+	.test_listener = false,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = 0,
+	.test_listener = false,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream_oob)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = MSG_OOB,
+	.test_listener = false,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream_listener)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = 0,
+	.test_listener = true,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream_listener_oob)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = MSG_OOB,
+	.test_listener = true,
+};
+
+static int count_sockets(struct __test_metadata *_metadata,
+			 const FIXTURE_VARIANT(scm_rights) *variant)
+{
+	int sockets = -1, len, ret;
+	char *line = NULL;
+	size_t unused;
+	FILE *f;
+
+	f = fopen("/proc/net/protocols", "r");
+	ASSERT_NE(NULL, f);
+
+	len = strlen(variant->name);
+
+	while (getline(&line, &unused, f) != -1) {
+		int unused2;
+
+		if (strncmp(line, variant->name, len))
+			continue;
+
+		ret = sscanf(line + len, "%d %d", &unused2, &sockets);
+		ASSERT_EQ(2, ret);
+
+		break;
+	}
+
+	free(line);
+
+	ret = fclose(f);
+	ASSERT_EQ(0, ret);
+
+	return sockets;
+}
+
+FIXTURE_SETUP(scm_rights)
+{
+	int ret;
+
+	ret = unshare(CLONE_NEWNET);
+	ASSERT_EQ(0, ret);
+
+	ret = count_sockets(_metadata, variant);
+	ASSERT_EQ(0, ret);
+}
+
+FIXTURE_TEARDOWN(scm_rights)
+{
+	int ret;
+
+	sleep(1);
+
+	ret = count_sockets(_metadata, variant);
+	ASSERT_EQ(0, ret);
+}
+
+static void create_listeners(struct __test_metadata *_metadata,
+			     FIXTURE_DATA(scm_rights) *self,
+			     int n)
+{
+	struct sockaddr_un addr = {
+		.sun_family = AF_UNIX,
+	};
+	socklen_t addrlen;
+	int i, ret;
+
+	for (i = 0; i < n * 2; i += 2) {
+		self->fd[i] = socket(AF_UNIX, SOCK_STREAM, 0);
+		ASSERT_LE(0, self->fd[i]);
+
+		addrlen = sizeof(addr.sun_family);
+		ret = bind(self->fd[i], (struct sockaddr *)&addr, addrlen);
+		ASSERT_EQ(0, ret);
+
+		ret = listen(self->fd[i], -1);
+		ASSERT_EQ(0, ret);
+
+		addrlen = sizeof(addr);
+		ret = getsockname(self->fd[i], (struct sockaddr *)&addr, &addrlen);
+		ASSERT_EQ(0, ret);
+
+		self->fd[i + 1] = socket(AF_UNIX, SOCK_STREAM, 0);
+		ASSERT_LE(0, self->fd[i + 1]);
+
+		ret = connect(self->fd[i + 1], (struct sockaddr *)&addr, addrlen);
+		ASSERT_EQ(0, ret);
+	}
+}
+
+static void create_socketpairs(struct __test_metadata *_metadata,
+			       FIXTURE_DATA(scm_rights) *self,
+			       const FIXTURE_VARIANT(scm_rights) *variant,
+			       int n)
+{
+	int i, ret;
+
+	ASSERT_GE(sizeof(self->fd) / sizeof(int), n);
+
+	for (i = 0; i < n * 2; i += 2) {
+		ret = socketpair(AF_UNIX, variant->type, 0, self->fd + i);
+		ASSERT_EQ(0, ret);
+	}
+}
+
+static void __create_sockets(struct __test_metadata *_metadata,
+			     FIXTURE_DATA(scm_rights) *self,
+			     const FIXTURE_VARIANT(scm_rights) *variant,
+			     int n)
+{
+	if (variant->test_listener)
+		create_listeners(_metadata, self, n);
+	else
+		create_socketpairs(_metadata, self, variant, n);
+}
+
+static void __close_sockets(struct __test_metadata *_metadata,
+			    FIXTURE_DATA(scm_rights) *self,
+			    int n)
+{
+	int i, ret;
+
+	ASSERT_GE(sizeof(self->fd) / sizeof(int), n);
+
+	for (i = 0; i < n * 2; i++) {
+		ret = close(self->fd[i]);
+		ASSERT_EQ(0, ret);
+	}
+}
+
+void __send_fd(struct __test_metadata *_metadata,
+	       const FIXTURE_DATA(scm_rights) *self,
+	       const FIXTURE_VARIANT(scm_rights) *variant,
+	       int inflight, int receiver)
+{
+#define MSG "nop"
+#define MSGLEN 3
+	struct {
+		struct cmsghdr cmsghdr;
+		int fd[2];
+	} cmsg = {
+		.cmsghdr = {
+			.cmsg_len = CMSG_LEN(sizeof(cmsg.fd)),
+			.cmsg_level = SOL_SOCKET,
+			.cmsg_type = SCM_RIGHTS,
+		},
+		.fd = {
+			self->fd[inflight * 2],
+			self->fd[inflight * 2],
+		},
+	};
+	struct iovec iov = {
+		.iov_base = MSG,
+		.iov_len = MSGLEN,
+	};
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = &cmsg,
+		.msg_controllen = CMSG_SPACE(sizeof(cmsg.fd)),
+	};
+	int ret;
+
+	ret = sendmsg(self->fd[receiver * 2 + 1], &msg, variant->flags);
+	ASSERT_EQ(MSGLEN, ret);
+}
+
+#define create_sockets(n)					\
+	__create_sockets(_metadata, self, variant, n)
+#define close_sockets(n)					\
+	__close_sockets(_metadata, self, n)
+#define send_fd(inflight, receiver)				\
+	__send_fd(_metadata, self, variant, inflight, receiver)
+
+TEST_F(scm_rights, self_ref)
+{
+	create_sockets(2);
+
+	send_fd(0, 0);
+
+	send_fd(1, 1);
+
+	close_sockets(2);
+}
+
+TEST_F(scm_rights, triangle)
+{
+	create_sockets(6);
+
+	send_fd(0, 1);
+	send_fd(1, 2);
+	send_fd(2, 0);
+
+	send_fd(3, 4);
+	send_fd(4, 5);
+	send_fd(5, 3);
+
+	close_sockets(6);
+}
+
+TEST_F(scm_rights, cross_edge)
+{
+	create_sockets(8);
+
+	send_fd(0, 1);
+	send_fd(1, 2);
+	send_fd(2, 0);
+	send_fd(1, 3);
+	send_fd(3, 2);
+
+	send_fd(4, 5);
+	send_fd(5, 6);
+	send_fd(6, 4);
+	send_fd(5, 7);
+	send_fd(7, 6);
+
+	close_sockets(8);
+}
+
+TEST_HARNESS_MAIN
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components.
  2024-02-23 21:39 ` [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
@ 2024-02-25  0:34   ` Jakub Kicinski
  2024-02-26 19:07     ` Kuniyuki Iwashima
  2024-02-27 11:02   ` Paolo Abeni
  1 sibling, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2024-02-25  0:34 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Kuniyuki Iwashima, netdev

On Fri, 23 Feb 2024 13:39:54 -0800 Kuniyuki Iwashima wrote:
> +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> +		struct unix_vertex *next_vertex = edge->successor->vertex;
> +
> +		if (!next_vertex)
> +			continue;
> +
> +		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
> +			list_add(&edge->stack_entry, &edge_stack);
> +
> +			vertex = next_vertex;
> +			goto next_vertex;
> +prev_vertex:
> +			next_vertex = vertex;
> +
> +			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
> +			list_del_init(&edge->stack_entry);
> +
> +			vertex = edge->predecessor->vertex;
> +
> +			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> +		} else if (edge->successor->vertex->on_stack) {
> +			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
> +		}
> +	}
> +
> +	if (vertex->index == vertex->lowlink) {
> +		struct list_head scc;
> +
> +		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
> +
> +		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
> +			list_move_tail(&vertex->entry, &unix_visited_vertices);
> +
> +			vertex->on_stack = false;
> +		}
> +
> +		list_del(&scc);
> +	}
> +
> +	if (!list_empty(&edge_stack))
> +		goto prev_vertex;

coccicheck says:

net/unix/garbage.c:406:17-23: ERROR: invalid reference to the index variable of the iterator on line 425

this code looks way to complicated to untangle on a quick weekend scan,
so please LMK if this is a false positive, I'll hide the patches from
patchwork for now ;)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components.
  2024-02-25  0:34   ` Jakub Kicinski
@ 2024-02-26 19:07     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-26 19:07 UTC (permalink / raw)
  To: kuba; +Cc: davem, edumazet, kuni1840, kuniyu, netdev, pabeni

From: Jakub Kicinski <kuba@kernel.org>
Date: Sat, 24 Feb 2024 16:34:30 -0800
> On Fri, 23 Feb 2024 13:39:54 -0800 Kuniyuki Iwashima wrote:
> > +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> > +		struct unix_vertex *next_vertex = edge->successor->vertex;
> > +
> > +		if (!next_vertex)
> > +			continue;
> > +
> > +		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
> > +			list_add(&edge->stack_entry, &edge_stack);
> > +
> > +			vertex = next_vertex;
> > +			goto next_vertex;
> > +prev_vertex:
> > +			next_vertex = vertex;
> > +
> > +			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
> > +			list_del_init(&edge->stack_entry);
> > +
> > +			vertex = edge->predecessor->vertex;
> > +
> > +			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> > +		} else if (edge->successor->vertex->on_stack) {
> > +			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
> > +		}
> > +	}
> > +
> > +	if (vertex->index == vertex->lowlink) {
> > +		struct list_head scc;
> > +
> > +		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
> > +
> > +		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
> > +			list_move_tail(&vertex->entry, &unix_visited_vertices);
> > +
> > +			vertex->on_stack = false;
> > +		}
> > +
> > +		list_del(&scc);
> > +	}
> > +
> > +	if (!list_empty(&edge_stack))
> > +		goto prev_vertex;
> 
> coccicheck says:
> 
> net/unix/garbage.c:406:17-23: ERROR: invalid reference to the index variable of the iterator on line 425
> 
> this code looks way to complicated to untangle on a quick weekend scan,
> so please LMK if this is a false positive, I'll hide the patches from
> patchwork for now ;)

Yeah, I think it's false positive :)

The code above implements recursion withtout nesting call stack
and instead uses goto jump and restores the previous in-loop
variables there.

  __unix_walk_scc(struct unix_vertex *vertex)
  {
    ...
    list_for_each_entry(edge, &vertex->edges, vertex_entry) {
      if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
        __unix_walk_scc(next_vertex);

        ^-- This is rewritten with goto next_vertex & prev_vertex.

        vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
      } else {
        vertex->lowlink = min(vertex->lowlink, next_vertex->index);
      }
    }
    ...

Here's the original recursive pseudocode.
https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.
  2024-02-23 21:39 ` [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight " Kuniyuki Iwashima
@ 2024-02-27 10:47   ` Paolo Abeni
  2024-02-28  2:34     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Abeni @ 2024-02-27 10:47 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Fri, 2024-02-23 at 13:39 -0800, Kuniyuki Iwashima wrote:
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 96d0b1db3638..e8fe08796d02 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -148,6 +148,7 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
>  }
>  
>  DEFINE_SPINLOCK(unix_gc_lock);
> +unsigned int unix_tot_inflight;
>  
>  void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
>  {
> @@ -172,7 +173,10 @@ void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
>  		unix_add_edge(fpl, edge);
>  	} while (i < fpl->count_unix);
>  
> +	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + fpl->count_unix);
>  out:
> +	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight + fpl->count);

I'm unsure if later patches will shed some light, but why the above
statement is placed _after_ the 'out' label? fpl->count will be 0 in
such path, and the updated not needed. Why don't you place it before
the mentioned label?

> +
>  	spin_unlock(&unix_gc_lock);
>  
>  	fpl->inflight = true;
> @@ -195,7 +199,10 @@ void unix_del_edges(struct scm_fp_list *fpl)
>  		unix_del_edge(fpl, edge);
>  	} while (i < fpl->count_unix);
>  
> +	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - fpl->count_unix);
>  out:
> +	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight - fpl->count);

Same question here.

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components.
  2024-02-23 21:39 ` [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
  2024-02-25  0:34   ` Jakub Kicinski
@ 2024-02-27 11:02   ` Paolo Abeni
  2024-02-28  2:49     ` Kuniyuki Iwashima
  1 sibling, 1 reply; 33+ messages in thread
From: Paolo Abeni @ 2024-02-27 11:02 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Fri, 2024-02-23 at 13:39 -0800, Kuniyuki Iwashima wrote:
> In the new GC, we use a simple graph algorithm, Tarjan's Strongly
> Connected Components (SCC) algorithm, to find cyclic references.
> 
> The algorithm visits every vertex exactly once using depth-first
> search (DFS).  We implement it without recursion so that no one
> can abuse it.
> 
> There could be multiple graphs, so we iterate unix_unvisited_vertices
> in unix_walk_scc() and do DFS in __unix_walk_scc(), where we move
> visited vertices to another list, unix_visited_vertices, not to
> restart DFS twice on a visited vertex later in unix_walk_scc().
> 
> DFS starts by pushing an input vertex to a stack and assigning it
> a unique number.  Two fields, index and lowlink, are initialised
> with the number, but lowlink could be updated later during DFS.
> 
> If a vertex has an edge to an unvisited inflight vertex, we visit
> it and do the same processing.  So, we will have vertices in the
> stack in the order they appear and number them consecutively in
> the same order.
> 
> If a vertex has a back-edge to a visited vertex in the stack,
> we update the predecessor's lowlink with the successor's index.
> 
> After iterating edges from the vertex, we check if its index
> equals its lowlink.
> 
> If the lowlink is different from the index, it shows there was a
> back-edge.  Then, we propagate the lowlink to its predecessor and
> go back to the predecessor to resume checking from the next edge
> of the back-edge.
> 
> If the lowlink is the same as the index, we pop vertices before
> and including the vertex from the stack.  Then, the set of vertices
> is SCC, possibly forming a cycle.  At the same time, we move the
> vertices to unix_visited_vertices.
> 
> When we finish the algorithm, all vertices in each SCC will be
> linked via unix_vertex.scc_entry.
> 
> Let's take an example.  We have a graph including five inflight
> vertices (F is not inflight):
> 
>   A -> B -> C -> D -> E (-> F)
>        ^         |
>        `---------'
> 
> Suppose that we start DFS from C.  We will visit C, D, and B first
> and initialise their index and lowlink.  Then, the stack looks like
> this:
> 
>   > B = (3, 3)  (index, lowlink)
>     D = (2, 2)
>     C = (1, 1)
> 
> When checking B's edge to C, we update B's lowlink with C's index
> and propagate it to D.
> 
>     B = (3, 1)  (index, lowlink)
>   > D = (2, 1)
>     C = (1, 1)
> 
> Next, we visit E, which has no edge to an inflight vertex.
> 
>   > E = (4, 4)  (index, lowlink)
>     B = (3, 1)
>     D = (2, 1)
>     C = (1, 1)
> 
> When we leave from E, its index and lowlink are the same, so we
> pop E from the stack as single-vertex SCC.  Next, we leave from
> D but do nothing because its lowlink is different from its index.
> 
>     B = (3, 1)  (index, lowlink)
>     D = (2, 1)
>   > C = (1, 1)
> 
> Then, we leave from C, whose index and lowlink are the same, so
> we pop B, D and C as SCC.
> 
> Last, we do DFS for the rest of vertices, A, which is also a
> single-vertex SCC.
> 
> Finally, each unix_vertex.scc_entry is linked as follows:
> 
>   A -.  B -> C -> D  E -.
>   ^  |  ^         |  ^  |
>   `--'  `---------'  `--'
> 
> We use SCC later to decide whether we can garbage-collect the
> sockets.
> 
> Note that we still cannot detect SCC properly if an edge points
> to an embryo socket.  The following two patches will sort it out.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/af_unix.h |  5 +++
>  net/unix/garbage.c    | 82 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 87 insertions(+)
> 
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index f31ad1166346..67736767b616 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -32,13 +32,18 @@ void wait_for_unix_gc(struct scm_fp_list *fpl);
>  struct unix_vertex {
>  	struct list_head edges;
>  	struct list_head entry;
> +	struct list_head scc_entry;
>  	unsigned long out_degree;
> +	unsigned long index;
> +	unsigned long lowlink;
> +	bool on_stack;
>  };
>  
>  struct unix_edge {
>  	struct unix_sock *predecessor;
>  	struct unix_sock *successor;
>  	struct list_head vertex_entry;
> +	struct list_head stack_entry;
>  };
>  
>  struct sock *unix_peer_get(struct sock *sk);
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index e8fe08796d02..7e90663513f9 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -103,6 +103,11 @@ struct unix_sock *unix_get_socket(struct file *filp)
>  
>  static LIST_HEAD(unix_unvisited_vertices);
>  
> +enum unix_vertex_index {
> +	UNIX_VERTEX_INDEX_UNVISITED,
> +	UNIX_VERTEX_INDEX_START,
> +};
> +
>  static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
>  {
>  	struct unix_vertex *vertex = edge->predecessor->vertex;
> @@ -245,6 +250,81 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
>  	unix_free_vertices(fpl);
>  }
>  
> +static LIST_HEAD(unix_visited_vertices);
> +
> +static void __unix_walk_scc(struct unix_vertex *vertex)
> +{
> +	unsigned long index = UNIX_VERTEX_INDEX_START;
> +	LIST_HEAD(vertex_stack);
> +	struct unix_edge *edge;
> +	LIST_HEAD(edge_stack);
> +
> +next_vertex:
> +	vertex->index = index;
> +	vertex->lowlink = index;
> +	index++;
> +
> +	vertex->on_stack = true;
> +	list_add(&vertex->scc_entry, &vertex_stack);
> +
> +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> +		struct unix_vertex *next_vertex = edge->successor->vertex;
> +
> +		if (!next_vertex)
> +			continue;
> +
> +		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
> +			list_add(&edge->stack_entry, &edge_stack);
> +
> +			vertex = next_vertex;
> +			goto next_vertex;
> +prev_vertex:
> +			next_vertex = vertex;
> +
> +			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
> +			list_del_init(&edge->stack_entry);
> +
> +			vertex = edge->predecessor->vertex;
> +
> +			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> +		} else if (edge->successor->vertex->on_stack) {

It looks like you can replace ^^^^^^^^^^^^^^^^^^^^ with 'next_vertex'
and that would be more readable.

IMHO more importantly: this code is fairly untrivial, I think that a
significant amount of comments would help the review and the long term
maintenance - even if everything is crystal clear and obvious to you,
restating the obvious in a comment would help me ;)

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC.
  2024-02-23 21:40 ` [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
@ 2024-02-27 11:19   ` Paolo Abeni
  2024-02-28  3:05     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Abeni @ 2024-02-27 11:19 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> The definition of the lowlink in Tarjan's algorithm is the
> smallest index of a vertex that is reachable with at most one
> back-edge in SCC.  This is not useful for a cross-edge.
> 
> If we start traversing from A in the following graph, the final
> lowlink of D is 3.  The cross-edge here is one between D and C.
> 
>   A -> B -> D   D = (4, 3)  (index, lowlink)
>   ^    |    |   C = (3, 1)
>   |    V    |   B = (2, 1)
>   `--- C <--'   A = (1, 1)
> 
> This is because the lowlink of D is updated with the index of C.
> 
> In the following patch, we detect a dead SCC by checking two
> conditions for each vertex.
> 
>   1) vertex has no edge directed to another SCC (no bridge)
>   2) vertex's out_degree is the same as the refcount of its file
> 
> If 1) is false, there is a receiver of all fds of the SCC and
> its ancestor SCC.
> 
> To evaluate 1), we need to assign a unique index to each SCC and
> assign it to all vertices in the SCC.
> 
> This patch changes the lowlink update logic for cross-edge so
> that in the example above, the lowlink of D is updated with the
> lowlink of C.
> 
>   A -> B -> D   D = (4, 1)  (index, lowlink)
>   ^    |    |   C = (3, 1)
>   |    V    |   B = (2, 1)
>   `--- C <--'   A = (1, 1)
> 
> Then, all vertices in the same SCC have the same lowlink, and we
> can quickly find the bridge connecting to different SCC if exists.
> 
> However, it is no longer called lowlink, so we rename it to
> scc_index.  (It's sometimes called lowpoint.)
> 
> Also, we add a global variable to hold the last index used in DFS
> so that we do not reset the initial index in each DFS.
> 
> This patch can be squashed to the SCC detection patch but is
> split deliberately for anyone wondering why lowlink is not used
> as used in the original Tarjan's algorithm and many reference
> implementations.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/af_unix.h |  2 +-
>  net/unix/garbage.c    | 15 ++++++++-------
>  2 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index ec040caaa4b5..696d997a5ac9 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -36,7 +36,7 @@ struct unix_vertex {
>  	struct list_head scc_entry;
>  	unsigned long out_degree;
>  	unsigned long index;
> -	unsigned long lowlink;
> +	unsigned long scc_index;
>  };
>  
>  struct unix_edge {
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 1d9a0498dec5..0eb1610c96d7 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -308,18 +308,18 @@ static bool unix_scc_cyclic(struct list_head *scc)
>  
>  static LIST_HEAD(unix_visited_vertices);
>  static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> +static unsigned long unix_vertex_last_index = UNIX_VERTEX_INDEX_START;
>  
>  static void __unix_walk_scc(struct unix_vertex *vertex)
>  {
> -	unsigned long index = UNIX_VERTEX_INDEX_START;
>  	LIST_HEAD(vertex_stack);
>  	struct unix_edge *edge;
>  	LIST_HEAD(edge_stack);
>  
>  next_vertex:
> -	vertex->index = index;
> -	vertex->lowlink = index;
> -	index++;
> +	vertex->index = unix_vertex_last_index;
> +	vertex->scc_index = unix_vertex_last_index;
> +	unix_vertex_last_index++;
>  
>  	list_add(&vertex->scc_entry, &vertex_stack);
>  
> @@ -342,13 +342,13 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
>  
>  			vertex = edge->predecessor->vertex;
>  
> -			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
>  		} else if (next_vertex->index != unix_vertex_grouped_index) {
> -			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
> +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);

I guess the above will break when unix_vertex_last_index wraps around,
or am I low on coffee? (I guess there is not such a thing as enough
coffee to allow me reviewing this whole series at once ;)

Can we expect a wrap around in host with (surprisingly very) long
uptimes? 

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 12/14] af_unix: Detect dead SCC.
  2024-02-23 21:40 ` [PATCH v3 net-next 12/14] af_unix: Detect dead SCC Kuniyuki Iwashima
@ 2024-02-27 11:25   ` Paolo Abeni
  2024-02-28  3:14     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Abeni @ 2024-02-27 11:25 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> When iterating SCC, we call unix_vertex_dead() for each vertex
> to check if the vertex is close()d and has no bridge to another
> SCC.
> 
> If both conditions are true for every vertex in SCC, we can
> execute garbage collection for all skb in the SCC.
> 
> The actual garbage collection is done in the following patch,
> replacing the old implementation.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/unix/garbage.c | 37 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 0eb1610c96d7..060e81be3614 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -288,6 +288,32 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
>  	unix_free_vertices(fpl);
>  }
>  
> +static bool unix_vertex_dead(struct unix_vertex *vertex)
> +{
> +	struct unix_edge *edge;
> +	struct unix_sock *u;
> +	long total_ref;
> +
> +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> +		struct unix_vertex *next_vertex = unix_edge_successor(edge);
> +
> +		if (!next_vertex)
> +			return false;
> +
> +		if (next_vertex->scc_index != vertex->scc_index)
> +			return false;
> +	}
> +
> +	edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
> +	u = edge->predecessor;
> +	total_ref = file_count(u->sk.sk_socket->file);
> +
> +	if (total_ref != vertex->out_degree)
> +		return false;
> +
> +	return true;
> +}
> +
>  static bool unix_scc_cyclic(struct list_head *scc)
>  {
>  	struct unix_vertex *vertex;
> @@ -350,6 +376,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
>  
>  	if (vertex->index == vertex->scc_index) {
>  		struct list_head scc;
> +		bool dead = true;

Very minor nit: the above variable 'sounds' alike referring to the
current vertex, while instead it tracks a global status, what about
'all_dead' or 'gc_all'?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm.
  2024-02-23 21:40 ` [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm Kuniyuki Iwashima
@ 2024-02-27 11:36   ` Paolo Abeni
  2024-02-28  3:32     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Abeni @ 2024-02-27 11:36 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 060e81be3614..59a87a997a4d 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -314,6 +314,48 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
>  	return true;
>  }
>  
> +static struct sk_buff_head hitlist;

I *think* hitlist could be replaced with a local variable in
__unix_gc(), WDYT?

> +
> +static void unix_collect_skb(struct list_head *scc)
> +{
> +	struct unix_vertex *vertex;
> +
> +	list_for_each_entry_reverse(vertex, scc, scc_entry) {
> +		struct sk_buff_head *queue;
> +		struct unix_edge *edge;
> +		struct unix_sock *u;
> +
> +		edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
> +		u = edge->predecessor;
> +		queue = &u->sk.sk_receive_queue;
> +
> +		spin_lock(&queue->lock);
> +
> +		if (u->sk.sk_state == TCP_LISTEN) {
> +			struct sk_buff *skb;
> +
> +			skb_queue_walk(queue, skb) {
> +				struct sk_buff_head *embryo_queue = &skb->sk->sk_receive_queue;
> +
> +				spin_lock(&embryo_queue->lock);

I'm wondering if and why lockdep would be happy about the above. I
think this deserve at least a comment.


> +				skb_queue_splice_init(embryo_queue, &hitlist);
> +				spin_unlock(&embryo_queue->lock);
> +			}
> +		} else {
> +			skb_queue_splice_init(queue, &hitlist);
> +
> +#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> +			if (u->oob_skb) {
> +				kfree_skb(u->oob_skb);

Similar question here. This happens under the u receive queue lock,
could we his some complex lock dependency? what about moving oob_skb to
hitlist instead?


Cheers,

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.
  2024-02-27 10:47   ` Paolo Abeni
@ 2024-02-28  2:34     ` Kuniyuki Iwashima
  2024-02-28  7:46       ` Paolo Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-28  2:34 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 27 Feb 2024 11:47:23 +0100
> On Fri, 2024-02-23 at 13:39 -0800, Kuniyuki Iwashima wrote:
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 96d0b1db3638..e8fe08796d02 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -148,6 +148,7 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
> >  }
> >  
> >  DEFINE_SPINLOCK(unix_gc_lock);
> > +unsigned int unix_tot_inflight;
> >  
> >  void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
> >  {
> > @@ -172,7 +173,10 @@ void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
> >  		unix_add_edge(fpl, edge);
> >  	} while (i < fpl->count_unix);
> >  
> > +	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + fpl->count_unix);
> >  out:
> > +	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight + fpl->count);
> 
> I'm unsure if later patches will shed some light, but why the above
> statement is placed _after_ the 'out' label? fpl->count will be 0 in
> such path, and the updated not needed. Why don't you place it before
> the mentioned label?

fpl->count is the total number of fds in skb, and fpl->count_unix
is the number of AF_UNIX fds.

So, we could reach the out: label if we pass no AF_UNIX fd but
other fds, then we count the number for each user to use in
too_many_unix_fds().

Before this change, unix_inflight() and unix_notinflight() did the
same but incremented/decremented one by one.


> 
> > +
> >  	spin_unlock(&unix_gc_lock);
> >  
> >  	fpl->inflight = true;
> > @@ -195,7 +199,10 @@ void unix_del_edges(struct scm_fp_list *fpl)
> >  		unix_del_edge(fpl, edge);
> >  	} while (i < fpl->count_unix);
> >  
> > +	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - fpl->count_unix);
> >  out:
> > +	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight - fpl->count);
> 
> Same question here.
> 
> Thanks!
> 
> Paolo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components.
  2024-02-27 11:02   ` Paolo Abeni
@ 2024-02-28  2:49     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-28  2:49 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 27 Feb 2024 12:02:27 +0100
> On Fri, 2024-02-23 at 13:39 -0800, Kuniyuki Iwashima wrote:
> > In the new GC, we use a simple graph algorithm, Tarjan's Strongly
> > Connected Components (SCC) algorithm, to find cyclic references.
> > 
> > The algorithm visits every vertex exactly once using depth-first
> > search (DFS).  We implement it without recursion so that no one
> > can abuse it.
> > 
> > There could be multiple graphs, so we iterate unix_unvisited_vertices
> > in unix_walk_scc() and do DFS in __unix_walk_scc(), where we move
> > visited vertices to another list, unix_visited_vertices, not to
> > restart DFS twice on a visited vertex later in unix_walk_scc().
> > 
> > DFS starts by pushing an input vertex to a stack and assigning it
> > a unique number.  Two fields, index and lowlink, are initialised
> > with the number, but lowlink could be updated later during DFS.
> > 
> > If a vertex has an edge to an unvisited inflight vertex, we visit
> > it and do the same processing.  So, we will have vertices in the
> > stack in the order they appear and number them consecutively in
> > the same order.
> > 
> > If a vertex has a back-edge to a visited vertex in the stack,
> > we update the predecessor's lowlink with the successor's index.
> > 
> > After iterating edges from the vertex, we check if its index
> > equals its lowlink.
> > 
> > If the lowlink is different from the index, it shows there was a
> > back-edge.  Then, we propagate the lowlink to its predecessor and
> > go back to the predecessor to resume checking from the next edge
> > of the back-edge.
> > 
> > If the lowlink is the same as the index, we pop vertices before
> > and including the vertex from the stack.  Then, the set of vertices
> > is SCC, possibly forming a cycle.  At the same time, we move the
> > vertices to unix_visited_vertices.
> > 
> > When we finish the algorithm, all vertices in each SCC will be
> > linked via unix_vertex.scc_entry.
> > 
> > Let's take an example.  We have a graph including five inflight
> > vertices (F is not inflight):
> > 
> >   A -> B -> C -> D -> E (-> F)
> >        ^         |
> >        `---------'
> > 
> > Suppose that we start DFS from C.  We will visit C, D, and B first
> > and initialise their index and lowlink.  Then, the stack looks like
> > this:
> > 
> >   > B = (3, 3)  (index, lowlink)
> >     D = (2, 2)
> >     C = (1, 1)
> > 
> > When checking B's edge to C, we update B's lowlink with C's index
> > and propagate it to D.
> > 
> >     B = (3, 1)  (index, lowlink)
> >   > D = (2, 1)
> >     C = (1, 1)
> > 
> > Next, we visit E, which has no edge to an inflight vertex.
> > 
> >   > E = (4, 4)  (index, lowlink)
> >     B = (3, 1)
> >     D = (2, 1)
> >     C = (1, 1)
> > 
> > When we leave from E, its index and lowlink are the same, so we
> > pop E from the stack as single-vertex SCC.  Next, we leave from
> > D but do nothing because its lowlink is different from its index.
> > 
> >     B = (3, 1)  (index, lowlink)
> >     D = (2, 1)
> >   > C = (1, 1)
> > 
> > Then, we leave from C, whose index and lowlink are the same, so
> > we pop B, D and C as SCC.
> > 
> > Last, we do DFS for the rest of vertices, A, which is also a
> > single-vertex SCC.
> > 
> > Finally, each unix_vertex.scc_entry is linked as follows:
> > 
> >   A -.  B -> C -> D  E -.
> >   ^  |  ^         |  ^  |
> >   `--'  `---------'  `--'
> > 
> > We use SCC later to decide whether we can garbage-collect the
> > sockets.
> > 
> > Note that we still cannot detect SCC properly if an edge points
> > to an embryo socket.  The following two patches will sort it out.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/af_unix.h |  5 +++
> >  net/unix/garbage.c    | 82 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 87 insertions(+)
> > 
> > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > index f31ad1166346..67736767b616 100644
> > --- a/include/net/af_unix.h
> > +++ b/include/net/af_unix.h
> > @@ -32,13 +32,18 @@ void wait_for_unix_gc(struct scm_fp_list *fpl);
> >  struct unix_vertex {
> >  	struct list_head edges;
> >  	struct list_head entry;
> > +	struct list_head scc_entry;
> >  	unsigned long out_degree;
> > +	unsigned long index;
> > +	unsigned long lowlink;
> > +	bool on_stack;
> >  };
> >  
> >  struct unix_edge {
> >  	struct unix_sock *predecessor;
> >  	struct unix_sock *successor;
> >  	struct list_head vertex_entry;
> > +	struct list_head stack_entry;
> >  };
> >  
> >  struct sock *unix_peer_get(struct sock *sk);
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index e8fe08796d02..7e90663513f9 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -103,6 +103,11 @@ struct unix_sock *unix_get_socket(struct file *filp)
> >  
> >  static LIST_HEAD(unix_unvisited_vertices);
> >  
> > +enum unix_vertex_index {
> > +	UNIX_VERTEX_INDEX_UNVISITED,
> > +	UNIX_VERTEX_INDEX_START,
> > +};
> > +
> >  static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
> >  {
> >  	struct unix_vertex *vertex = edge->predecessor->vertex;
> > @@ -245,6 +250,81 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
> >  	unix_free_vertices(fpl);
> >  }
> >  
> > +static LIST_HEAD(unix_visited_vertices);
> > +
> > +static void __unix_walk_scc(struct unix_vertex *vertex)
> > +{
> > +	unsigned long index = UNIX_VERTEX_INDEX_START;
> > +	LIST_HEAD(vertex_stack);
> > +	struct unix_edge *edge;
> > +	LIST_HEAD(edge_stack);
> > +
> > +next_vertex:
> > +	vertex->index = index;
> > +	vertex->lowlink = index;
> > +	index++;
> > +
> > +	vertex->on_stack = true;
> > +	list_add(&vertex->scc_entry, &vertex_stack);
> > +
> > +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> > +		struct unix_vertex *next_vertex = edge->successor->vertex;
> > +
> > +		if (!next_vertex)
> > +			continue;
> > +
> > +		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
> > +			list_add(&edge->stack_entry, &edge_stack);
> > +
> > +			vertex = next_vertex;
> > +			goto next_vertex;
> > +prev_vertex:
> > +			next_vertex = vertex;
> > +
> > +			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
> > +			list_del_init(&edge->stack_entry);
> > +
> > +			vertex = edge->predecessor->vertex;
> > +
> > +			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> > +		} else if (edge->successor->vertex->on_stack) {
> 
> It looks like you can replace ^^^^^^^^^^^^^^^^^^^^ with 'next_vertex'
> and that would be more readable.

Good catch, will update in v2.


> 
> IMHO more importantly: this code is fairly untrivial, I think that a
> significant amount of comments would help the review and the long term
> maintenance - even if everything is crystal clear and obvious to you,
> restating the obvious in a comment would help me ;)

Thanks for your review and sorry for bothering you..  yeah, I understand
that but it was hard to comment how the graph algorithm works without
examples.

Actually, I drew dozens of diagrams in iPad with many patterns to ensure
that the code works, so I tried to fill the gap with the long commit
message (and incremental changes for later optimisations).

I'll try to split this commit to DFS and Tarjan part to make review
a bit easier and add more useful comments.

Thank you!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC.
  2024-02-27 11:19   ` Paolo Abeni
@ 2024-02-28  3:05     ` Kuniyuki Iwashima
  2024-02-28  7:49       ` Paolo Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-28  3:05 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 27 Feb 2024 12:19:40 +0100
> On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > The definition of the lowlink in Tarjan's algorithm is the
> > smallest index of a vertex that is reachable with at most one
> > back-edge in SCC.  This is not useful for a cross-edge.
> > 
> > If we start traversing from A in the following graph, the final
> > lowlink of D is 3.  The cross-edge here is one between D and C.
> > 
> >   A -> B -> D   D = (4, 3)  (index, lowlink)
> >   ^    |    |   C = (3, 1)
> >   |    V    |   B = (2, 1)
> >   `--- C <--'   A = (1, 1)
> > 
> > This is because the lowlink of D is updated with the index of C.
> > 
> > In the following patch, we detect a dead SCC by checking two
> > conditions for each vertex.
> > 
> >   1) vertex has no edge directed to another SCC (no bridge)
> >   2) vertex's out_degree is the same as the refcount of its file
> > 
> > If 1) is false, there is a receiver of all fds of the SCC and
> > its ancestor SCC.
> > 
> > To evaluate 1), we need to assign a unique index to each SCC and
> > assign it to all vertices in the SCC.
> > 
> > This patch changes the lowlink update logic for cross-edge so
> > that in the example above, the lowlink of D is updated with the
> > lowlink of C.
> > 
> >   A -> B -> D   D = (4, 1)  (index, lowlink)
> >   ^    |    |   C = (3, 1)
> >   |    V    |   B = (2, 1)
> >   `--- C <--'   A = (1, 1)
> > 
> > Then, all vertices in the same SCC have the same lowlink, and we
> > can quickly find the bridge connecting to different SCC if exists.
> > 
> > However, it is no longer called lowlink, so we rename it to
> > scc_index.  (It's sometimes called lowpoint.)
> > 
> > Also, we add a global variable to hold the last index used in DFS
> > so that we do not reset the initial index in each DFS.
> > 
> > This patch can be squashed to the SCC detection patch but is
> > split deliberately for anyone wondering why lowlink is not used
> > as used in the original Tarjan's algorithm and many reference
> > implementations.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/af_unix.h |  2 +-
> >  net/unix/garbage.c    | 15 ++++++++-------
> >  2 files changed, 9 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > index ec040caaa4b5..696d997a5ac9 100644
> > --- a/include/net/af_unix.h
> > +++ b/include/net/af_unix.h
> > @@ -36,7 +36,7 @@ struct unix_vertex {
> >  	struct list_head scc_entry;
> >  	unsigned long out_degree;
> >  	unsigned long index;
> > -	unsigned long lowlink;
> > +	unsigned long scc_index;
> >  };
> >  
> >  struct unix_edge {
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 1d9a0498dec5..0eb1610c96d7 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -308,18 +308,18 @@ static bool unix_scc_cyclic(struct list_head *scc)
> >  
> >  static LIST_HEAD(unix_visited_vertices);
> >  static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> > +static unsigned long unix_vertex_last_index = UNIX_VERTEX_INDEX_START;
> >  
> >  static void __unix_walk_scc(struct unix_vertex *vertex)
> >  {
> > -	unsigned long index = UNIX_VERTEX_INDEX_START;
> >  	LIST_HEAD(vertex_stack);
> >  	struct unix_edge *edge;
> >  	LIST_HEAD(edge_stack);
> >  
> >  next_vertex:
> > -	vertex->index = index;
> > -	vertex->lowlink = index;
> > -	index++;
> > +	vertex->index = unix_vertex_last_index;
> > +	vertex->scc_index = unix_vertex_last_index;
> > +	unix_vertex_last_index++;
> >  
> >  	list_add(&vertex->scc_entry, &vertex_stack);
> >  
> > @@ -342,13 +342,13 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
> >  
> >  			vertex = edge->predecessor->vertex;
> >  
> > -			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> >  		} else if (next_vertex->index != unix_vertex_grouped_index) {
> > -			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
> > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> 
> I guess the above will break when unix_vertex_last_index wraps around,
> or am I low on coffee? (I guess there is not such a thing as enough
> coffee to allow me reviewing this whole series at once ;)
> 
> Can we expect a wrap around in host with (surprisingly very) long
> uptimes? 

Then, the number of inflight AF_UNIX sockets is at least 2^64 - 1.
After this series, struct unix_sock is 1024 bytes, so... the host
would have roughly

  2^10 * 2^64 == 2^74 bytes == 2^34 TBi == 17179869184 TBi

memory!

So, we need not expect a wrap around :)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 12/14] af_unix: Detect dead SCC.
  2024-02-27 11:25   ` Paolo Abeni
@ 2024-02-28  3:14     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-28  3:14 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 27 Feb 2024 12:25:56 +0100
> On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > When iterating SCC, we call unix_vertex_dead() for each vertex
> > to check if the vertex is close()d and has no bridge to another
> > SCC.
> > 
> > If both conditions are true for every vertex in SCC, we can
> > execute garbage collection for all skb in the SCC.
> > 
> > The actual garbage collection is done in the following patch,
> > replacing the old implementation.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  net/unix/garbage.c | 37 ++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 0eb1610c96d7..060e81be3614 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -288,6 +288,32 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
> >  	unix_free_vertices(fpl);
> >  }
> >  
> > +static bool unix_vertex_dead(struct unix_vertex *vertex)
> > +{
> > +	struct unix_edge *edge;
> > +	struct unix_sock *u;
> > +	long total_ref;
> > +
> > +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> > +		struct unix_vertex *next_vertex = unix_edge_successor(edge);
> > +
> > +		if (!next_vertex)
> > +			return false;
> > +
> > +		if (next_vertex->scc_index != vertex->scc_index)
> > +			return false;
> > +	}
> > +
> > +	edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
> > +	u = edge->predecessor;
> > +	total_ref = file_count(u->sk.sk_socket->file);
> > +
> > +	if (total_ref != vertex->out_degree)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  static bool unix_scc_cyclic(struct list_head *scc)
> >  {
> >  	struct unix_vertex *vertex;
> > @@ -350,6 +376,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
> >  
> >  	if (vertex->index == vertex->scc_index) {
> >  		struct list_head scc;
> > +		bool dead = true;
> 
> Very minor nit: the above variable 'sounds' alike referring to the
> current vertex, while instead it tracks a global status, what about
> 'all_dead' or 'gc_all'?

Exactly. I'll use scc_dead or gc_scc to be consistent
with unix_vertex_dead().

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm.
  2024-02-27 11:36   ` Paolo Abeni
@ 2024-02-28  3:32     ` Kuniyuki Iwashima
  2024-02-28  8:08       ` Paolo Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-28  3:32 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 27 Feb 2024 12:36:51 +0100
> On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 060e81be3614..59a87a997a4d 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -314,6 +314,48 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
> >  	return true;
> >  }
> >  
> > +static struct sk_buff_head hitlist;
> 
> I *think* hitlist could be replaced with a local variable in
> __unix_gc(), WDYT?

Actually it was a local variable in the first draft.

In the current GC impl, hitlist is passed down to functions,
but only the leaf function uses it, and I thought the global
variable would be easier to follow.

And now __unix_gc() is not called twice at the same time thanks
to workqueue, and hitlist can be a global variable.


> 
> > +
> > +static void unix_collect_skb(struct list_head *scc)
> > +{
> > +	struct unix_vertex *vertex;
> > +
> > +	list_for_each_entry_reverse(vertex, scc, scc_entry) {
> > +		struct sk_buff_head *queue;
> > +		struct unix_edge *edge;
> > +		struct unix_sock *u;
> > +
> > +		edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
> > +		u = edge->predecessor;
> > +		queue = &u->sk.sk_receive_queue;
> > +
> > +		spin_lock(&queue->lock);
> > +
> > +		if (u->sk.sk_state == TCP_LISTEN) {
> > +			struct sk_buff *skb;
> > +
> > +			skb_queue_walk(queue, skb) {
> > +				struct sk_buff_head *embryo_queue = &skb->sk->sk_receive_queue;
> > +
> > +				spin_lock(&embryo_queue->lock);
> 
> I'm wondering if and why lockdep would be happy about the above. I
> think this deserve at least a comment.

Ah, exactly, I guess lockdep is unhappy with it, but it would
be false positive anyway.  The inversion lock never happens.

I'll use spin_lock_nested() with a comment, or do

  - splice listener's list to local queue
  - unlock listener's queue
  - skb_queue_walk
    - lock child queue
    - splice
    - unlock child queue
  - lock listener's queue again
  - splice the child list back (to call unix_release_sock() later)


> 
> 
> > +				skb_queue_splice_init(embryo_queue, &hitlist);
> > +				spin_unlock(&embryo_queue->lock);
> > +			}
> > +		} else {
> > +			skb_queue_splice_init(queue, &hitlist);
> > +
> > +#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> > +			if (u->oob_skb) {
> > +				kfree_skb(u->oob_skb);
> 
> Similar question here. This happens under the u receive queue lock,
> could we his some complex lock dependency? what about moving oob_skb to
> hitlist instead?

oob_skb is just a pointer to skb which is put in the recv queue,
so it's already in the hitlist here.

But oob_skb has an additional refcount, so we need to call
kfree_skb() to decrement it, so we don't actually free it
here and later we do in __unix_gc().

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.
  2024-02-28  2:34     ` Kuniyuki Iwashima
@ 2024-02-28  7:46       ` Paolo Abeni
  0 siblings, 0 replies; 33+ messages in thread
From: Paolo Abeni @ 2024-02-28  7:46 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, kuni1840, netdev

On Tue, 2024-02-27 at 18:34 -0800, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 27 Feb 2024 11:47:23 +0100
> > On Fri, 2024-02-23 at 13:39 -0800, Kuniyuki Iwashima wrote:
> > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > index 96d0b1db3638..e8fe08796d02 100644
> > > --- a/net/unix/garbage.c
> > > +++ b/net/unix/garbage.c
> > > @@ -148,6 +148,7 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
> > >  }
> > >  
> > >  DEFINE_SPINLOCK(unix_gc_lock);
> > > +unsigned int unix_tot_inflight;
> > >  
> > >  void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
> > >  {
> > > @@ -172,7 +173,10 @@ void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
> > >  		unix_add_edge(fpl, edge);
> > >  	} while (i < fpl->count_unix);
> > >  
> > > +	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + fpl->count_unix);
> > >  out:
> > > +	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight + fpl->count);
> > 
> > I'm unsure if later patches will shed some light, but why the above
> > statement is placed _after_ the 'out' label? fpl->count will be 0 in
> > such path, and the updated not needed. Why don't you place it before
> > the mentioned label?
> 
> fpl->count is the total number of fds in skb, and fpl->count_unix
> is the number of AF_UNIX fds.

Ah right you are! Sorry, I misread the variable name. This code looks
good.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC.
  2024-02-28  3:05     ` Kuniyuki Iwashima
@ 2024-02-28  7:49       ` Paolo Abeni
  2024-02-28 16:25         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Abeni @ 2024-02-28  7:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, kuni1840, netdev

On Tue, 2024-02-27 at 19:05 -0800, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 27 Feb 2024 12:19:40 +0100
> > On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > > The definition of the lowlink in Tarjan's algorithm is the
> > > smallest index of a vertex that is reachable with at most one
> > > back-edge in SCC.  This is not useful for a cross-edge.
> > > 
> > > If we start traversing from A in the following graph, the final
> > > lowlink of D is 3.  The cross-edge here is one between D and C.
> > > 
> > >   A -> B -> D   D = (4, 3)  (index, lowlink)
> > >   ^    |    |   C = (3, 1)
> > >   |    V    |   B = (2, 1)
> > >   `--- C <--'   A = (1, 1)
> > > 
> > > This is because the lowlink of D is updated with the index of C.
> > > 
> > > In the following patch, we detect a dead SCC by checking two
> > > conditions for each vertex.
> > > 
> > >   1) vertex has no edge directed to another SCC (no bridge)
> > >   2) vertex's out_degree is the same as the refcount of its file
> > > 
> > > If 1) is false, there is a receiver of all fds of the SCC and
> > > its ancestor SCC.
> > > 
> > > To evaluate 1), we need to assign a unique index to each SCC and
> > > assign it to all vertices in the SCC.
> > > 
> > > This patch changes the lowlink update logic for cross-edge so
> > > that in the example above, the lowlink of D is updated with the
> > > lowlink of C.
> > > 
> > >   A -> B -> D   D = (4, 1)  (index, lowlink)
> > >   ^    |    |   C = (3, 1)
> > >   |    V    |   B = (2, 1)
> > >   `--- C <--'   A = (1, 1)
> > > 
> > > Then, all vertices in the same SCC have the same lowlink, and we
> > > can quickly find the bridge connecting to different SCC if exists.
> > > 
> > > However, it is no longer called lowlink, so we rename it to
> > > scc_index.  (It's sometimes called lowpoint.)
> > > 
> > > Also, we add a global variable to hold the last index used in DFS
> > > so that we do not reset the initial index in each DFS.
> > > 
> > > This patch can be squashed to the SCC detection patch but is
> > > split deliberately for anyone wondering why lowlink is not used
> > > as used in the original Tarjan's algorithm and many reference
> > > implementations.
> > > 
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  include/net/af_unix.h |  2 +-
> > >  net/unix/garbage.c    | 15 ++++++++-------
> > >  2 files changed, 9 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > > index ec040caaa4b5..696d997a5ac9 100644
> > > --- a/include/net/af_unix.h
> > > +++ b/include/net/af_unix.h
> > > @@ -36,7 +36,7 @@ struct unix_vertex {
> > >  	struct list_head scc_entry;
> > >  	unsigned long out_degree;
> > >  	unsigned long index;
> > > -	unsigned long lowlink;
> > > +	unsigned long scc_index;
> > >  };
> > >  
> > >  struct unix_edge {
> > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > index 1d9a0498dec5..0eb1610c96d7 100644
> > > --- a/net/unix/garbage.c
> > > +++ b/net/unix/garbage.c
> > > @@ -308,18 +308,18 @@ static bool unix_scc_cyclic(struct list_head *scc)
> > >  
> > >  static LIST_HEAD(unix_visited_vertices);
> > >  static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> > > +static unsigned long unix_vertex_last_index = UNIX_VERTEX_INDEX_START;
> > >  
> > >  static void __unix_walk_scc(struct unix_vertex *vertex)
> > >  {
> > > -	unsigned long index = UNIX_VERTEX_INDEX_START;
> > >  	LIST_HEAD(vertex_stack);
> > >  	struct unix_edge *edge;
> > >  	LIST_HEAD(edge_stack);
> > >  
> > >  next_vertex:
> > > -	vertex->index = index;
> > > -	vertex->lowlink = index;
> > > -	index++;
> > > +	vertex->index = unix_vertex_last_index;
> > > +	vertex->scc_index = unix_vertex_last_index;
> > > +	unix_vertex_last_index++;
> > >  
> > >  	list_add(&vertex->scc_entry, &vertex_stack);
> > >  
> > > @@ -342,13 +342,13 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
> > >  
> > >  			vertex = edge->predecessor->vertex;
> > >  
> > > -			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> > > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> > >  		} else if (next_vertex->index != unix_vertex_grouped_index) {
> > > -			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
> > > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> > 
> > I guess the above will break when unix_vertex_last_index wraps around,
> > or am I low on coffee? (I guess there is not such a thing as enough
> > coffee to allow me reviewing this whole series at once ;)
> > 
> > Can we expect a wrap around in host with (surprisingly very) long
> > uptimes? 
> 
> Then, the number of inflight AF_UNIX sockets is at least 2^64 - 1.

Isn't "unix_vertex_last_index" value preserved across consecutive cg
run? I though we could reach wrap around after a lot of gc runs...

Cheers, 

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm.
  2024-02-28  3:32     ` Kuniyuki Iwashima
@ 2024-02-28  8:08       ` Paolo Abeni
  2024-02-28 16:29         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Abeni @ 2024-02-28  8:08 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, kuni1840, netdev

On Tue, 2024-02-27 at 19:32 -0800, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 27 Feb 2024 12:36:51 +0100
> > On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > index 060e81be3614..59a87a997a4d 100644
> > > --- a/net/unix/garbage.c
> > > +++ b/net/unix/garbage.c
> > > @@ -314,6 +314,48 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
> > >  	return true;
> > >  }
> > >  
> > > +static struct sk_buff_head hitlist;
> > 
> > I *think* hitlist could be replaced with a local variable in
> > __unix_gc(), WDYT?
> 
> Actually it was a local variable in the first draft.
> 
> In the current GC impl, hitlist is passed down to functions,
> but only the leaf function uses it, and I thought the global
> variable would be easier to follow.
> 
> And now __unix_gc() is not called twice at the same time thanks
> to workqueue, and hitlist can be a global variable.

My personal preference would be for a local variable, unless it makes
the code significant more complex: I think it's more clear and avoid
possible false sharing issues in the data segment.

> > > +
> > > +static void unix_collect_skb(struct list_head *scc)
> > > +{
> > > +	struct unix_vertex *vertex;
> > > +
> > > +	list_for_each_entry_reverse(vertex, scc, scc_entry) {
> > > +		struct sk_buff_head *queue;
> > > +		struct unix_edge *edge;
> > > +		struct unix_sock *u;
> > > +
> > > +		edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
> > > +		u = edge->predecessor;
> > > +		queue = &u->sk.sk_receive_queue;
> > > +
> > > +		spin_lock(&queue->lock);
> > > +
> > > +		if (u->sk.sk_state == TCP_LISTEN) {
> > > +			struct sk_buff *skb;
> > > +
> > > +			skb_queue_walk(queue, skb) {
> > > +				struct sk_buff_head *embryo_queue = &skb->sk->sk_receive_queue;
> > > +
> > > +				spin_lock(&embryo_queue->lock);
> > 
> > I'm wondering if and why lockdep would be happy about the above. I
> > think this deserve at least a comment.
> 
> Ah, exactly, I guess lockdep is unhappy with it, but it would
> be false positive anyway.  The inversion lock never happens.
> 
> I'll use spin_lock_nested() with a comment, or do
> 
>   - splice listener's list to local queue
>   - unlock listener's queue
>   - skb_queue_walk
>     - lock child queue
>     - splice
>     - unlock child queue
>   - lock listener's queue again
>   - splice the child list back (to call unix_release_sock() later)

Either ways LGTM.

> > > +				skb_queue_splice_init(embryo_queue, &hitlist);
> > > +				spin_unlock(&embryo_queue->lock);
> > > +			}
> > > +		} else {
> > > +			skb_queue_splice_init(queue, &hitlist);
> > > +
> > > +#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> > > +			if (u->oob_skb) {
> > > +				kfree_skb(u->oob_skb);
> > 
> > Similar question here. This happens under the u receive queue lock,
> > could we his some complex lock dependency? what about moving oob_skb to
> > hitlist instead?
> 
> oob_skb is just a pointer to skb which is put in the recv queue,
> so it's already in the hitlist here.
> 
> But oob_skb has an additional refcount, so we need to call
> kfree_skb() to decrement it, so we don't actually free it
> here and later we do in __unix_gc().

Understood, thanks, LGTM.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC.
  2024-02-28  7:49       ` Paolo Abeni
@ 2024-02-28 16:25         ` Kuniyuki Iwashima
  2024-02-28 17:51           ` Paolo Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-28 16:25 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 28 Feb 2024 08:49:46 +0100
> On Tue, 2024-02-27 at 19:05 -0800, Kuniyuki Iwashima wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> > Date: Tue, 27 Feb 2024 12:19:40 +0100
> > > On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > > > The definition of the lowlink in Tarjan's algorithm is the
> > > > smallest index of a vertex that is reachable with at most one
> > > > back-edge in SCC.  This is not useful for a cross-edge.
> > > > 
> > > > If we start traversing from A in the following graph, the final
> > > > lowlink of D is 3.  The cross-edge here is one between D and C.
> > > > 
> > > >   A -> B -> D   D = (4, 3)  (index, lowlink)
> > > >   ^    |    |   C = (3, 1)
> > > >   |    V    |   B = (2, 1)
> > > >   `--- C <--'   A = (1, 1)
> > > > 
> > > > This is because the lowlink of D is updated with the index of C.
> > > > 
> > > > In the following patch, we detect a dead SCC by checking two
> > > > conditions for each vertex.
> > > > 
> > > >   1) vertex has no edge directed to another SCC (no bridge)
> > > >   2) vertex's out_degree is the same as the refcount of its file
> > > > 
> > > > If 1) is false, there is a receiver of all fds of the SCC and
> > > > its ancestor SCC.
> > > > 
> > > > To evaluate 1), we need to assign a unique index to each SCC and
> > > > assign it to all vertices in the SCC.
> > > > 
> > > > This patch changes the lowlink update logic for cross-edge so
> > > > that in the example above, the lowlink of D is updated with the
> > > > lowlink of C.
> > > > 
> > > >   A -> B -> D   D = (4, 1)  (index, lowlink)
> > > >   ^    |    |   C = (3, 1)
> > > >   |    V    |   B = (2, 1)
> > > >   `--- C <--'   A = (1, 1)
> > > > 
> > > > Then, all vertices in the same SCC have the same lowlink, and we
> > > > can quickly find the bridge connecting to different SCC if exists.
> > > > 
> > > > However, it is no longer called lowlink, so we rename it to
> > > > scc_index.  (It's sometimes called lowpoint.)
> > > > 
> > > > Also, we add a global variable to hold the last index used in DFS
> > > > so that we do not reset the initial index in each DFS.
> > > > 
> > > > This patch can be squashed to the SCC detection patch but is
> > > > split deliberately for anyone wondering why lowlink is not used
> > > > as used in the original Tarjan's algorithm and many reference
> > > > implementations.
> > > > 
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > ---
> > > >  include/net/af_unix.h |  2 +-
> > > >  net/unix/garbage.c    | 15 ++++++++-------
> > > >  2 files changed, 9 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > > > index ec040caaa4b5..696d997a5ac9 100644
> > > > --- a/include/net/af_unix.h
> > > > +++ b/include/net/af_unix.h
> > > > @@ -36,7 +36,7 @@ struct unix_vertex {
> > > >  	struct list_head scc_entry;
> > > >  	unsigned long out_degree;
> > > >  	unsigned long index;
> > > > -	unsigned long lowlink;
> > > > +	unsigned long scc_index;
> > > >  };
> > > >  
> > > >  struct unix_edge {
> > > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > > index 1d9a0498dec5..0eb1610c96d7 100644
> > > > --- a/net/unix/garbage.c
> > > > +++ b/net/unix/garbage.c
> > > > @@ -308,18 +308,18 @@ static bool unix_scc_cyclic(struct list_head *scc)
> > > >  
> > > >  static LIST_HEAD(unix_visited_vertices);
> > > >  static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> > > > +static unsigned long unix_vertex_last_index = UNIX_VERTEX_INDEX_START;
> > > >  
> > > >  static void __unix_walk_scc(struct unix_vertex *vertex)
> > > >  {
> > > > -	unsigned long index = UNIX_VERTEX_INDEX_START;
> > > >  	LIST_HEAD(vertex_stack);
> > > >  	struct unix_edge *edge;
> > > >  	LIST_HEAD(edge_stack);
> > > >  
> > > >  next_vertex:
> > > > -	vertex->index = index;
> > > > -	vertex->lowlink = index;
> > > > -	index++;
> > > > +	vertex->index = unix_vertex_last_index;
> > > > +	vertex->scc_index = unix_vertex_last_index;
> > > > +	unix_vertex_last_index++;
> > > >  
> > > >  	list_add(&vertex->scc_entry, &vertex_stack);
> > > >  
> > > > @@ -342,13 +342,13 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
> > > >  
> > > >  			vertex = edge->predecessor->vertex;
> > > >  
> > > > -			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> > > > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> > > >  		} else if (next_vertex->index != unix_vertex_grouped_index) {
> > > > -			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
> > > > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> > > 
> > > I guess the above will break when unix_vertex_last_index wraps around,
> > > or am I low on coffee? (I guess there is not such a thing as enough
> > > coffee to allow me reviewing this whole series at once ;)
> > > 
> > > Can we expect a wrap around in host with (surprisingly very) long
> > > uptimes? 
> > 
> > Then, the number of inflight AF_UNIX sockets is at least 2^64 - 1.
> 
> Isn't "unix_vertex_last_index" value preserved across consecutive cg
> run? I though we could reach wrap around after a lot of gc runs...

It's preserved across consecutive DFS in a single gc run, but
unix_walk_scc() always reset it.  So, if it's wrapped, there
would be too many sockets.

I used unix_vertex_last_index elsewhere in the initial draft,
but now local variable could be better here.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm.
  2024-02-28  8:08       ` Paolo Abeni
@ 2024-02-28 16:29         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2024-02-28 16:29 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 28 Feb 2024 09:08:32 +0100
> On Tue, 2024-02-27 at 19:32 -0800, Kuniyuki Iwashima wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> > Date: Tue, 27 Feb 2024 12:36:51 +0100
> > > On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > > index 060e81be3614..59a87a997a4d 100644
> > > > --- a/net/unix/garbage.c
> > > > +++ b/net/unix/garbage.c
> > > > @@ -314,6 +314,48 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
> > > >  	return true;
> > > >  }
> > > >  
> > > > +static struct sk_buff_head hitlist;
> > > 
> > > I *think* hitlist could be replaced with a local variable in
> > > __unix_gc(), WDYT?
> > 
> > Actually it was a local variable in the first draft.
> > 
> > In the current GC impl, hitlist is passed down to functions,
> > but only the leaf function uses it, and I thought the global
> > variable would be easier to follow.
> > 
> > And now __unix_gc() is not called twice at the same time thanks
> > to workqueue, and hitlist can be a global variable.
> 
> My personal preference would be for a local variable, unless it makes
> the code significant more complex: I think it's more clear and avoid
> possible false sharing issues in the data segment.

I didn't think of that point.
I'll use a local variable for hitlist.

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC.
  2024-02-28 16:25         ` Kuniyuki Iwashima
@ 2024-02-28 17:51           ` Paolo Abeni
  0 siblings, 0 replies; 33+ messages in thread
From: Paolo Abeni @ 2024-02-28 17:51 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, kuni1840, netdev

On Wed, 2024-02-28 at 08:25 -0800, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Wed, 28 Feb 2024 08:49:46 +0100
> > On Tue, 2024-02-27 at 19:05 -0800, Kuniyuki Iwashima wrote:
> > > From: Paolo Abeni <pabeni@redhat.com>
> > > Date: Tue, 27 Feb 2024 12:19:40 +0100
> > > > On Fri, 2024-02-23 at 13:40 -0800, Kuniyuki Iwashima wrote:
> > > > > The definition of the lowlink in Tarjan's algorithm is the
> > > > > smallest index of a vertex that is reachable with at most one
> > > > > back-edge in SCC.  This is not useful for a cross-edge.
> > > > > 
> > > > > If we start traversing from A in the following graph, the final
> > > > > lowlink of D is 3.  The cross-edge here is one between D and C.
> > > > > 
> > > > >   A -> B -> D   D = (4, 3)  (index, lowlink)
> > > > >   ^    |    |   C = (3, 1)
> > > > >   |    V    |   B = (2, 1)
> > > > >   `--- C <--'   A = (1, 1)
> > > > > 
> > > > > This is because the lowlink of D is updated with the index of C.
> > > > > 
> > > > > In the following patch, we detect a dead SCC by checking two
> > > > > conditions for each vertex.
> > > > > 
> > > > >   1) vertex has no edge directed to another SCC (no bridge)
> > > > >   2) vertex's out_degree is the same as the refcount of its file
> > > > > 
> > > > > If 1) is false, there is a receiver of all fds of the SCC and
> > > > > its ancestor SCC.
> > > > > 
> > > > > To evaluate 1), we need to assign a unique index to each SCC and
> > > > > assign it to all vertices in the SCC.
> > > > > 
> > > > > This patch changes the lowlink update logic for cross-edge so
> > > > > that in the example above, the lowlink of D is updated with the
> > > > > lowlink of C.
> > > > > 
> > > > >   A -> B -> D   D = (4, 1)  (index, lowlink)
> > > > >   ^    |    |   C = (3, 1)
> > > > >   |    V    |   B = (2, 1)
> > > > >   `--- C <--'   A = (1, 1)
> > > > > 
> > > > > Then, all vertices in the same SCC have the same lowlink, and we
> > > > > can quickly find the bridge connecting to different SCC if exists.
> > > > > 
> > > > > However, it is no longer called lowlink, so we rename it to
> > > > > scc_index.  (It's sometimes called lowpoint.)
> > > > > 
> > > > > Also, we add a global variable to hold the last index used in DFS
> > > > > so that we do not reset the initial index in each DFS.
> > > > > 
> > > > > This patch can be squashed to the SCC detection patch but is
> > > > > split deliberately for anyone wondering why lowlink is not used
> > > > > as used in the original Tarjan's algorithm and many reference
> > > > > implementations.
> > > > > 
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > ---
> > > > >  include/net/af_unix.h |  2 +-
> > > > >  net/unix/garbage.c    | 15 ++++++++-------
> > > > >  2 files changed, 9 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > > > > index ec040caaa4b5..696d997a5ac9 100644
> > > > > --- a/include/net/af_unix.h
> > > > > +++ b/include/net/af_unix.h
> > > > > @@ -36,7 +36,7 @@ struct unix_vertex {
> > > > >  	struct list_head scc_entry;
> > > > >  	unsigned long out_degree;
> > > > >  	unsigned long index;
> > > > > -	unsigned long lowlink;
> > > > > +	unsigned long scc_index;
> > > > >  };
> > > > >  
> > > > >  struct unix_edge {
> > > > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > > > index 1d9a0498dec5..0eb1610c96d7 100644
> > > > > --- a/net/unix/garbage.c
> > > > > +++ b/net/unix/garbage.c
> > > > > @@ -308,18 +308,18 @@ static bool unix_scc_cyclic(struct list_head *scc)
> > > > >  
> > > > >  static LIST_HEAD(unix_visited_vertices);
> > > > >  static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> > > > > +static unsigned long unix_vertex_last_index = UNIX_VERTEX_INDEX_START;
> > > > >  
> > > > >  static void __unix_walk_scc(struct unix_vertex *vertex)
> > > > >  {
> > > > > -	unsigned long index = UNIX_VERTEX_INDEX_START;
> > > > >  	LIST_HEAD(vertex_stack);
> > > > >  	struct unix_edge *edge;
> > > > >  	LIST_HEAD(edge_stack);
> > > > >  
> > > > >  next_vertex:
> > > > > -	vertex->index = index;
> > > > > -	vertex->lowlink = index;
> > > > > -	index++;
> > > > > +	vertex->index = unix_vertex_last_index;
> > > > > +	vertex->scc_index = unix_vertex_last_index;
> > > > > +	unix_vertex_last_index++;
> > > > >  
> > > > >  	list_add(&vertex->scc_entry, &vertex_stack);
> > > > >  
> > > > > @@ -342,13 +342,13 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
> > > > >  
> > > > >  			vertex = edge->predecessor->vertex;
> > > > >  
> > > > > -			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
> > > > > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> > > > >  		} else if (next_vertex->index != unix_vertex_grouped_index) {
> > > > > -			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
> > > > > +			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
> > > > 
> > > > I guess the above will break when unix_vertex_last_index wraps around,
> > > > or am I low on coffee? (I guess there is not such a thing as enough
> > > > coffee to allow me reviewing this whole series at once ;)
> > > > 
> > > > Can we expect a wrap around in host with (surprisingly very) long
> > > > uptimes? 
> > > 
> > > Then, the number of inflight AF_UNIX sockets is at least 2^64 - 1.
> > 
> > Isn't "unix_vertex_last_index" value preserved across consecutive cg
> > run? I though we could reach wrap around after a lot of gc runs...
> 
> It's preserved across consecutive DFS in a single gc run, but
> unix_walk_scc() always reset it.  So, if it's wrapped, there
> would be too many sockets.

Ah, I missed that point. No wrap-around problem then!

> I used unix_vertex_last_index elsewhere in the initial draft,
> but now local variable could be better here.

You could bundle the index, hitlist, etc. in a single struct (gs_state
or whatever) and pass around a single argument, if that helps.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-02-28 17:51 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-23 21:39 [PATCH v3 net-next 00/14] af_unix: Rework GC Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 01/14] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 02/14] af_unix: Allocate struct unix_edge " Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 03/14] af_unix: Link struct unix_edge when queuing skb Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 04/14] af_unix: Bulk update unix_tot_inflight/unix_inflight " Kuniyuki Iwashima
2024-02-27 10:47   ` Paolo Abeni
2024-02-28  2:34     ` Kuniyuki Iwashima
2024-02-28  7:46       ` Paolo Abeni
2024-02-23 21:39 ` [PATCH v3 net-next 05/14] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
2024-02-25  0:34   ` Jakub Kicinski
2024-02-26 19:07     ` Kuniyuki Iwashima
2024-02-27 11:02   ` Paolo Abeni
2024-02-28  2:49     ` Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 06/14] af_unix: Save listener for embryo socket Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 07/14] af_unix: Fix up unix_edge.successor " Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 08/14] af_unix: Save O(n) setup of Tarjan's algo Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 09/14] af_unix: Skip GC if no cycle exists Kuniyuki Iwashima
2024-02-23 21:39 ` [PATCH v3 net-next 10/14] af_unix: Avoid Tarjan's algorithm if unnecessary Kuniyuki Iwashima
2024-02-23 21:40 ` [PATCH v3 net-next 11/14] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
2024-02-27 11:19   ` Paolo Abeni
2024-02-28  3:05     ` Kuniyuki Iwashima
2024-02-28  7:49       ` Paolo Abeni
2024-02-28 16:25         ` Kuniyuki Iwashima
2024-02-28 17:51           ` Paolo Abeni
2024-02-23 21:40 ` [PATCH v3 net-next 12/14] af_unix: Detect dead SCC Kuniyuki Iwashima
2024-02-27 11:25   ` Paolo Abeni
2024-02-28  3:14     ` Kuniyuki Iwashima
2024-02-23 21:40 ` [PATCH v3 net-next 13/14] af_unix: Replace garbage collection algorithm Kuniyuki Iwashima
2024-02-27 11:36   ` Paolo Abeni
2024-02-28  3:32     ` Kuniyuki Iwashima
2024-02-28  8:08       ` Paolo Abeni
2024-02-28 16:29         ` Kuniyuki Iwashima
2024-02-23 21:40 ` [PATCH v3 net-next 14/14] selftest: af_unix: Test GC for SCM_RIGHTS Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).