All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 net-next 00/15] af_unix: Rework GC.
@ 2024-03-01  2:22 Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 01/15] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
                   ` (15 more replies)
  0 siblings, 16 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

When we pass a file descriptor to an AF_UNIX socket via SCM_RIGTHS,
the underlying struct file of the inflight fd gets its refcount bumped.
If the fd is of an AF_UNIX socket, we need to track it in case it forms
cyclic references.

Let's say we send a fd of AF_UNIX socket A to B and vice versa and
close() both sockets.

When created, each socket's struct file initially has one reference.
After the fd exchange, both refcounts are bumped up to 2.  Then, close()
decreases both to 1.  From this point on, no one can touch the file/socket.

However, the struct file has one refcount and thus never calls the
release() function of the AF_UNIX socket.

That's why we need to track all inflight AF_UNIX sockets and run garbage
collection.

This series replaces the current GC implementation that locks each inflight
socket's receive queue and requires trickiness in other places.

The new GC does not lock each socket's queue to minimise its effect and
tries to be lightweight if there is no cyclic reference or no update in
the shape of the inflight fd graph.

The new implementation is based on Tarjan's Strongly Connected Components
algorithm, and we will consider each inflight AF_UNIX socket as a vertex
and its file descriptor as an edge in a directed graph.

For the details, please see each patch.

  patch 1  -  3 : Add struct to express inflight socket graphs
  patch       4 : Optimse inflight fd counting
  patch 5  -  6 : Group SCC possibly forming a cycle
  patch 7  -  8 : Support embryo socket
  patch 9  - 11 : Make GC lightweight
  patch 12 - 13 : Detect dead cycle references
  patch      14 : Replace GC algorithm
  patch      15 : selftest

After this series is applied, we can remove the two ugly tricks for race,
scm_fp_dup() in unix_attach_fds() and spin_lock dance in unix_peek_fds()
as done in patch 14/15 of v1.


Changes:
  v4:
    * Split SCC detection patch to 3 & 4
    * Add comments

    * Patch 10
      * Remove early return in unix_update_graph(), (cyclic=1, grouped=1)
        triggers access to uninit scc_index in unix_walk_scc_fast()
    * Patch 12
      * Make unix_vertex_last_index local var
    * Patch 13
      * s/dead/scc_dead/
    * Patch 14
      * Fix lockdep false-positive splat
      * Make hitlist local var

  v3: https://lore.kernel.org/netdev/20240223214003.17369-1-kuniyu@amazon.com/
    * Patch 1
      * Allocate struct unix_vertex dynamically only for inflight socket
    * Patch 2
      * Rename unix_edge.entry to unix_edge.vertex_entry
      * Change edge->successor/predecessor to struct unix_sock
    * Patch 7
      * Fix up embryo successor during GC instead of overwriting edge
        in unix_add_edge()
        * To not allcoate unix_vertex to listener for embryo socket
        * Kept the name unix_update_edges() unchanged as it affect
          successor tracking during GC
    * Patch 12
      * Drop self_degree and check all edges
        * To not allcoate unix_vertex to listener for embryo socket

  v2: https://lore.kernel.org/netdev/20240216210556.65913-1-kuniyu@amazon.com/
    * Drop 2 patches as follow-up that removes trickiness in
      unix_attach_fds() and unix_peek_fds().

    * Patch 2
      * Fix build error when CONFIG_UNIX=n
    * Patch 3
      * Remove unnecessary INIT_LIST_HEAD()
    * Patch 7
      * Fix build warning for using goto label at the end of the loop
    * Patch 13
      * Call kfree_skb() for oob skb
    * Patch 14
      * Add test case for MSG_OOB

  v1: https://lore.kernel.org/netdev/20240203030058.60750-1-kuniyu@amazon.com/


Kuniyuki Iwashima (15):
  af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd.
  af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.
  af_unix: Link struct unix_edge when queuing skb.
  af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.
  af_unix: Iterate all vertices by DFS.
  af_unix: Detect Strongly Connected Components.
  af_unix: Save listener for embryo socket.
  af_unix: Fix up unix_edge.successor for embryo socket.
  af_unix: Save O(n) setup of Tarjan's algo.
  af_unix: Skip GC if no cycle exists.
  af_unix: Avoid Tarjan's algorithm if unnecessary.
  af_unix: Assign a unique index to SCC.
  af_unix: Detect dead SCC.
  af_unix: Replace garbage collection algorithm.
  selftest: af_unix: Test GC for SCM_RIGHTS.

 include/net/af_unix.h                         |  31 +-
 include/net/scm.h                             |   9 +
 net/core/scm.c                                |  11 +
 net/unix/af_unix.c                            |  27 +-
 net/unix/garbage.c                            | 573 ++++++++++++------
 tools/testing/selftests/net/.gitignore        |   1 +
 tools/testing/selftests/net/af_unix/Makefile  |   2 +-
 .../selftests/net/af_unix/scm_rights.c        | 286 +++++++++
 8 files changed, 735 insertions(+), 205 deletions(-)
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 01/15] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 02/15] af_unix: Allocate struct unix_edge " Kuniyuki Iwashima
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will replace the garbage collection algorithm for AF_UNIX, where
we will consider each inflight AF_UNIX socket as a vertex and its file
descriptor as an edge in a directed graph.

This patch introduces a new struct unix_vertex representing a vertex
in the graph and adds its pointer to struct unix_sock.

When we send a fd using the SCM_RIGHTS message, we allocate struct
scm_fp_list to struct scm_cookie in scm_fp_copy().  Then, we bump
each refcount of the inflight fds' struct file and save them in
scm_fp_list.fp.

After that, unix_attach_fds() inexplicably clones scm_fp_list of
scm_cookie and sets it to skb.  (We will remove this part after
replacing GC.)

Here, we add a new function call in unix_attach_fds() to preallocate
struct unix_vertex per inflight AF_UNIX fd and link each vertex to
skb's scm_fp_list.vertices.

When sendmsg() succeeds later, if the socket of the inflight fd is
still not inflight yet, we will set the preallocated vertex to struct
unix_sock.vertex and link it to a global list unix_unvisited_vertices
under spin_lock(&unix_gc_lock).

If the socket is already inflight, we free the preallocated vertex.
This is to avoid taking the lock unnecessarily when sendmsg() could
fail later.

In the following patch, we will similarly allocate another struct
per edge, which will finally be linked to the inflight socket's
unix_vertex.edges.

And then, we will count the number of edges as unix_vertex.out_degree.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  9 +++++++++
 include/net/scm.h     |  3 +++
 net/core/scm.c        |  7 +++++++
 net/unix/af_unix.c    |  6 ++++++
 net/unix/garbage.c    | 38 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 63 insertions(+)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 627ea8e2d915..c270877a5256 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -22,9 +22,17 @@ extern unsigned int unix_tot_inflight;
 
 void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
+int unix_prepare_fpl(struct scm_fp_list *fpl);
+void unix_destroy_fpl(struct scm_fp_list *fpl);
 void unix_gc(void);
 void wait_for_unix_gc(struct scm_fp_list *fpl);
 
+struct unix_vertex {
+	struct list_head edges;
+	struct list_head entry;
+	unsigned long out_degree;
+};
+
 struct sock *unix_peer_get(struct sock *sk);
 
 #define UNIX_HASH_MOD	(256 - 1)
@@ -62,6 +70,7 @@ struct unix_sock {
 	struct path		path;
 	struct mutex		iolock, bindlock;
 	struct sock		*peer;
+	struct unix_vertex	*vertex;
 	struct list_head	link;
 	unsigned long		inflight;
 	spinlock_t		lock;
diff --git a/include/net/scm.h b/include/net/scm.h
index 92276a2c5543..e34321b6e204 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -27,6 +27,9 @@ struct scm_fp_list {
 	short			count;
 	short			count_unix;
 	short			max;
+#ifdef CONFIG_UNIX
+	struct list_head	vertices;
+#endif
 	struct user_struct	*user;
 	struct file		*fp[SCM_MAX_FD];
 };
diff --git a/net/core/scm.c b/net/core/scm.c
index 9cd4b0a01cd6..87dfec1c3378 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -89,6 +89,9 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 		fpl->count_unix = 0;
 		fpl->max = SCM_MAX_FD;
 		fpl->user = NULL;
+#if IS_ENABLED(CONFIG_UNIX)
+		INIT_LIST_HEAD(&fpl->vertices);
+#endif
 	}
 	fpp = &fpl->fp[fpl->count];
 
@@ -376,8 +379,12 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
+
 		new_fpl->max = new_fpl->count;
 		new_fpl->user = get_uid(fpl->user);
+#if IS_ENABLED(CONFIG_UNIX)
+		INIT_LIST_HEAD(&new_fpl->vertices);
+#endif
 	}
 	return new_fpl;
 }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5b41e2321209..a3b25d311560 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -980,6 +980,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_destruct		= unix_sock_destructor;
 	u = unix_sk(sk);
 	u->inflight = 0;
+	u->vertex = NULL;
 	u->path.dentry = NULL;
 	u->path.mnt = NULL;
 	spin_lock_init(&u->lock);
@@ -1805,6 +1806,9 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	for (i = scm->fp->count - 1; i >= 0; i--)
 		unix_inflight(scm->fp->user, scm->fp->fp[i]);
 
+	if (unix_prepare_fpl(UNIXCB(skb).fp))
+		return -ENOMEM;
+
 	return 0;
 }
 
@@ -1815,6 +1819,8 @@ static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	scm->fp = UNIXCB(skb).fp;
 	UNIXCB(skb).fp = NULL;
 
+	unix_destroy_fpl(scm->fp);
+
 	for (i = scm->fp->count - 1; i >= 0; i--)
 		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
 }
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index fa39b6265238..75bdf66b81df 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -101,6 +101,44 @@ struct unix_sock *unix_get_socket(struct file *filp)
 	return NULL;
 }
 
+static void unix_free_vertices(struct scm_fp_list *fpl)
+{
+	struct unix_vertex *vertex, *next_vertex;
+
+	list_for_each_entry_safe(vertex, next_vertex, &fpl->vertices, entry) {
+		list_del(&vertex->entry);
+		kfree(vertex);
+	}
+}
+
+int unix_prepare_fpl(struct scm_fp_list *fpl)
+{
+	struct unix_vertex *vertex;
+	int i;
+
+	if (!fpl->count_unix)
+		return 0;
+
+	for (i = 0; i < fpl->count_unix; i++) {
+		vertex = kmalloc(sizeof(*vertex), GFP_KERNEL);
+		if (!vertex)
+			goto err;
+
+		list_add(&vertex->entry, &fpl->vertices);
+	}
+
+	return 0;
+
+err:
+	unix_free_vertices(fpl);
+	return -ENOMEM;
+}
+
+void unix_destroy_fpl(struct scm_fp_list *fpl)
+{
+	unix_free_vertices(fpl);
+}
+
 DEFINE_SPINLOCK(unix_gc_lock);
 unsigned int unix_tot_inflight;
 static LIST_HEAD(gc_candidates);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 02/15] af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 01/15] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-05  8:47   ` Paolo Abeni
  2024-03-01  2:22 ` [PATCH v4 net-next 03/15] af_unix: Link struct unix_edge when queuing skb Kuniyuki Iwashima
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

As with the previous patch, we preallocate to skb's scm_fp_list an
array of struct unix_edge in the number of inflight AF_UNIX fds.

There we just preallocate memory and do not use immediately because
sendmsg() could fail after this point.  The actual use will be in
the next patch.

When we queue skb with inflight edges, we will set the inflight
socket's unix_sock as unix_edge->predecessor and the receiver's
unix_sock as successor, and then we will link the edge to the
inflight socket's unix_vertex.edges.

Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup()
so that MSG_PEEK does not change the shape of the directed graph.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h | 6 ++++++
 include/net/scm.h     | 5 +++++
 net/core/scm.c        | 2 ++
 net/unix/garbage.c    | 6 ++++++
 4 files changed, 19 insertions(+)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index c270877a5256..55c4abc26a71 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -33,6 +33,12 @@ struct unix_vertex {
 	unsigned long out_degree;
 };
 
+struct unix_edge {
+	struct unix_sock *predecessor;
+	struct unix_sock *successor;
+	struct list_head vertex_entry;
+};
+
 struct sock *unix_peer_get(struct sock *sk);
 
 #define UNIX_HASH_MOD	(256 - 1)
diff --git a/include/net/scm.h b/include/net/scm.h
index e34321b6e204..5f5154e5096d 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -23,12 +23,17 @@ struct scm_creds {
 	kgid_t	gid;
 };
 
+#ifdef CONFIG_UNIX
+struct unix_edge;
+#endif
+
 struct scm_fp_list {
 	short			count;
 	short			count_unix;
 	short			max;
 #ifdef CONFIG_UNIX
 	struct list_head	vertices;
+	struct unix_edge	*edges;
 #endif
 	struct user_struct	*user;
 	struct file		*fp[SCM_MAX_FD];
diff --git a/net/core/scm.c b/net/core/scm.c
index 87dfec1c3378..1bcc8a2d65e3 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -90,6 +90,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 		fpl->max = SCM_MAX_FD;
 		fpl->user = NULL;
 #if IS_ENABLED(CONFIG_UNIX)
+		fpl->edges = NULL;
 		INIT_LIST_HEAD(&fpl->vertices);
 #endif
 	}
@@ -383,6 +384,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		new_fpl->max = new_fpl->count;
 		new_fpl->user = get_uid(fpl->user);
 #if IS_ENABLED(CONFIG_UNIX)
+		new_fpl->edges = NULL;
 		INIT_LIST_HEAD(&new_fpl->vertices);
 #endif
 	}
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 75bdf66b81df..f31917683288 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -127,6 +127,11 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
 		list_add(&vertex->entry, &fpl->vertices);
 	}
 
+	fpl->edges = kvmalloc_array(fpl->count_unix, sizeof(*fpl->edges),
+				    GFP_KERNEL_ACCOUNT);
+	if (!fpl->edges)
+		goto err;
+
 	return 0;
 
 err:
@@ -136,6 +141,7 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
 
 void unix_destroy_fpl(struct scm_fp_list *fpl)
 {
+	kvfree(fpl->edges);
 	unix_free_vertices(fpl);
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 03/15] af_unix: Link struct unix_edge when queuing skb.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 01/15] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 02/15] af_unix: Allocate struct unix_edge " Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 04/15] af_unix: Bulk update unix_tot_inflight/unix_inflight " Kuniyuki Iwashima
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Just before queuing skb with inflight fds, we call scm_stat_add(),
which is a good place to set up the preallocated struct unix_vertex
and struct unix_edge in UNIXCB(skb).fp.

Then, we call unix_add_edges() and construct the directed graph
as follows:

  1. Set the inflight socket's unix_sock to unix_edge.predecessor.
  2. Set the receiver's unix_sock to unix_edge.successor.
  3. Set the preallocated vertex to inflight socket's unix_sock.vertex.
  4. Link inflight socket's unix_vertex.entry to unix_unvisited_vertices.
  5. Link unix_edge.vertex_entry to the inflight socket's unix_vertex.edges.

Let's say we pass the fd of AF_UNIX socket A to B and the fd of B
to C.  The graph looks like this:

  +-------------------------+
  | unix_unvisited_vertices | <-------------------------.
  +-------------------------+                           |
  +                                                     |
  |     +--------------+             +--------------+   |         +--------------+
  |     |  unix_sock A | <---. .---> |  unix_sock B | <-|-. .---> |  unix_sock C |
  |     +--------------+     | |     +--------------+   | | |     +--------------+
  | .-+ |    vertex    |     | | .-+ |    vertex    |   | | |     |    vertex    |
  | |   +--------------+     | | |   +--------------+   | | |     +--------------+
  | |                        | | |                      | | |
  | |   +--------------+     | | |   +--------------+   | | |
  | '-> |  unix_vertex |     | | '-> |  unix_vertex |   | | |
  |     +--------------+     | |     +--------------+   | | |
  `---> |    entry     | +---------> |    entry     | +-' | |
        |--------------|     | |     |--------------|     | |
        |    edges     | <-. | |     |    edges     | <-. | |
        +--------------+   | | |     +--------------+   | | |
                           | | |                        | | |
    .----------------------' | | .----------------------' | |
    |                        | | |                        | |
    |   +--------------+     | | |   +--------------+     | |
    |   |   unix_edge  |     | | |   |   unix_edge  |     | |
    |   +--------------+     | | |   +--------------+     | |
    `-> | vertex_entry |     | | `-> | vertex_entry |     | |
        |--------------|     | |     |--------------|     | |
        |  predecessor | +---' |     |  predecessor | +---' |
        |--------------|       |     |--------------|       |
        |   successor  | +-----'     |   successor  | +-----'
        +--------------+             +--------------+

Henceforth, we denote such a graph as A -> B (-> C).

Now, we can express all inflight fd graphs that do not contain
embryo sockets.  We will support the particular case later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  2 +
 include/net/scm.h     |  1 +
 net/core/scm.c        |  2 +
 net/unix/af_unix.c    |  8 +++-
 net/unix/garbage.c    | 90 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 55c4abc26a71..f31ad1166346 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -22,6 +22,8 @@ extern unsigned int unix_tot_inflight;
 
 void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
+void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
+void unix_del_edges(struct scm_fp_list *fpl);
 int unix_prepare_fpl(struct scm_fp_list *fpl);
 void unix_destroy_fpl(struct scm_fp_list *fpl);
 void unix_gc(void);
diff --git a/include/net/scm.h b/include/net/scm.h
index 5f5154e5096d..bbc5527809d1 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -32,6 +32,7 @@ struct scm_fp_list {
 	short			count_unix;
 	short			max;
 #ifdef CONFIG_UNIX
+	bool			inflight;
 	struct list_head	vertices;
 	struct unix_edge	*edges;
 #endif
diff --git a/net/core/scm.c b/net/core/scm.c
index 1bcc8a2d65e3..5763f3320358 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -90,6 +90,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 		fpl->max = SCM_MAX_FD;
 		fpl->user = NULL;
 #if IS_ENABLED(CONFIG_UNIX)
+		fpl->inflight = false;
 		fpl->edges = NULL;
 		INIT_LIST_HEAD(&fpl->vertices);
 #endif
@@ -384,6 +385,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		new_fpl->max = new_fpl->count;
 		new_fpl->user = get_uid(fpl->user);
 #if IS_ENABLED(CONFIG_UNIX)
+		new_fpl->inflight = false;
 		new_fpl->edges = NULL;
 		INIT_LIST_HEAD(&new_fpl->vertices);
 #endif
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a3b25d311560..24adbc4d5188 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1943,8 +1943,10 @@ static void scm_stat_add(struct sock *sk, struct sk_buff *skb)
 	struct scm_fp_list *fp = UNIXCB(skb).fp;
 	struct unix_sock *u = unix_sk(sk);
 
-	if (unlikely(fp && fp->count))
+	if (unlikely(fp && fp->count)) {
 		atomic_add(fp->count, &u->scm_stat.nr_fds);
+		unix_add_edges(fp, u);
+	}
 }
 
 static void scm_stat_del(struct sock *sk, struct sk_buff *skb)
@@ -1952,8 +1954,10 @@ static void scm_stat_del(struct sock *sk, struct sk_buff *skb)
 	struct scm_fp_list *fp = UNIXCB(skb).fp;
 	struct unix_sock *u = unix_sk(sk);
 
-	if (unlikely(fp && fp->count))
+	if (unlikely(fp && fp->count)) {
 		atomic_sub(fp->count, &u->scm_stat.nr_fds);
+		unix_del_edges(fp);
+	}
 }
 
 /*
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index f31917683288..36d665936096 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -101,6 +101,38 @@ struct unix_sock *unix_get_socket(struct file *filp)
 	return NULL;
 }
 
+static LIST_HEAD(unix_unvisited_vertices);
+
+static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
+{
+	struct unix_vertex *vertex = edge->predecessor->vertex;
+
+	if (!vertex) {
+		vertex = list_first_entry(&fpl->vertices, typeof(*vertex), entry);
+		vertex->out_degree = 0;
+		INIT_LIST_HEAD(&vertex->edges);
+
+		list_move_tail(&vertex->entry, &unix_unvisited_vertices);
+		edge->predecessor->vertex = vertex;
+	}
+
+	vertex->out_degree++;
+	list_add_tail(&edge->vertex_entry, &vertex->edges);
+}
+
+static void unix_del_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
+{
+	struct unix_vertex *vertex = edge->predecessor->vertex;
+
+	list_del(&edge->vertex_entry);
+	vertex->out_degree--;
+
+	if (!vertex->out_degree) {
+		edge->predecessor->vertex = NULL;
+		list_move_tail(&vertex->entry, &fpl->vertices);
+	}
+}
+
 static void unix_free_vertices(struct scm_fp_list *fpl)
 {
 	struct unix_vertex *vertex, *next_vertex;
@@ -111,6 +143,60 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
 	}
 }
 
+DEFINE_SPINLOCK(unix_gc_lock);
+
+void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
+{
+	int i = 0, j = 0;
+
+	spin_lock(&unix_gc_lock);
+
+	if (!fpl->count_unix)
+		goto out;
+
+	do {
+		struct unix_sock *inflight = unix_get_socket(fpl->fp[j++]);
+		struct unix_edge *edge;
+
+		if (!inflight)
+			continue;
+
+		edge = fpl->edges + i++;
+		edge->predecessor = inflight;
+		edge->successor = receiver;
+
+		unix_add_edge(fpl, edge);
+	} while (i < fpl->count_unix);
+
+out:
+	spin_unlock(&unix_gc_lock);
+
+	fpl->inflight = true;
+
+	unix_free_vertices(fpl);
+}
+
+void unix_del_edges(struct scm_fp_list *fpl)
+{
+	int i = 0;
+
+	spin_lock(&unix_gc_lock);
+
+	if (!fpl->count_unix)
+		goto out;
+
+	do {
+		struct unix_edge *edge = fpl->edges + i++;
+
+		unix_del_edge(fpl, edge);
+	} while (i < fpl->count_unix);
+
+out:
+	spin_unlock(&unix_gc_lock);
+
+	fpl->inflight = false;
+}
+
 int unix_prepare_fpl(struct scm_fp_list *fpl)
 {
 	struct unix_vertex *vertex;
@@ -141,11 +227,13 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
 
 void unix_destroy_fpl(struct scm_fp_list *fpl)
 {
+	if (fpl->inflight)
+		unix_del_edges(fpl);
+
 	kvfree(fpl->edges);
 	unix_free_vertices(fpl);
 }
 
-DEFINE_SPINLOCK(unix_gc_lock);
 unsigned int unix_tot_inflight;
 static LIST_HEAD(gc_candidates);
 static LIST_HEAD(gc_inflight_list);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 04/15] af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 03/15] af_unix: Link struct unix_edge when queuing skb Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS Kuniyuki Iwashima
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Currently, we track the number of inflight sockets in two variables.
unix_tot_inflight is the total number of inflight AF_UNIX sockets on
the host, and user->unix_inflight is the number of inflight fds per
user.

We update them one by one in unix_inflight(), which can be done once
in batch.  Also, sendmsg() could fail even after unix_inflight(), then
we need to acquire unix_gc_lock only to decrement the counters.

Let's bulk update the counters in unix_add_edges() and unix_del_edges(),
which is called only for successfully passed fds.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 36d665936096..84c8ea524a98 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -144,6 +144,7 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
 }
 
 DEFINE_SPINLOCK(unix_gc_lock);
+unsigned int unix_tot_inflight;
 
 void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
 {
@@ -168,7 +169,10 @@ void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
 		unix_add_edge(fpl, edge);
 	} while (i < fpl->count_unix);
 
+	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + fpl->count_unix);
 out:
+	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight + fpl->count);
+
 	spin_unlock(&unix_gc_lock);
 
 	fpl->inflight = true;
@@ -191,7 +195,10 @@ void unix_del_edges(struct scm_fp_list *fpl)
 		unix_del_edge(fpl, edge);
 	} while (i < fpl->count_unix);
 
+	WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - fpl->count_unix);
 out:
+	WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight - fpl->count);
+
 	spin_unlock(&unix_gc_lock);
 
 	fpl->inflight = false;
@@ -234,7 +241,6 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
-unsigned int unix_tot_inflight;
 static LIST_HEAD(gc_candidates);
 static LIST_HEAD(gc_inflight_list);
 
@@ -255,13 +261,8 @@ void unix_inflight(struct user_struct *user, struct file *filp)
 			WARN_ON_ONCE(list_empty(&u->link));
 		}
 		u->inflight++;
-
-		/* Paired with READ_ONCE() in wait_for_unix_gc() */
-		WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + 1);
 	}
 
-	WRITE_ONCE(user->unix_inflight, user->unix_inflight + 1);
-
 	spin_unlock(&unix_gc_lock);
 }
 
@@ -278,13 +279,8 @@ void unix_notinflight(struct user_struct *user, struct file *filp)
 		u->inflight--;
 		if (!u->inflight)
 			list_del_init(&u->link);
-
-		/* Paired with READ_ONCE() in wait_for_unix_gc() */
-		WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - 1);
 	}
 
-	WRITE_ONCE(user->unix_inflight, user->unix_inflight - 1);
-
 	spin_unlock(&unix_gc_lock);
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 04/15] af_unix: Bulk update unix_tot_inflight/unix_inflight " Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-05  8:53   ` Paolo Abeni
  2024-03-01  2:22 ` [PATCH v4 net-next 06/15] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The new GC will use a depth first search graph algorithm to find
cyclic references.  The algorithm visits every vertex exactly once.

Here, we implement its DFS part without recursion so that no one
can abuse it.

unix_walk_scc() marks every vertex unvisited by initialising index
as UNIX_VERTEX_INDEX_UNVISITED, iterates inflight vertices in
unix_unvisited_vertices, and call __unix_walk_scc() to start DFS
from an arbitrary vertex.

__unix_walk_scc() iterates all edges starting from the vertex and
explores the neighbour vertices with DFS using edge_stack.

After visiting all neighbours, __unix_walk_scc() moves the visited
vertex to unix_visited_vertices so that unix_walk_scc() will not
restart DFS from the visited vertex.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  2 ++
 net/unix/garbage.c    | 74 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index f31ad1166346..970a91da2239 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -33,12 +33,14 @@ struct unix_vertex {
 	struct list_head edges;
 	struct list_head entry;
 	unsigned long out_degree;
+	unsigned long index;
 };
 
 struct unix_edge {
 	struct unix_sock *predecessor;
 	struct unix_sock *successor;
 	struct list_head vertex_entry;
+	struct list_head stack_entry;
 };
 
 struct sock *unix_peer_get(struct sock *sk);
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 84c8ea524a98..8b16ab9e240e 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -103,6 +103,11 @@ struct unix_sock *unix_get_socket(struct file *filp)
 
 static LIST_HEAD(unix_unvisited_vertices);
 
+enum unix_vertex_index {
+	UNIX_VERTEX_INDEX_UNVISITED,
+	UNIX_VERTEX_INDEX_START,
+};
+
 static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 {
 	struct unix_vertex *vertex = edge->predecessor->vertex;
@@ -241,6 +246,73 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
+static LIST_HEAD(unix_visited_vertices);
+
+static void __unix_walk_scc(struct unix_vertex *vertex)
+{
+	unsigned long index = UNIX_VERTEX_INDEX_START;
+	struct unix_edge *edge;
+	LIST_HEAD(edge_stack);
+
+next_vertex:
+	vertex->index = index;
+	index++;
+
+	/* Explore neighbour vertices (receivers of the current vertex's fd). */
+	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
+		struct unix_vertex *next_vertex = edge->successor->vertex;
+
+		if (!next_vertex)
+			continue;
+
+		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
+			/* Iterative deepening depth first search
+			 *
+			 *   1. Push a forward edge to edge_stack and set
+			 *      the successor to vertex for the next iteration.
+			 */
+			list_add(&edge->stack_entry, &edge_stack);
+
+			vertex = next_vertex;
+			goto next_vertex;
+
+			/*   2. Pop the edge directed to the current vertex
+			 *      and restore the ancestor for backtracking.
+			 */
+prev_vertex:
+			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
+			list_del_init(&edge->stack_entry);
+
+			vertex = edge->predecessor->vertex;
+		}
+	}
+
+	/* Don't restart DFS from this vertex in unix_walk_scc(). */
+	list_move_tail(&vertex->entry, &unix_visited_vertices);
+
+	/* Need backtracking ? */
+	if (!list_empty(&edge_stack))
+		goto prev_vertex;
+}
+
+static void unix_walk_scc(void)
+{
+	struct unix_vertex *vertex;
+
+	list_for_each_entry(vertex, &unix_unvisited_vertices, entry)
+		vertex->index = UNIX_VERTEX_INDEX_UNVISITED;
+
+	/* Visit every vertex exactly once.
+	 * __unix_walk_scc() moves visited vertices to unix_visited_vertices.
+	 */
+	while (!list_empty(&unix_unvisited_vertices)) {
+		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
+		__unix_walk_scc(vertex);
+	}
+
+	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
+}
+
 static LIST_HEAD(gc_candidates);
 static LIST_HEAD(gc_inflight_list);
 
@@ -388,6 +460,8 @@ static void __unix_gc(struct work_struct *work)
 
 	spin_lock(&unix_gc_lock);
 
+	unix_walk_scc();
+
 	/* First, select candidates for garbage collection.  Only
 	 * in-flight sockets are considered, and from those only ones
 	 * which don't have any external reference.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 06/15] af_unix: Detect Strongly Connected Components.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 07/15] af_unix: Save listener for embryo socket Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

In the new GC, we use a simple graph algorithm, Tarjan's Strongly
Connected Components (SCC) algorithm, to find cyclic references.

The algorithm visits every vertex exactly once using depth-first
search (DFS).

DFS starts by pushing an input vertex to a stack and assigning it
a unique number.  Two fields, index and lowlink, are initialised
with the number, but lowlink could be updated later during DFS.

If a vertex has an edge to an unvisited inflight vertex, we visit
it and do the same processing.  So, we will have vertices in the
stack in the order they appear and number them consecutively in
the same order.

If a vertex has a back-edge to a visited vertex in the stack,
we update the predecessor's lowlink with the successor's index.

After iterating edges from the vertex, we check if its index
equals its lowlink.

If the lowlink is different from the index, it shows there was a
back-edge.  Then, we go backtracking and propagate the lowlink to
its predecessor and resume the previous edge iteration from the
next edge.

If the lowlink is the same as the index, we pop vertices before
and including the vertex from the stack.  Then, the set of vertices
is SCC, possibly forming a cycle.  At the same time, we move the
vertices to unix_visited_vertices.

When we finish the algorithm, all vertices in each SCC will be
linked via unix_vertex.scc_entry.

Let's take an example.  We have a graph including five inflight
vertices (F is not inflight):

  A -> B -> C -> D -> E (-> F)
       ^         |
       `---------'

Suppose that we start DFS from C.  We will visit C, D, and B first
and initialise their index and lowlink.  Then, the stack looks like
this:

  > B = (3, 3)  (index, lowlink)
    D = (2, 2)
    C = (1, 1)

When checking B's edge to C, we update B's lowlink with C's index
and propagate it to D.

    B = (3, 1)  (index, lowlink)
  > D = (2, 1)
    C = (1, 1)

Next, we visit E, which has no edge to an inflight vertex.

  > E = (4, 4)  (index, lowlink)
    B = (3, 1)
    D = (2, 1)
    C = (1, 1)

When we leave from E, its index and lowlink are the same, so we
pop E from the stack as single-vertex SCC.  Next, we leave from
D but do nothing because its lowlink is different from its index.

    B = (3, 1)  (index, lowlink)
    D = (2, 1)
  > C = (1, 1)

Then, we leave from C, whose index and lowlink are the same, so
we pop B, D and C as SCC.

Last, we do DFS for the rest of vertices, A, which is also a
single-vertex SCC.

Finally, each unix_vertex.scc_entry is linked as follows:

  A -.  B -> C -> D  E -.
  ^  |  ^         |  ^  |
  `--'  `---------'  `--'

We use SCC later to decide whether we can garbage-collect the
sockets.

Note that we still cannot detect SCC properly if an edge points
to an embryo socket.  The following two patches will sort it out.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  3 +++
 net/unix/garbage.c    | 46 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 970a91da2239..67736767b616 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -32,8 +32,11 @@ void wait_for_unix_gc(struct scm_fp_list *fpl);
 struct unix_vertex {
 	struct list_head edges;
 	struct list_head entry;
+	struct list_head scc_entry;
 	unsigned long out_degree;
 	unsigned long index;
+	unsigned long lowlink;
+	bool on_stack;
 };
 
 struct unix_edge {
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 8b16ab9e240e..33aadaa35346 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -251,11 +251,19 @@ static LIST_HEAD(unix_visited_vertices);
 static void __unix_walk_scc(struct unix_vertex *vertex)
 {
 	unsigned long index = UNIX_VERTEX_INDEX_START;
+	LIST_HEAD(vertex_stack);
 	struct unix_edge *edge;
 	LIST_HEAD(edge_stack);
 
 next_vertex:
+	/* Push vertex to vertex_stack.
+	 * The vertex will be popped when finalising SCC later.
+	 */
+	vertex->on_stack = true;
+	list_add(&vertex->scc_entry, &vertex_stack);
+
 	vertex->index = index;
+	vertex->lowlink = index;
 	index++;
 
 	/* Explore neighbour vertices (receivers of the current vertex's fd). */
@@ -283,12 +291,46 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
 			list_del_init(&edge->stack_entry);
 
+			next_vertex = vertex;
 			vertex = edge->predecessor->vertex;
+
+			/* If the successor has a smaller lowlink, two vertices
+			 * are in the same SCC, so propagate the smaller lowlink
+			 * to skip SCC finalisation.
+			 */
+			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
+		} else if (next_vertex->on_stack) {
+			/* Loop detected by a back/cross edge.
+			 *
+			 * The successor is on vertex_stack, so two vertices are
+			 * in the same SCC.  If the successor has a smaller index,
+			 * propagate it to skip SCC finalisation.
+			 */
+			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
+		} else {
+			/* The successor was already grouped as another SCC */
 		}
 	}
 
-	/* Don't restart DFS from this vertex in unix_walk_scc(). */
-	list_move_tail(&vertex->entry, &unix_visited_vertices);
+	if (vertex->index == vertex->lowlink) {
+		struct list_head scc;
+
+		/* SCC finalised.
+		 *
+		 * If the lowlink was not updated, all the vertices above on
+		 * vertex_stack are in the same SCC.  Group them using scc_entry.
+		 */
+		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
+
+		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
+			/* Don't restart DFS from this vertex in unix_walk_scc(). */
+			list_move_tail(&vertex->entry, &unix_visited_vertices);
+
+			vertex->on_stack = false;
+		}
+
+		list_del(&scc);
+	}
 
 	/* Need backtracking ? */
 	if (!list_empty(&edge_stack))
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 07/15] af_unix: Save listener for embryo socket.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 06/15] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 08/15] af_unix: Fix up unix_edge.successor " Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

This is a prep patch for the following change, where we need to
fetch the listening socket from the successor embryo socket
during GC.

We add a new field to struct unix_sock to save a pointer to a
listening socket.

We set it when connect() creates a new socket, and clear it when
accept() is called.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h | 1 +
 net/unix/af_unix.c    | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 67736767b616..dc7469191195 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -83,6 +83,7 @@ struct unix_sock {
 	struct path		path;
 	struct mutex		iolock, bindlock;
 	struct sock		*peer;
+	struct sock		*listener;
 	struct unix_vertex	*vertex;
 	struct list_head	link;
 	unsigned long		inflight;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 24adbc4d5188..af74e7ebc35a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -979,6 +979,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_max_ack_backlog	= net->unx.sysctl_max_dgram_qlen;
 	sk->sk_destruct		= unix_sock_destructor;
 	u = unix_sk(sk);
+	u->listener = NULL;
 	u->inflight = 0;
 	u->vertex = NULL;
 	u->path.dentry = NULL;
@@ -1598,6 +1599,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	newsk->sk_type		= sk->sk_type;
 	init_peercred(newsk);
 	newu = unix_sk(newsk);
+	newu->listener = other;
 	RCU_INIT_POINTER(newsk->sk_wq, &newu->peer_wq);
 	otheru = unix_sk(other);
 
@@ -1693,8 +1695,8 @@ static int unix_accept(struct socket *sock, struct socket *newsock, int flags,
 		       bool kern)
 {
 	struct sock *sk = sock->sk;
-	struct sock *tsk;
 	struct sk_buff *skb;
+	struct sock *tsk;
 	int err;
 
 	err = -EOPNOTSUPP;
@@ -1719,6 +1721,7 @@ static int unix_accept(struct socket *sock, struct socket *newsock, int flags,
 	}
 
 	tsk = skb->sk;
+	unix_sk(tsk)->listener = NULL;
 	skb_free_datagram(sk, skb);
 	wake_up_interruptible(&unix_sk(sk)->peer_wait);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 08/15] af_unix: Fix up unix_edge.successor for embryo socket.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 07/15] af_unix: Save listener for embryo socket Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 09/15] af_unix: Save O(n) setup of Tarjan's algo Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

To garbage collect inflight AF_UNIX sockets, we must define the
cyclic reference appropriately.  This is a bit tricky if the loop
consists of embryo sockets.

Suppose that the fd of AF_UNIX socket A is passed to D and the fd B
to C and that C and D are embryo sockets of A and B, respectively.
It may appear that there are two separate graphs, A (-> D) and
B (-> C), but this is not correct.

     A --. .-- B
          X
     C <-' `-> D

Now, D holds A's refcount, and C has B's refcount, so unix_release()
will never be called for A and B when we close() them.  However, no
one can call close() for D and C to free skbs holding refcounts of A
and B because C/D is in A/B's receive queue, which should have been
purged by unix_release() for A and B.

So, here's another type of cyclic reference.  When a fd of an AF_UNIX
socket is passed to an embryo socket, the reference is indirectly held
by its parent listening socket.

  .-> A                            .-> B
  |   `- sk_receive_queue          |   `- sk_receive_queue
  |      `- skb                    |      `- skb
  |         `- sk == C             |         `- sk == D
  |            `- sk_receive_queue |           `- sk_receive_queue
  |               `- skb +---------'               `- skb +-.
  |                                                         |
  `---------------------------------------------------------'

Technically, the graph must be denoted as A <-> B instead of A (-> D)
and B (-> C) to find such a cyclic reference without touching each
socket's receive queue.

  .-> A --. .-- B <-.
  |        X        |  ==  A <-> B
  `-- C <-' `-> D --'

We apply this fixup during GC by fetching the real successor by
unix_edge_successor().

When we call accept(), we clear unix_sock.listener under unix_gc_lock
not to confuse GC.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    |  2 +-
 net/unix/garbage.c    | 20 +++++++++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index dc7469191195..414463803b7e 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -24,6 +24,7 @@ void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
 void unix_del_edges(struct scm_fp_list *fpl);
+void unix_update_edges(struct unix_sock *receiver);
 int unix_prepare_fpl(struct scm_fp_list *fpl);
 void unix_destroy_fpl(struct scm_fp_list *fpl);
 void unix_gc(void);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index af74e7ebc35a..ae77e2dc0dae 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1721,7 +1721,7 @@ static int unix_accept(struct socket *sock, struct socket *newsock, int flags,
 	}
 
 	tsk = skb->sk;
-	unix_sk(tsk)->listener = NULL;
+	unix_update_edges(unix_sk(tsk));
 	skb_free_datagram(sk, skb);
 	wake_up_interruptible(&unix_sk(sk)->peer_wait);
 
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 33aadaa35346..40a40028261b 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -101,6 +101,17 @@ struct unix_sock *unix_get_socket(struct file *filp)
 	return NULL;
 }
 
+static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
+{
+	/* If an embryo socket has a fd, the listener has
+	 * indirectly holds the fd's refcnt.
+	 */
+	if (edge->successor->listener)
+		return unix_sk(edge->successor->listener)->vertex;
+
+	return edge->successor->vertex;
+}
+
 static LIST_HEAD(unix_unvisited_vertices);
 
 enum unix_vertex_index {
@@ -209,6 +220,13 @@ void unix_del_edges(struct scm_fp_list *fpl)
 	fpl->inflight = false;
 }
 
+void unix_update_edges(struct unix_sock *receiver)
+{
+	spin_lock(&unix_gc_lock);
+	receiver->listener = NULL;
+	spin_unlock(&unix_gc_lock);
+}
+
 int unix_prepare_fpl(struct scm_fp_list *fpl)
 {
 	struct unix_vertex *vertex;
@@ -268,7 +286,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 	/* Explore neighbour vertices (receivers of the current vertex's fd). */
 	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
-		struct unix_vertex *next_vertex = edge->successor->vertex;
+		struct unix_vertex *next_vertex = unix_edge_successor(edge);
 
 		if (!next_vertex)
 			continue;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 09/15] af_unix: Save O(n) setup of Tarjan's algo.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 08/15] af_unix: Fix up unix_edge.successor " Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 10/15] af_unix: Skip GC if no cycle exists Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Before starting Tarjan's algorithm, we need to mark all vertices
as unvisited.  We can save this O(n) setup by reserving two special
indices (0, 1) and using two variables.

The first time we link a vertex to unix_unvisited_vertices, we set
unix_vertex_unvisited_index to index.

During DFS, we can see that the index of unvisited vertices is the
same as unix_vertex_unvisited_index.

When we finalise SCC later, we set unix_vertex_grouped_index to each
vertex's index.

Then, we can know (i) that the vertex is on the stack if the index
of a visited vertex is >= 2 and (ii) that it is not on the stack and
belongs to a different SCC if the index is unix_vertex_grouped_index.

After the whole algorithm, all indices of vertices are set as
unix_vertex_grouped_index.

Next time we start DFS, we know that all unvisited vertices have
unix_vertex_grouped_index, and we can use unix_vertex_unvisited_index
as the not-on-stack marker.

To use the same variable in __unix_walk_scc(), we can swap
unix_vertex_(grouped|unvisited)_index at the end of Tarjan's
algorithm.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  1 -
 net/unix/garbage.c    | 26 +++++++++++++++-----------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 414463803b7e..ec040caaa4b5 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -37,7 +37,6 @@ struct unix_vertex {
 	unsigned long out_degree;
 	unsigned long index;
 	unsigned long lowlink;
-	bool on_stack;
 };
 
 struct unix_edge {
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 40a40028261b..3f55656c377e 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -115,16 +115,20 @@ static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
 static LIST_HEAD(unix_unvisited_vertices);
 
 enum unix_vertex_index {
-	UNIX_VERTEX_INDEX_UNVISITED,
+	UNIX_VERTEX_INDEX_MARK1,
+	UNIX_VERTEX_INDEX_MARK2,
 	UNIX_VERTEX_INDEX_START,
 };
 
+static unsigned long unix_vertex_unvisited_index = UNIX_VERTEX_INDEX_MARK1;
+
 static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 {
 	struct unix_vertex *vertex = edge->predecessor->vertex;
 
 	if (!vertex) {
 		vertex = list_first_entry(&fpl->vertices, typeof(*vertex), entry);
+		vertex->index = unix_vertex_unvisited_index;
 		vertex->out_degree = 0;
 		INIT_LIST_HEAD(&vertex->edges);
 
@@ -265,6 +269,7 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 }
 
 static LIST_HEAD(unix_visited_vertices);
+static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
 
 static void __unix_walk_scc(struct unix_vertex *vertex)
 {
@@ -274,10 +279,10 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 	LIST_HEAD(edge_stack);
 
 next_vertex:
-	/* Push vertex to vertex_stack.
+	/* Push vertex to vertex_stack and mark it as on-stack
+	 * (index >= UNIX_VERTEX_INDEX_START).
 	 * The vertex will be popped when finalising SCC later.
 	 */
-	vertex->on_stack = true;
 	list_add(&vertex->scc_entry, &vertex_stack);
 
 	vertex->index = index;
@@ -291,7 +296,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 		if (!next_vertex)
 			continue;
 
-		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
+		if (next_vertex->index == unix_vertex_unvisited_index) {
 			/* Iterative deepening depth first search
 			 *
 			 *   1. Push a forward edge to edge_stack and set
@@ -317,7 +322,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			 * to skip SCC finalisation.
 			 */
 			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
-		} else if (next_vertex->on_stack) {
+		} else if (next_vertex->index != unix_vertex_grouped_index) {
 			/* Loop detected by a back/cross edge.
 			 *
 			 * The successor is on vertex_stack, so two vertices are
@@ -344,7 +349,8 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			/* Don't restart DFS from this vertex in unix_walk_scc(). */
 			list_move_tail(&vertex->entry, &unix_visited_vertices);
 
-			vertex->on_stack = false;
+			/* Mark vertex as off-stack. */
+			vertex->index = unix_vertex_grouped_index;
 		}
 
 		list_del(&scc);
@@ -357,20 +363,18 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 static void unix_walk_scc(void)
 {
-	struct unix_vertex *vertex;
-
-	list_for_each_entry(vertex, &unix_unvisited_vertices, entry)
-		vertex->index = UNIX_VERTEX_INDEX_UNVISITED;
-
 	/* Visit every vertex exactly once.
 	 * __unix_walk_scc() moves visited vertices to unix_visited_vertices.
 	 */
 	while (!list_empty(&unix_unvisited_vertices)) {
+		struct unix_vertex *vertex;
+
 		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
 		__unix_walk_scc(vertex);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
+	swap(unix_vertex_unvisited_index, unix_vertex_grouped_index);
 }
 
 static LIST_HEAD(gc_candidates);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 10/15] af_unix: Skip GC if no cycle exists.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 09/15] af_unix: Save O(n) setup of Tarjan's algo Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 11/15] af_unix: Avoid Tarjan's algorithm if unnecessary Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We do not need to run GC if there is no possible cyclic reference.
We use unix_graph_maybe_cyclic to decide if we should run GC.

If a fd of an AF_UNIX socket is passed to an already inflight AF_UNIX
socket, they could form a cyclic reference.  Then, we set true to
unix_graph_maybe_cyclic and later run Tarjan's algorithm to group
them into SCC.

Once we run Tarjan's algorithm, we are 100% sure whether cyclic
references exist or not.  If there is no cycle, we set false to
unix_graph_maybe_cyclic and can skip the entire garbage collection
next time.

When finalising SCC, we set true to unix_graph_maybe_cyclic if SCC
consists of multiple vertices.

Even if SCC is a single vertex, a cycle might exist as self-fd passing.
Given the corner case is rare, we detect it by checking all edges of
the vertex and set true to unix_graph_maybe_cyclic.

With this change, __unix_gc() is just a spin_lock() dance in the normal
usage.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 48 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 3f55656c377e..a9bd087b430c 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -112,6 +112,19 @@ static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
 	return edge->successor->vertex;
 }
 
+static bool unix_graph_maybe_cyclic;
+
+static void unix_update_graph(struct unix_vertex *vertex)
+{
+	/* If the receiver socket is not inflight, no cyclic
+	 * reference could be formed.
+	 */
+	if (!vertex)
+		return;
+
+	unix_graph_maybe_cyclic = true;
+}
+
 static LIST_HEAD(unix_unvisited_vertices);
 
 enum unix_vertex_index {
@@ -138,12 +151,16 @@ static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 
 	vertex->out_degree++;
 	list_add_tail(&edge->vertex_entry, &vertex->edges);
+
+	unix_update_graph(unix_edge_successor(edge));
 }
 
 static void unix_del_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 {
 	struct unix_vertex *vertex = edge->predecessor->vertex;
 
+	unix_update_graph(unix_edge_successor(edge));
+
 	list_del(&edge->vertex_entry);
 	vertex->out_degree--;
 
@@ -227,6 +244,7 @@ void unix_del_edges(struct scm_fp_list *fpl)
 void unix_update_edges(struct unix_sock *receiver)
 {
 	spin_lock(&unix_gc_lock);
+	unix_update_graph(unix_sk(receiver->listener)->vertex);
 	receiver->listener = NULL;
 	spin_unlock(&unix_gc_lock);
 }
@@ -268,6 +286,26 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
+static bool unix_scc_cyclic(struct list_head *scc)
+{
+	struct unix_vertex *vertex;
+	struct unix_edge *edge;
+
+	/* Multiple vertices form a circle ? */
+	if (!list_is_singular(scc))
+		return true;
+
+	vertex = list_first_entry(scc, typeof(*vertex), scc_entry);
+
+	/* Self-reference or a embryo-listener circle exists ? */
+	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
+		if (unix_edge_successor(edge) == vertex)
+			return true;
+	}
+
+	return false;
+}
+
 static LIST_HEAD(unix_visited_vertices);
 static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
 
@@ -353,6 +391,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			vertex->index = unix_vertex_grouped_index;
 		}
 
+		if (!unix_graph_maybe_cyclic)
+			unix_graph_maybe_cyclic = unix_scc_cyclic(&scc);
+
 		list_del(&scc);
 	}
 
@@ -363,6 +404,8 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 static void unix_walk_scc(void)
 {
+	unix_graph_maybe_cyclic = false;
+
 	/* Visit every vertex exactly once.
 	 * __unix_walk_scc() moves visited vertices to unix_visited_vertices.
 	 */
@@ -524,6 +567,9 @@ static void __unix_gc(struct work_struct *work)
 
 	spin_lock(&unix_gc_lock);
 
+	if (!unix_graph_maybe_cyclic)
+		goto skip_gc;
+
 	unix_walk_scc();
 
 	/* First, select candidates for garbage collection.  Only
@@ -617,7 +663,7 @@ static void __unix_gc(struct work_struct *work)
 
 	/* All candidates should have been detached by now. */
 	WARN_ON_ONCE(!list_empty(&gc_candidates));
-
+skip_gc:
 	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
 	WRITE_ONCE(gc_in_progress, false);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 11/15] af_unix: Avoid Tarjan's algorithm if unnecessary.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 10/15] af_unix: Skip GC if no cycle exists Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 12/15] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Once a cyclic reference is formed, we need to run GC to check if
there is dead SCC.

However, we do not need to run Tarjan's algorithm if we know that
the shape of the inflight graph has not been changed.

If an edge is added/updated/deleted and the edge's successor is
inflight, we set false to unix_graph_grouped, which means we need
to re-classify SCC.

Once we finalise SCC, we set true to unix_graph_grouped.

While unix_graph_grouped is true, we can iterate the grouped
SCC using vertex->scc_entry in unix_walk_scc_fast().

list_add() and list_for_each_entry_reverse() uses seem weird, but
they are to keep the vertex order consistent and make writing test
easier.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a9bd087b430c..60cee04d0301 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -113,6 +113,7 @@ static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
 }
 
 static bool unix_graph_maybe_cyclic;
+static bool unix_graph_grouped;
 
 static void unix_update_graph(struct unix_vertex *vertex)
 {
@@ -123,6 +124,7 @@ static void unix_update_graph(struct unix_vertex *vertex)
 		return;
 
 	unix_graph_maybe_cyclic = true;
+	unix_graph_grouped = false;
 }
 
 static LIST_HEAD(unix_unvisited_vertices);
@@ -144,6 +146,7 @@ static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
 		vertex->index = unix_vertex_unvisited_index;
 		vertex->out_degree = 0;
 		INIT_LIST_HEAD(&vertex->edges);
+		INIT_LIST_HEAD(&vertex->scc_entry);
 
 		list_move_tail(&vertex->entry, &unix_unvisited_vertices);
 		edge->predecessor->vertex = vertex;
@@ -418,6 +421,26 @@ static void unix_walk_scc(void)
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
 	swap(unix_vertex_unvisited_index, unix_vertex_grouped_index);
+
+	unix_graph_grouped = true;
+}
+
+static void unix_walk_scc_fast(void)
+{
+	while (!list_empty(&unix_unvisited_vertices)) {
+		struct unix_vertex *vertex;
+		struct list_head scc;
+
+		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
+		list_add(&scc, &vertex->scc_entry);
+
+		list_for_each_entry_reverse(vertex, &scc, scc_entry)
+			list_move_tail(&vertex->entry, &unix_visited_vertices);
+
+		list_del(&scc);
+	}
+
+	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
 }
 
 static LIST_HEAD(gc_candidates);
@@ -570,7 +593,10 @@ static void __unix_gc(struct work_struct *work)
 	if (!unix_graph_maybe_cyclic)
 		goto skip_gc;
 
-	unix_walk_scc();
+	if (unix_graph_grouped)
+		unix_walk_scc_fast();
+	else
+		unix_walk_scc();
 
 	/* First, select candidates for garbage collection.  Only
 	 * in-flight sockets are considered, and from those only ones
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 12/15] af_unix: Assign a unique index to SCC.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 11/15] af_unix: Avoid Tarjan's algorithm if unnecessary Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-05  8:44   ` Paolo Abeni
  2024-03-01  2:22 ` [PATCH v4 net-next 13/15] af_unix: Detect dead SCC Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The definition of the lowlink in Tarjan's algorithm is the
smallest index of a vertex that is reachable with at most one
back-edge in SCC.  This is not useful for a cross-edge.

If we start traversing from A in the following graph, the final
lowlink of D is 3.  The cross-edge here is one between D and C.

  A -> B -> D   D = (4, 3)  (index, lowlink)
  ^    |    |   C = (3, 1)
  |    V    |   B = (2, 1)
  `--- C <--'   A = (1, 1)

This is because the lowlink of D is updated with the index of C.

In the following patch, we detect a dead SCC by checking two
conditions for each vertex.

  1) vertex has no edge directed to another SCC (no bridge)
  2) vertex's out_degree is the same as the refcount of its file

If 1) is false, there is a receiver of all fds of the SCC and
its ancestor SCC.

To evaluate 1), we need to assign a unique index to each SCC and
assign it to all vertices in the SCC.

This patch changes the lowlink update logic for cross-edge so
that in the example above, the lowlink of D is updated with the
lowlink of C.

  A -> B -> D   D = (4, 1)  (index, lowlink)
  ^    |    |   C = (3, 1)
  |    V    |   B = (2, 1)
  `--- C <--'   A = (1, 1)

Then, all vertices in the same SCC have the same lowlink, and we
can quickly find the bridge connecting to different SCC if exists.

However, it is no longer called lowlink, so we rename it to
scc_index.  (It's sometimes called lowpoint.)

Also, we add a global variable to hold the last index used in DFS
so that we do not reset the initial index in each DFS.

This patch can be squashed to the SCC detection patch but is
split deliberately for anyone wondering why lowlink is not used
as used in the original Tarjan's algorithm and many reference
implementations.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c    | 29 +++++++++++++++--------------
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index ec040caaa4b5..696d997a5ac9 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -36,7 +36,7 @@ struct unix_vertex {
 	struct list_head scc_entry;
 	unsigned long out_degree;
 	unsigned long index;
-	unsigned long lowlink;
+	unsigned long scc_index;
 };
 
 struct unix_edge {
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 60cee04d0301..e825894d8bca 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -312,9 +312,8 @@ static bool unix_scc_cyclic(struct list_head *scc)
 static LIST_HEAD(unix_visited_vertices);
 static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
 
-static void __unix_walk_scc(struct unix_vertex *vertex)
+static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_index)
 {
-	unsigned long index = UNIX_VERTEX_INDEX_START;
 	LIST_HEAD(vertex_stack);
 	struct unix_edge *edge;
 	LIST_HEAD(edge_stack);
@@ -326,9 +325,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 	 */
 	list_add(&vertex->scc_entry, &vertex_stack);
 
-	vertex->index = index;
-	vertex->lowlink = index;
-	index++;
+	vertex->index = *last_index;
+	vertex->scc_index = *last_index;
+	(*last_index)++;
 
 	/* Explore neighbour vertices (receivers of the current vertex's fd). */
 	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
@@ -358,30 +357,30 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 			next_vertex = vertex;
 			vertex = edge->predecessor->vertex;
 
-			/* If the successor has a smaller lowlink, two vertices
-			 * are in the same SCC, so propagate the smaller lowlink
+			/* If the successor has a smaller scc_index, two vertices
+			 * are in the same SCC, so propagate the smaller scc_index
 			 * to skip SCC finalisation.
 			 */
-			vertex->lowlink = min(vertex->lowlink, next_vertex->lowlink);
+			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
 		} else if (next_vertex->index != unix_vertex_grouped_index) {
 			/* Loop detected by a back/cross edge.
 			 *
-			 * The successor is on vertex_stack, so two vertices are
-			 * in the same SCC.  If the successor has a smaller index,
+			 * The successor is on vertex_stack, so two vertices are in
+			 * the same SCC.  If the successor has a smaller *scc_index*,
 			 * propagate it to skip SCC finalisation.
 			 */
-			vertex->lowlink = min(vertex->lowlink, next_vertex->index);
+			vertex->scc_index = min(vertex->scc_index, next_vertex->scc_index);
 		} else {
 			/* The successor was already grouped as another SCC */
 		}
 	}
 
-	if (vertex->index == vertex->lowlink) {
+	if (vertex->index == vertex->scc_index) {
 		struct list_head scc;
 
 		/* SCC finalised.
 		 *
-		 * If the lowlink was not updated, all the vertices above on
+		 * If the scc_index was not updated, all the vertices above on
 		 * vertex_stack are in the same SCC.  Group them using scc_entry.
 		 */
 		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
@@ -407,6 +406,8 @@ static void __unix_walk_scc(struct unix_vertex *vertex)
 
 static void unix_walk_scc(void)
 {
+	unsigned long last_index = UNIX_VERTEX_INDEX_START;
+
 	unix_graph_maybe_cyclic = false;
 
 	/* Visit every vertex exactly once.
@@ -416,7 +417,7 @@ static void unix_walk_scc(void)
 		struct unix_vertex *vertex;
 
 		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
-		__unix_walk_scc(vertex);
+		__unix_walk_scc(vertex, &last_index);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 13/15] af_unix: Detect dead SCC.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (11 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 12/15] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 14/15] af_unix: Replace garbage collection algorithm Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

When iterating SCC, we call unix_vertex_dead() for each vertex
to check if the vertex is close()d and has no bridge to another
SCC.

If both conditions are true for every vertex in SCC, we can
execute garbage collection for all skb in the SCC.

The actual garbage collection is done in the following patch,
replacing the old implementation.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index e825894d8bca..684e2c843b1e 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -289,6 +289,39 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
 	unix_free_vertices(fpl);
 }
 
+static bool unix_vertex_dead(struct unix_vertex *vertex)
+{
+	struct unix_edge *edge;
+	struct unix_sock *u;
+	long total_ref;
+
+	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
+		struct unix_vertex *next_vertex = unix_edge_successor(edge);
+
+		/* The vertex's fd can be received by a non-inflight socket. */
+		if (!next_vertex)
+			return false;
+
+		/* The vertex's fd can be received by an inflight socket in
+		 * another SCC.
+		 */
+		if (next_vertex->scc_index != vertex->scc_index)
+			return false;
+	}
+
+	/* No receiver exists out of the same SCC. */
+
+	edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
+	u = edge->predecessor;
+	total_ref = file_count(u->sk.sk_socket->file);
+
+	/* If not close()d, total_ref > out_degree. */
+	if (total_ref != vertex->out_degree)
+		return false;
+
+	return true;
+}
+
 static bool unix_scc_cyclic(struct list_head *scc)
 {
 	struct unix_vertex *vertex;
@@ -377,6 +410,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_inde
 
 	if (vertex->index == vertex->scc_index) {
 		struct list_head scc;
+		bool scc_dead = true;
 
 		/* SCC finalised.
 		 *
@@ -391,6 +425,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_inde
 
 			/* Mark vertex as off-stack. */
 			vertex->index = unix_vertex_grouped_index;
+
+			if (scc_dead)
+				scc_dead = unix_vertex_dead(vertex);
 		}
 
 		if (!unix_graph_maybe_cyclic)
@@ -431,13 +468,18 @@ static void unix_walk_scc_fast(void)
 	while (!list_empty(&unix_unvisited_vertices)) {
 		struct unix_vertex *vertex;
 		struct list_head scc;
+		bool scc_dead = true;
 
 		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
 		list_add(&scc, &vertex->scc_entry);
 
-		list_for_each_entry_reverse(vertex, &scc, scc_entry)
+		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
 			list_move_tail(&vertex->entry, &unix_visited_vertices);
 
+			if (scc_dead)
+				scc_dead = unix_vertex_dead(vertex);
+		}
+
 		list_del(&scc);
 	}
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 14/15] af_unix: Replace garbage collection algorithm.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (12 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 13/15] af_unix: Detect dead SCC Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-01  2:22 ` [PATCH v4 net-next 15/15] selftest: af_unix: Test GC for SCM_RIGHTS Kuniyuki Iwashima
  2024-03-04 16:18 ` [PATCH v4 net-next 00/15] af_unix: Rework GC Paolo Abeni
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

If we find a dead SCC during iteration, we call unix_collect_skb()
to splice all skb in the SCC to the global sk_buff_head, hitlist.

After iterating all SCC, we unlock unix_gc_lock and purge the queue.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h |   8 --
 net/unix/af_unix.c    |  12 --
 net/unix/garbage.c    | 302 +++++++++---------------------------------
 3 files changed, 64 insertions(+), 258 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 696d997a5ac9..226a8da2cbe3 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -19,9 +19,6 @@ static inline struct unix_sock *unix_get_socket(struct file *filp)
 
 extern spinlock_t unix_gc_lock;
 extern unsigned int unix_tot_inflight;
-
-void unix_inflight(struct user_struct *user, struct file *fp);
-void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
 void unix_del_edges(struct scm_fp_list *fpl);
 void unix_update_edges(struct unix_sock *receiver);
@@ -85,12 +82,7 @@ struct unix_sock {
 	struct sock		*peer;
 	struct sock		*listener;
 	struct unix_vertex	*vertex;
-	struct list_head	link;
-	unsigned long		inflight;
 	spinlock_t		lock;
-	unsigned long		gc_flags;
-#define UNIX_GC_CANDIDATE	0
-#define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
 	wait_queue_entry_t	peer_wake;
 	struct scm_stat		scm_stat;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index ae77e2dc0dae..27ca50ab1cd1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -980,12 +980,10 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_destruct		= unix_sock_destructor;
 	u = unix_sk(sk);
 	u->listener = NULL;
-	u->inflight = 0;
 	u->vertex = NULL;
 	u->path.dentry = NULL;
 	u->path.mnt = NULL;
 	spin_lock_init(&u->lock);
-	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->iolock); /* single task reading lock */
 	mutex_init(&u->bindlock); /* single task binding lock */
 	init_waitqueue_head(&u->peer_wait);
@@ -1793,8 +1791,6 @@ static inline bool too_many_unix_fds(struct task_struct *p)
 
 static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 {
-	int i;
-
 	if (too_many_unix_fds(current))
 		return -ETOOMANYREFS;
 
@@ -1806,9 +1802,6 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	if (!UNIXCB(skb).fp)
 		return -ENOMEM;
 
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_inflight(scm->fp->user, scm->fp->fp[i]);
-
 	if (unix_prepare_fpl(UNIXCB(skb).fp))
 		return -ENOMEM;
 
@@ -1817,15 +1810,10 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 
 static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 {
-	int i;
-
 	scm->fp = UNIXCB(skb).fp;
 	UNIXCB(skb).fp = NULL;
 
 	unix_destroy_fpl(scm->fp);
-
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
 }
 
 static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 684e2c843b1e..8c0a4daf8fb3 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -322,6 +322,52 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
 	return true;
 }
 
+enum unix_recv_queue_lock_class {
+	U_RECVQ_LOCK_NORMAL,
+	U_RECVQ_LOCK_EMBRYO,
+};
+
+static void unix_collect_skb(struct list_head *scc, struct sk_buff_head *hitlist)
+{
+	struct unix_vertex *vertex;
+
+	list_for_each_entry_reverse(vertex, scc, scc_entry) {
+		struct sk_buff_head *queue;
+		struct unix_edge *edge;
+		struct unix_sock *u;
+
+		edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
+		u = edge->predecessor;
+		queue = &u->sk.sk_receive_queue;
+
+		spin_lock(&queue->lock);
+
+		if (u->sk.sk_state == TCP_LISTEN) {
+			struct sk_buff *skb;
+
+			skb_queue_walk(queue, skb) {
+				struct sk_buff_head *embryo_queue = &skb->sk->sk_receive_queue;
+
+				/* listener -> embryo order, the inversion never happens. */
+				spin_lock_nested(&embryo_queue->lock, U_RECVQ_LOCK_EMBRYO);
+				skb_queue_splice_init(embryo_queue, hitlist);
+				spin_unlock(&embryo_queue->lock);
+			}
+		} else {
+			skb_queue_splice_init(queue, hitlist);
+
+#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+			if (u->oob_skb) {
+				kfree_skb(u->oob_skb);
+				u->oob_skb = NULL;
+			}
+#endif
+		}
+
+		spin_unlock(&queue->lock);
+	}
+}
+
 static bool unix_scc_cyclic(struct list_head *scc)
 {
 	struct unix_vertex *vertex;
@@ -345,7 +391,8 @@ static bool unix_scc_cyclic(struct list_head *scc)
 static LIST_HEAD(unix_visited_vertices);
 static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
 
-static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_index)
+static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_index,
+			    struct sk_buff_head *hitlist)
 {
 	LIST_HEAD(vertex_stack);
 	struct unix_edge *edge;
@@ -430,7 +477,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_inde
 				scc_dead = unix_vertex_dead(vertex);
 		}
 
-		if (!unix_graph_maybe_cyclic)
+		if (scc_dead)
+			unix_collect_skb(&scc, hitlist);
+		else if (!unix_graph_maybe_cyclic)
 			unix_graph_maybe_cyclic = unix_scc_cyclic(&scc);
 
 		list_del(&scc);
@@ -441,7 +490,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_inde
 		goto prev_vertex;
 }
 
-static void unix_walk_scc(void)
+static void unix_walk_scc(struct sk_buff_head *hitlist)
 {
 	unsigned long last_index = UNIX_VERTEX_INDEX_START;
 
@@ -454,7 +503,7 @@ static void unix_walk_scc(void)
 		struct unix_vertex *vertex;
 
 		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
-		__unix_walk_scc(vertex, &last_index);
+		__unix_walk_scc(vertex, &last_index, hitlist);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
@@ -463,7 +512,7 @@ static void unix_walk_scc(void)
 	unix_graph_grouped = true;
 }
 
-static void unix_walk_scc_fast(void)
+static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
 {
 	while (!list_empty(&unix_unvisited_vertices)) {
 		struct unix_vertex *vertex;
@@ -480,263 +529,40 @@ static void unix_walk_scc_fast(void)
 				scc_dead = unix_vertex_dead(vertex);
 		}
 
+		if (scc_dead)
+			unix_collect_skb(&scc, hitlist);
+
 		list_del(&scc);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
 }
 
-static LIST_HEAD(gc_candidates);
-static LIST_HEAD(gc_inflight_list);
-
-/* Keep the number of times in flight count for the file
- * descriptor if it is for an AF_UNIX socket.
- */
-void unix_inflight(struct user_struct *user, struct file *filp)
-{
-	struct unix_sock *u = unix_get_socket(filp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (u) {
-		if (!u->inflight) {
-			WARN_ON_ONCE(!list_empty(&u->link));
-			list_add_tail(&u->link, &gc_inflight_list);
-		} else {
-			WARN_ON_ONCE(list_empty(&u->link));
-		}
-		u->inflight++;
-	}
-
-	spin_unlock(&unix_gc_lock);
-}
-
-void unix_notinflight(struct user_struct *user, struct file *filp)
-{
-	struct unix_sock *u = unix_get_socket(filp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (u) {
-		WARN_ON_ONCE(!u->inflight);
-		WARN_ON_ONCE(list_empty(&u->link));
-
-		u->inflight--;
-		if (!u->inflight)
-			list_del_init(&u->link);
-	}
-
-	spin_unlock(&unix_gc_lock);
-}
-
-static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
-			  struct sk_buff_head *hitlist)
-{
-	struct sk_buff *skb;
-	struct sk_buff *next;
-
-	spin_lock(&x->sk_receive_queue.lock);
-	skb_queue_walk_safe(&x->sk_receive_queue, skb, next) {
-		/* Do we have file descriptors ? */
-		if (UNIXCB(skb).fp) {
-			bool hit = false;
-			/* Process the descriptors of this socket */
-			int nfd = UNIXCB(skb).fp->count;
-			struct file **fp = UNIXCB(skb).fp->fp;
-
-			while (nfd--) {
-				/* Get the socket the fd matches if it indeed does so */
-				struct unix_sock *u = unix_get_socket(*fp++);
-
-				/* Ignore non-candidates, they could have been added
-				 * to the queues after starting the garbage collection
-				 */
-				if (u && test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
-					hit = true;
-
-					func(u);
-				}
-			}
-			if (hit && hitlist != NULL) {
-				__skb_unlink(skb, &x->sk_receive_queue);
-				__skb_queue_tail(hitlist, skb);
-			}
-		}
-	}
-	spin_unlock(&x->sk_receive_queue.lock);
-}
-
-static void scan_children(struct sock *x, void (*func)(struct unix_sock *),
-			  struct sk_buff_head *hitlist)
-{
-	if (x->sk_state != TCP_LISTEN) {
-		scan_inflight(x, func, hitlist);
-	} else {
-		struct sk_buff *skb;
-		struct sk_buff *next;
-		struct unix_sock *u;
-		LIST_HEAD(embryos);
-
-		/* For a listening socket collect the queued embryos
-		 * and perform a scan on them as well.
-		 */
-		spin_lock(&x->sk_receive_queue.lock);
-		skb_queue_walk_safe(&x->sk_receive_queue, skb, next) {
-			u = unix_sk(skb->sk);
-
-			/* An embryo cannot be in-flight, so it's safe
-			 * to use the list link.
-			 */
-			WARN_ON_ONCE(!list_empty(&u->link));
-			list_add_tail(&u->link, &embryos);
-		}
-		spin_unlock(&x->sk_receive_queue.lock);
-
-		while (!list_empty(&embryos)) {
-			u = list_entry(embryos.next, struct unix_sock, link);
-			scan_inflight(&u->sk, func, hitlist);
-			list_del_init(&u->link);
-		}
-	}
-}
-
-static void dec_inflight(struct unix_sock *usk)
-{
-	usk->inflight--;
-}
-
-static void inc_inflight(struct unix_sock *usk)
-{
-	usk->inflight++;
-}
-
-static void inc_inflight_move_tail(struct unix_sock *u)
-{
-	u->inflight++;
-
-	/* If this still might be part of a cycle, move it to the end
-	 * of the list, so that it's checked even if it was already
-	 * passed over
-	 */
-	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
-		list_move_tail(&u->link, &gc_candidates);
-}
-
 static bool gc_in_progress;
 
 static void __unix_gc(struct work_struct *work)
 {
 	struct sk_buff_head hitlist;
-	struct unix_sock *u, *next;
-	LIST_HEAD(not_cycle_list);
-	struct list_head cursor;
 
 	spin_lock(&unix_gc_lock);
 
-	if (!unix_graph_maybe_cyclic)
+	if (!unix_graph_maybe_cyclic) {
+		spin_unlock(&unix_gc_lock);
 		goto skip_gc;
-
-	if (unix_graph_grouped)
-		unix_walk_scc_fast();
-	else
-		unix_walk_scc();
-
-	/* First, select candidates for garbage collection.  Only
-	 * in-flight sockets are considered, and from those only ones
-	 * which don't have any external reference.
-	 *
-	 * Holding unix_gc_lock will protect these candidates from
-	 * being detached, and hence from gaining an external
-	 * reference.  Since there are no possible receivers, all
-	 * buffers currently on the candidates' queues stay there
-	 * during the garbage collection.
-	 *
-	 * We also know that no new candidate can be added onto the
-	 * receive queues.  Other, non candidate sockets _can_ be
-	 * added to queue, so we must make sure only to touch
-	 * candidates.
-	 */
-	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
-		long total_refs;
-
-		total_refs = file_count(u->sk.sk_socket->file);
-
-		WARN_ON_ONCE(!u->inflight);
-		WARN_ON_ONCE(total_refs < u->inflight);
-		if (total_refs == u->inflight) {
-			list_move_tail(&u->link, &gc_candidates);
-			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
-		}
-	}
-
-	/* Now remove all internal in-flight reference to children of
-	 * the candidates.
-	 */
-	list_for_each_entry(u, &gc_candidates, link)
-		scan_children(&u->sk, dec_inflight, NULL);
-
-	/* Restore the references for children of all candidates,
-	 * which have remaining references.  Do this recursively, so
-	 * only those remain, which form cyclic references.
-	 *
-	 * Use a "cursor" link, to make the list traversal safe, even
-	 * though elements might be moved about.
-	 */
-	list_add(&cursor, &gc_candidates);
-	while (cursor.next != &gc_candidates) {
-		u = list_entry(cursor.next, struct unix_sock, link);
-
-		/* Move cursor to after the current position. */
-		list_move(&cursor, &u->link);
-
-		if (u->inflight) {
-			list_move_tail(&u->link, &not_cycle_list);
-			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
-			scan_children(&u->sk, inc_inflight_move_tail, NULL);
-		}
 	}
-	list_del(&cursor);
 
-	/* Now gc_candidates contains only garbage.  Restore original
-	 * inflight counters for these as well, and remove the skbuffs
-	 * which are creating the cycle(s).
-	 */
-	skb_queue_head_init(&hitlist);
-	list_for_each_entry(u, &gc_candidates, link) {
-		scan_children(&u->sk, inc_inflight, &hitlist);
+	__skb_queue_head_init(&hitlist);
 
-#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
-		if (u->oob_skb) {
-			kfree_skb(u->oob_skb);
-			u->oob_skb = NULL;
-		}
-#endif
-	}
-
-	/* not_cycle_list contains those sockets which do not make up a
-	 * cycle.  Restore these to the inflight list.
-	 */
-	while (!list_empty(&not_cycle_list)) {
-		u = list_entry(not_cycle_list.next, struct unix_sock, link);
-		__clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-		list_move_tail(&u->link, &gc_inflight_list);
-	}
+	if (unix_graph_grouped)
+		unix_walk_scc_fast(&hitlist);
+	else
+		unix_walk_scc(&hitlist);
 
 	spin_unlock(&unix_gc_lock);
 
-	/* Here we are. Hitlist is filled. Die. */
 	__skb_queue_purge(&hitlist);
-
-	spin_lock(&unix_gc_lock);
-
-	/* All candidates should have been detached by now. */
-	WARN_ON_ONCE(!list_empty(&gc_candidates));
 skip_gc:
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
 	WRITE_ONCE(gc_in_progress, false);
-
-	spin_unlock(&unix_gc_lock);
 }
 
 static DECLARE_WORK(unix_gc_work, __unix_gc);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 net-next 15/15] selftest: af_unix: Test GC for SCM_RIGHTS.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (13 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 14/15] af_unix: Replace garbage collection algorithm Kuniyuki Iwashima
@ 2024-03-01  2:22 ` Kuniyuki Iwashima
  2024-03-04 16:18 ` [PATCH v4 net-next 00/15] af_unix: Rework GC Paolo Abeni
  15 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-01  2:22 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

This patch adds test cases to verify the new GC.

We run each test for the following cases:

  * SOCK_DGRAM
  * SOCK_STREAM without embryo socket
  * SOCK_STREAM without embryo socket + MSG_OOB
  * SOCK_STREAM with embryo sockets
  * SOCK_STREAM with embryo sockets + MSG_OOB

Before and after running each test case, we ensure that there is
no AF_UNIX socket left in the netns by reading /proc/net/protocols.

We cannot use /proc/net/unix and UNIX_DIAG because the embryo socket
does not show up there.

Each test creates multiple sockets in an array.  We pass sockets in
the even index using the peer sockets in the odd index.

So, send_fd(0, 1) actually sends fd[0] to fd[2] via fd[0 + 1].

  Test 1 : A <-> A
  Test 2 : A <-> B
  Test 3 : A -> B -> C <- D
           ^.___|___.'    ^
                `---------'

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 tools/testing/selftests/net/.gitignore        |   1 +
 tools/testing/selftests/net/af_unix/Makefile  |   2 +-
 .../selftests/net/af_unix/scm_rights.c        | 286 ++++++++++++++++++
 3 files changed, 288 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 2f9d378edec3..d996a0ab0765 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -31,6 +31,7 @@ reuseport_dualstack
 rxtimestamp
 sctp_hello
 scm_pidfd
+scm_rights
 sk_bind_sendto_listen
 sk_connect_zero_addr
 socket
diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile
index 221c387a7d7f..3b83c797650d 100644
--- a/tools/testing/selftests/net/af_unix/Makefile
+++ b/tools/testing/selftests/net/af_unix/Makefile
@@ -1,4 +1,4 @@
 CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd
+TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd scm_rights
 
 include ../../lib.mk
diff --git a/tools/testing/selftests/net/af_unix/scm_rights.c b/tools/testing/selftests/net/af_unix/scm_rights.c
new file mode 100644
index 000000000000..bab606c9f1eb
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+
+#include "../../kselftest_harness.h"
+
+FIXTURE(scm_rights)
+{
+	int fd[16];
+};
+
+FIXTURE_VARIANT(scm_rights)
+{
+	char name[16];
+	int type;
+	int flags;
+	bool test_listener;
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, dgram)
+{
+	.name = "UNIX ",
+	.type = SOCK_DGRAM,
+	.flags = 0,
+	.test_listener = false,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = 0,
+	.test_listener = false,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream_oob)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = MSG_OOB,
+	.test_listener = false,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream_listener)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = 0,
+	.test_listener = true,
+};
+
+FIXTURE_VARIANT_ADD(scm_rights, stream_listener_oob)
+{
+	.name = "UNIX-STREAM ",
+	.type = SOCK_STREAM,
+	.flags = MSG_OOB,
+	.test_listener = true,
+};
+
+static int count_sockets(struct __test_metadata *_metadata,
+			 const FIXTURE_VARIANT(scm_rights) *variant)
+{
+	int sockets = -1, len, ret;
+	char *line = NULL;
+	size_t unused;
+	FILE *f;
+
+	f = fopen("/proc/net/protocols", "r");
+	ASSERT_NE(NULL, f);
+
+	len = strlen(variant->name);
+
+	while (getline(&line, &unused, f) != -1) {
+		int unused2;
+
+		if (strncmp(line, variant->name, len))
+			continue;
+
+		ret = sscanf(line + len, "%d %d", &unused2, &sockets);
+		ASSERT_EQ(2, ret);
+
+		break;
+	}
+
+	free(line);
+
+	ret = fclose(f);
+	ASSERT_EQ(0, ret);
+
+	return sockets;
+}
+
+FIXTURE_SETUP(scm_rights)
+{
+	int ret;
+
+	ret = unshare(CLONE_NEWNET);
+	ASSERT_EQ(0, ret);
+
+	ret = count_sockets(_metadata, variant);
+	ASSERT_EQ(0, ret);
+}
+
+FIXTURE_TEARDOWN(scm_rights)
+{
+	int ret;
+
+	sleep(1);
+
+	ret = count_sockets(_metadata, variant);
+	ASSERT_EQ(0, ret);
+}
+
+static void create_listeners(struct __test_metadata *_metadata,
+			     FIXTURE_DATA(scm_rights) *self,
+			     int n)
+{
+	struct sockaddr_un addr = {
+		.sun_family = AF_UNIX,
+	};
+	socklen_t addrlen;
+	int i, ret;
+
+	for (i = 0; i < n * 2; i += 2) {
+		self->fd[i] = socket(AF_UNIX, SOCK_STREAM, 0);
+		ASSERT_LE(0, self->fd[i]);
+
+		addrlen = sizeof(addr.sun_family);
+		ret = bind(self->fd[i], (struct sockaddr *)&addr, addrlen);
+		ASSERT_EQ(0, ret);
+
+		ret = listen(self->fd[i], -1);
+		ASSERT_EQ(0, ret);
+
+		addrlen = sizeof(addr);
+		ret = getsockname(self->fd[i], (struct sockaddr *)&addr, &addrlen);
+		ASSERT_EQ(0, ret);
+
+		self->fd[i + 1] = socket(AF_UNIX, SOCK_STREAM, 0);
+		ASSERT_LE(0, self->fd[i + 1]);
+
+		ret = connect(self->fd[i + 1], (struct sockaddr *)&addr, addrlen);
+		ASSERT_EQ(0, ret);
+	}
+}
+
+static void create_socketpairs(struct __test_metadata *_metadata,
+			       FIXTURE_DATA(scm_rights) *self,
+			       const FIXTURE_VARIANT(scm_rights) *variant,
+			       int n)
+{
+	int i, ret;
+
+	ASSERT_GE(sizeof(self->fd) / sizeof(int), n);
+
+	for (i = 0; i < n * 2; i += 2) {
+		ret = socketpair(AF_UNIX, variant->type, 0, self->fd + i);
+		ASSERT_EQ(0, ret);
+	}
+}
+
+static void __create_sockets(struct __test_metadata *_metadata,
+			     FIXTURE_DATA(scm_rights) *self,
+			     const FIXTURE_VARIANT(scm_rights) *variant,
+			     int n)
+{
+	if (variant->test_listener)
+		create_listeners(_metadata, self, n);
+	else
+		create_socketpairs(_metadata, self, variant, n);
+}
+
+static void __close_sockets(struct __test_metadata *_metadata,
+			    FIXTURE_DATA(scm_rights) *self,
+			    int n)
+{
+	int i, ret;
+
+	ASSERT_GE(sizeof(self->fd) / sizeof(int), n);
+
+	for (i = 0; i < n * 2; i++) {
+		ret = close(self->fd[i]);
+		ASSERT_EQ(0, ret);
+	}
+}
+
+void __send_fd(struct __test_metadata *_metadata,
+	       const FIXTURE_DATA(scm_rights) *self,
+	       const FIXTURE_VARIANT(scm_rights) *variant,
+	       int inflight, int receiver)
+{
+#define MSG "nop"
+#define MSGLEN 3
+	struct {
+		struct cmsghdr cmsghdr;
+		int fd[2];
+	} cmsg = {
+		.cmsghdr = {
+			.cmsg_len = CMSG_LEN(sizeof(cmsg.fd)),
+			.cmsg_level = SOL_SOCKET,
+			.cmsg_type = SCM_RIGHTS,
+		},
+		.fd = {
+			self->fd[inflight * 2],
+			self->fd[inflight * 2],
+		},
+	};
+	struct iovec iov = {
+		.iov_base = MSG,
+		.iov_len = MSGLEN,
+	};
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = &cmsg,
+		.msg_controllen = CMSG_SPACE(sizeof(cmsg.fd)),
+	};
+	int ret;
+
+	ret = sendmsg(self->fd[receiver * 2 + 1], &msg, variant->flags);
+	ASSERT_EQ(MSGLEN, ret);
+}
+
+#define create_sockets(n)					\
+	__create_sockets(_metadata, self, variant, n)
+#define close_sockets(n)					\
+	__close_sockets(_metadata, self, n)
+#define send_fd(inflight, receiver)				\
+	__send_fd(_metadata, self, variant, inflight, receiver)
+
+TEST_F(scm_rights, self_ref)
+{
+	create_sockets(2);
+
+	send_fd(0, 0);
+
+	send_fd(1, 1);
+
+	close_sockets(2);
+}
+
+TEST_F(scm_rights, triangle)
+{
+	create_sockets(6);
+
+	send_fd(0, 1);
+	send_fd(1, 2);
+	send_fd(2, 0);
+
+	send_fd(3, 4);
+	send_fd(4, 5);
+	send_fd(5, 3);
+
+	close_sockets(6);
+}
+
+TEST_F(scm_rights, cross_edge)
+{
+	create_sockets(8);
+
+	send_fd(0, 1);
+	send_fd(1, 2);
+	send_fd(2, 0);
+	send_fd(1, 3);
+	send_fd(3, 2);
+
+	send_fd(4, 5);
+	send_fd(5, 6);
+	send_fd(6, 4);
+	send_fd(5, 7);
+	send_fd(7, 6);
+
+	close_sockets(8);
+}
+
+TEST_HARNESS_MAIN
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 00/15] af_unix: Rework GC.
  2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
                   ` (14 preceding siblings ...)
  2024-03-01  2:22 ` [PATCH v4 net-next 15/15] selftest: af_unix: Test GC for SCM_RIGHTS Kuniyuki Iwashima
@ 2024-03-04 16:18 ` Paolo Abeni
  2024-03-04 17:31   ` Kuniyuki Iwashima
  15 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2024-03-04 16:18 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> When we pass a file descriptor to an AF_UNIX socket via SCM_RIGTHS,
> the underlying struct file of the inflight fd gets its refcount bumped.
> If the fd is of an AF_UNIX socket, we need to track it in case it forms
> cyclic references.
> 
> Let's say we send a fd of AF_UNIX socket A to B and vice versa and
> close() both sockets.
> 
> When created, each socket's struct file initially has one reference.
> After the fd exchange, both refcounts are bumped up to 2.  Then, close()
> decreases both to 1.  From this point on, no one can touch the file/socket.
> 
> However, the struct file has one refcount and thus never calls the
> release() function of the AF_UNIX socket.
> 
> That's why we need to track all inflight AF_UNIX sockets and run garbage
> collection.
> 
> This series replaces the current GC implementation that locks each inflight
> socket's receive queue and requires trickiness in other places.
> 
> The new GC does not lock each socket's queue to minimise its effect and
> tries to be lightweight if there is no cyclic reference or no update in
> the shape of the inflight fd graph.
> 
> The new implementation is based on Tarjan's Strongly Connected Components
> algorithm, and we will consider each inflight AF_UNIX socket as a vertex
> and its file descriptor as an edge in a directed graph.
> 
> For the details, please see each patch.
> 
>   patch 1  -  3 : Add struct to express inflight socket graphs
>   patch       4 : Optimse inflight fd counting
>   patch 5  -  6 : Group SCC possibly forming a cycle
>   patch 7  -  8 : Support embryo socket
>   patch 9  - 11 : Make GC lightweight
>   patch 12 - 13 : Detect dead cycle references
>   patch      14 : Replace GC algorithm
>   patch      15 : selftest
> 
> After this series is applied, we can remove the two ugly tricks for race,
> scm_fp_dup() in unix_attach_fds() and spin_lock dance in unix_peek_fds()
> as done in patch 14/15 of v1.

I plan to have a better look tomorrow.

Generally speaking we have a timing problem. This looks great, but also
quite complex, and thus the potential for untrivial regressions. 

We are very late in this cycle and likely there will not be rc8, would
you be ok to eventually postpone this to the next cycle?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 00/15] af_unix: Rework GC.
  2024-03-04 16:18 ` [PATCH v4 net-next 00/15] af_unix: Rework GC Paolo Abeni
@ 2024-03-04 17:31   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-04 17:31 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Mon, 04 Mar 2024 17:18:32 +0100
> On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> > When we pass a file descriptor to an AF_UNIX socket via SCM_RIGTHS,
> > the underlying struct file of the inflight fd gets its refcount bumped.
> > If the fd is of an AF_UNIX socket, we need to track it in case it forms
> > cyclic references.
> > 
> > Let's say we send a fd of AF_UNIX socket A to B and vice versa and
> > close() both sockets.
> > 
> > When created, each socket's struct file initially has one reference.
> > After the fd exchange, both refcounts are bumped up to 2.  Then, close()
> > decreases both to 1.  From this point on, no one can touch the file/socket.
> > 
> > However, the struct file has one refcount and thus never calls the
> > release() function of the AF_UNIX socket.
> > 
> > That's why we need to track all inflight AF_UNIX sockets and run garbage
> > collection.
> > 
> > This series replaces the current GC implementation that locks each inflight
> > socket's receive queue and requires trickiness in other places.
> > 
> > The new GC does not lock each socket's queue to minimise its effect and
> > tries to be lightweight if there is no cyclic reference or no update in
> > the shape of the inflight fd graph.
> > 
> > The new implementation is based on Tarjan's Strongly Connected Components
> > algorithm, and we will consider each inflight AF_UNIX socket as a vertex
> > and its file descriptor as an edge in a directed graph.
> > 
> > For the details, please see each patch.
> > 
> >   patch 1  -  3 : Add struct to express inflight socket graphs
> >   patch       4 : Optimse inflight fd counting
> >   patch 5  -  6 : Group SCC possibly forming a cycle
> >   patch 7  -  8 : Support embryo socket
> >   patch 9  - 11 : Make GC lightweight
> >   patch 12 - 13 : Detect dead cycle references
> >   patch      14 : Replace GC algorithm
> >   patch      15 : selftest
> > 
> > After this series is applied, we can remove the two ugly tricks for race,
> > scm_fp_dup() in unix_attach_fds() and spin_lock dance in unix_peek_fds()
> > as done in patch 14/15 of v1.
> 
> I plan to have a better look tomorrow.

Thanks for your time!

> 
> Generally speaking we have a timing problem. This looks great, but also
> quite complex, and thus the potential for untrivial regressions. 
> 
> We are very late in this cycle and likely there will not be rc8, would
> you be ok to eventually postpone this to the next cycle?

No problem at all, I was also thinking about that after reading the
announcement :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 12/15] af_unix: Assign a unique index to SCC.
  2024-03-01  2:22 ` [PATCH v4 net-next 12/15] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
@ 2024-03-05  8:44   ` Paolo Abeni
  2024-03-06 20:59     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2024-03-05  8:44 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> The definition of the lowlink in Tarjan's algorithm is the
> smallest index of a vertex that is reachable with at most one
> back-edge in SCC.  This is not useful for a cross-edge.
> 
> If we start traversing from A in the following graph, the final
> lowlink of D is 3.  The cross-edge here is one between D and C.
> 
>   A -> B -> D   D = (4, 3)  (index, lowlink)
>   ^    |    |   C = (3, 1)
>   |    V    |   B = (2, 1)
>   `--- C <--'   A = (1, 1)
> 
> This is because the lowlink of D is updated with the index of C.
> 
> In the following patch, we detect a dead SCC by checking two
> conditions for each vertex.
> 
>   1) vertex has no edge directed to another SCC (no bridge)
>   2) vertex's out_degree is the same as the refcount of its file
> 
> If 1) is false, there is a receiver of all fds of the SCC and
> its ancestor SCC.
> 
> To evaluate 1), we need to assign a unique index to each SCC and
> assign it to all vertices in the SCC.
> 
> This patch changes the lowlink update logic for cross-edge so
> that in the example above, the lowlink of D is updated with the
> lowlink of C.
> 
>   A -> B -> D   D = (4, 1)  (index, lowlink)
>   ^    |    |   C = (3, 1)
>   |    V    |   B = (2, 1)
>   `--- C <--'   A = (1, 1)
> 
> Then, all vertices in the same SCC have the same lowlink, and we
> can quickly find the bridge connecting to different SCC if exists.
> 
> However, it is no longer called lowlink, so we rename it to
> scc_index.  (It's sometimes called lowpoint.)

I'm wondering if there is any reference to this variation of Tarjan's
algorithm you can point, to help understanding, future memory,
reviewing.

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 02/15] af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.
  2024-03-01  2:22 ` [PATCH v4 net-next 02/15] af_unix: Allocate struct unix_edge " Kuniyuki Iwashima
@ 2024-03-05  8:47   ` Paolo Abeni
  2024-03-06 19:44     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2024-03-05  8:47 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> As with the previous patch, we preallocate to skb's scm_fp_list an
> array of struct unix_edge in the number of inflight AF_UNIX fds.
> 
> There we just preallocate memory and do not use immediately because
> sendmsg() could fail after this point.  The actual use will be in
> the next patch.
> 
> When we queue skb with inflight edges, we will set the inflight
> socket's unix_sock as unix_edge->predecessor and the receiver's
> unix_sock as successor, and then we will link the edge to the
> inflight socket's unix_vertex.edges.
> 
> Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup()
> so that MSG_PEEK does not change the shape of the directed graph.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/af_unix.h | 6 ++++++
>  include/net/scm.h     | 5 +++++
>  net/core/scm.c        | 2 ++
>  net/unix/garbage.c    | 6 ++++++
>  4 files changed, 19 insertions(+)
> 
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index c270877a5256..55c4abc26a71 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -33,6 +33,12 @@ struct unix_vertex {
>  	unsigned long out_degree;
>  };
>  
> +struct unix_edge {
> +	struct unix_sock *predecessor;
> +	struct unix_sock *successor;
> +	struct list_head vertex_entry;
> +};
> +
>  struct sock *unix_peer_get(struct sock *sk);
>  
>  #define UNIX_HASH_MOD	(256 - 1)
> diff --git a/include/net/scm.h b/include/net/scm.h
> index e34321b6e204..5f5154e5096d 100644
> --- a/include/net/scm.h
> +++ b/include/net/scm.h
> @@ -23,12 +23,17 @@ struct scm_creds {
>  	kgid_t	gid;
>  };
>  
> +#ifdef CONFIG_UNIX
> +struct unix_edge;
> +#endif
> +
>  struct scm_fp_list {
>  	short			count;
>  	short			count_unix;
>  	short			max;
>  #ifdef CONFIG_UNIX
>  	struct list_head	vertices;
> +	struct unix_edge	*edges;
>  #endif
>  	struct user_struct	*user;
>  	struct file		*fp[SCM_MAX_FD];
> diff --git a/net/core/scm.c b/net/core/scm.c
> index 87dfec1c3378..1bcc8a2d65e3 100644
> --- a/net/core/scm.c
> +++ b/net/core/scm.c
> @@ -90,6 +90,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
>  		fpl->max = SCM_MAX_FD;
>  		fpl->user = NULL;
>  #if IS_ENABLED(CONFIG_UNIX)
> +		fpl->edges = NULL;
>  		INIT_LIST_HEAD(&fpl->vertices);
>  #endif
>  	}
> @@ -383,6 +384,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
>  		new_fpl->max = new_fpl->count;
>  		new_fpl->user = get_uid(fpl->user);
>  #if IS_ENABLED(CONFIG_UNIX)
> +		new_fpl->edges = NULL;
>  		INIT_LIST_HEAD(&new_fpl->vertices);
>  #endif
>  	}
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 75bdf66b81df..f31917683288 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -127,6 +127,11 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
>  		list_add(&vertex->entry, &fpl->vertices);
>  	}
>  
> +	fpl->edges = kvmalloc_array(fpl->count_unix, sizeof(*fpl->edges),
> +				    GFP_KERNEL_ACCOUNT);

If I read correctly, the total amount of additionally memory used will
be proportional to vertices num + edges num. Can you please provide a
the real figures for some reasonable unix fd numbers (say a small
table), to verify this memory usage is reasonable?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS.
  2024-03-01  2:22 ` [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS Kuniyuki Iwashima
@ 2024-03-05  8:53   ` Paolo Abeni
  2024-03-06 20:14     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2024-03-05  8:53 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> The new GC will use a depth first search graph algorithm to find
> cyclic references.  The algorithm visits every vertex exactly once.
> 
> Here, we implement its DFS part without recursion so that no one
> can abuse it.
> 
> unix_walk_scc() marks every vertex unvisited by initialising index
> as UNIX_VERTEX_INDEX_UNVISITED, iterates inflight vertices in
> unix_unvisited_vertices, and call __unix_walk_scc() to start DFS
> from an arbitrary vertex.
> 
> __unix_walk_scc() iterates all edges starting from the vertex and
> explores the neighbour vertices with DFS using edge_stack.
> 
> After visiting all neighbours, __unix_walk_scc() moves the visited
> vertex to unix_visited_vertices so that unix_walk_scc() will not
> restart DFS from the visited vertex.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/af_unix.h |  2 ++
>  net/unix/garbage.c    | 74 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 76 insertions(+)
> 
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index f31ad1166346..970a91da2239 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -33,12 +33,14 @@ struct unix_vertex {
>  	struct list_head edges;
>  	struct list_head entry;
>  	unsigned long out_degree;
> +	unsigned long index;
>  };
>  
>  struct unix_edge {
>  	struct unix_sock *predecessor;
>  	struct unix_sock *successor;
>  	struct list_head vertex_entry;
> +	struct list_head stack_entry;
>  };
>  
>  struct sock *unix_peer_get(struct sock *sk);
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 84c8ea524a98..8b16ab9e240e 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -103,6 +103,11 @@ struct unix_sock *unix_get_socket(struct file *filp)
>  
>  static LIST_HEAD(unix_unvisited_vertices);
>  
> +enum unix_vertex_index {
> +	UNIX_VERTEX_INDEX_UNVISITED,
> +	UNIX_VERTEX_INDEX_START,
> +};
> +
>  static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
>  {
>  	struct unix_vertex *vertex = edge->predecessor->vertex;
> @@ -241,6 +246,73 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
>  	unix_free_vertices(fpl);
>  }
>  
> +static LIST_HEAD(unix_visited_vertices);
> +
> +static void __unix_walk_scc(struct unix_vertex *vertex)
> +{
> +	unsigned long index = UNIX_VERTEX_INDEX_START;
> +	struct unix_edge *edge;
> +	LIST_HEAD(edge_stack);
> +
> +next_vertex:
> +	vertex->index = index;
> +	index++;
> +
> +	/* Explore neighbour vertices (receivers of the current vertex's fd). */
> +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> +		struct unix_vertex *next_vertex = edge->successor->vertex;
> +
> +		if (!next_vertex)
> +			continue;
> +
> +		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
> +			/* Iterative deepening depth first search
> +			 *
> +			 *   1. Push a forward edge to edge_stack and set
> +			 *      the successor to vertex for the next iteration.
> +			 */
> +			list_add(&edge->stack_entry, &edge_stack);
> +
> +			vertex = next_vertex;
> +			goto next_vertex;
> +
> +			/*   2. Pop the edge directed to the current vertex
> +			 *      and restore the ancestor for backtracking.
> +			 */
> +prev_vertex:
> +			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
> +			list_del_init(&edge->stack_entry);
> +
> +			vertex = edge->predecessor->vertex;
> +		}
> +	}
> +
> +	/* Don't restart DFS from this vertex in unix_walk_scc(). */
> +	list_move_tail(&vertex->entry, &unix_visited_vertices);
> +
> +	/* Need backtracking ? */
> +	if (!list_empty(&edge_stack))
> +		goto prev_vertex;
> +}
> +
> +static void unix_walk_scc(void)
> +{
> +	struct unix_vertex *vertex;
> +
> +	list_for_each_entry(vertex, &unix_unvisited_vertices, entry)
> +		vertex->index = UNIX_VERTEX_INDEX_UNVISITED;
> +
> +	/* Visit every vertex exactly once.
> +	 * __unix_walk_scc() moves visited vertices to unix_visited_vertices.
> +	 */
> +	while (!list_empty(&unix_unvisited_vertices)) {
> +		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
> +		__unix_walk_scc(vertex);

If I read correctly this is the single loop that could use the CPU for
a possibly long time. I'm wondering if adding a cond_resched_lock()
here (and possibly move the gc to a specific thread instead of a
workqueue) would make sense.

It could avoid possibly very long starvation of the system wq and the
used CPU.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 02/15] af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.
  2024-03-05  8:47   ` Paolo Abeni
@ 2024-03-06 19:44     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-06 19:44 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 05 Mar 2024 09:47:04 +0100
> On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> > As with the previous patch, we preallocate to skb's scm_fp_list an
> > array of struct unix_edge in the number of inflight AF_UNIX fds.
> > 
> > There we just preallocate memory and do not use immediately because
> > sendmsg() could fail after this point.  The actual use will be in
> > the next patch.
> > 
> > When we queue skb with inflight edges, we will set the inflight
> > socket's unix_sock as unix_edge->predecessor and the receiver's
> > unix_sock as successor, and then we will link the edge to the
> > inflight socket's unix_vertex.edges.
> > 
> > Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup()
> > so that MSG_PEEK does not change the shape of the directed graph.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/af_unix.h | 6 ++++++
> >  include/net/scm.h     | 5 +++++
> >  net/core/scm.c        | 2 ++
> >  net/unix/garbage.c    | 6 ++++++
> >  4 files changed, 19 insertions(+)
> > 
> > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > index c270877a5256..55c4abc26a71 100644
> > --- a/include/net/af_unix.h
> > +++ b/include/net/af_unix.h
> > @@ -33,6 +33,12 @@ struct unix_vertex {
> >  	unsigned long out_degree;
> >  };
> >  
> > +struct unix_edge {
> > +	struct unix_sock *predecessor;
> > +	struct unix_sock *successor;
> > +	struct list_head vertex_entry;
> > +};
> > +
> >  struct sock *unix_peer_get(struct sock *sk);
> >  
> >  #define UNIX_HASH_MOD	(256 - 1)
> > diff --git a/include/net/scm.h b/include/net/scm.h
> > index e34321b6e204..5f5154e5096d 100644
> > --- a/include/net/scm.h
> > +++ b/include/net/scm.h
> > @@ -23,12 +23,17 @@ struct scm_creds {
> >  	kgid_t	gid;
> >  };
> >  
> > +#ifdef CONFIG_UNIX
> > +struct unix_edge;
> > +#endif
> > +
> >  struct scm_fp_list {
> >  	short			count;
> >  	short			count_unix;
> >  	short			max;
> >  #ifdef CONFIG_UNIX
> >  	struct list_head	vertices;
> > +	struct unix_edge	*edges;
> >  #endif
> >  	struct user_struct	*user;
> >  	struct file		*fp[SCM_MAX_FD];
> > diff --git a/net/core/scm.c b/net/core/scm.c
> > index 87dfec1c3378..1bcc8a2d65e3 100644
> > --- a/net/core/scm.c
> > +++ b/net/core/scm.c
> > @@ -90,6 +90,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
> >  		fpl->max = SCM_MAX_FD;
> >  		fpl->user = NULL;
> >  #if IS_ENABLED(CONFIG_UNIX)
> > +		fpl->edges = NULL;
> >  		INIT_LIST_HEAD(&fpl->vertices);
> >  #endif
> >  	}
> > @@ -383,6 +384,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
> >  		new_fpl->max = new_fpl->count;
> >  		new_fpl->user = get_uid(fpl->user);
> >  #if IS_ENABLED(CONFIG_UNIX)
> > +		new_fpl->edges = NULL;
> >  		INIT_LIST_HEAD(&new_fpl->vertices);
> >  #endif
> >  	}
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 75bdf66b81df..f31917683288 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -127,6 +127,11 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
> >  		list_add(&vertex->entry, &fpl->vertices);
> >  	}
> >  
> > +	fpl->edges = kvmalloc_array(fpl->count_unix, sizeof(*fpl->edges),
> > +				    GFP_KERNEL_ACCOUNT);
> 
> If I read correctly, the total amount of additionally memory used will
> be proportional to vertices num + edges num.

Correct.


> Can you please provide a
> the real figures for some reasonable unix fd numbers (say a small
> table), to verify this memory usage is reasonable?

First, regarding per-sock change.

Before this series, unix_sock is 1920 bytes.
After this series, unix_sock is 1856 bytes, and unix_vertex is 72 bytes.

The delta is (1856 + 72) - 1920 = 8 bytes.

But, there is 8-bytes hole in unix_sock after the series, and the last
oob_skb can be moved there.  Then, the size decrease 4 bytes in total.
So, we can ignore the delta with a follow-up patch moving oob_skb (and
other fields) to pack unix_sock.

---8<---
$ pahole -C unix_sock vmlinux
struct unix_sock {
...
	spinlock_t                 lock;                 /*  1592    64 */

	/* XXX 8 bytes hole, try to pack */
...
	/* XXX 4 bytes hole, try to pack */

	struct sk_buff *           oob_skb;              /*  1840     8 */

	/* size: 1856, cachelines: 29, members: 13 */
---8<---


Second, regarding unix_edge.

sk_buff is 224 bytes, and scm_fp_list, which has struct file
pointer with the fixed-size array fp[253], is 2064 bytes.

After this series, unix_edge (48 bytes) is added to each AF_UNIX
fds.

Before this series: 224 + 2064 = 2288 bytes
After this series : 2288 + 48 * n bytes

In our distro, systemd is the major user receiving AF_UNIX fd at
boot time, ssh login, etc, and each skb has only one AF_UNIX fd.
In that case, skb has 48 bytes (2%) increase in size.

If we fully populate skb with 253 AF_UNIX fds, it will have about
12KB increase, but this is unlikely.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS.
  2024-03-05  8:53   ` Paolo Abeni
@ 2024-03-06 20:14     ` Kuniyuki Iwashima
  2024-03-07  9:08       ` Paolo Abeni
  0 siblings, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-06 20:14 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 05 Mar 2024 09:53:17 +0100
> On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> > The new GC will use a depth first search graph algorithm to find
> > cyclic references.  The algorithm visits every vertex exactly once.
> > 
> > Here, we implement its DFS part without recursion so that no one
> > can abuse it.
> > 
> > unix_walk_scc() marks every vertex unvisited by initialising index
> > as UNIX_VERTEX_INDEX_UNVISITED, iterates inflight vertices in
> > unix_unvisited_vertices, and call __unix_walk_scc() to start DFS
> > from an arbitrary vertex.
> > 
> > __unix_walk_scc() iterates all edges starting from the vertex and
> > explores the neighbour vertices with DFS using edge_stack.
> > 
> > After visiting all neighbours, __unix_walk_scc() moves the visited
> > vertex to unix_visited_vertices so that unix_walk_scc() will not
> > restart DFS from the visited vertex.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/af_unix.h |  2 ++
> >  net/unix/garbage.c    | 74 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 76 insertions(+)
> > 
> > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > index f31ad1166346..970a91da2239 100644
> > --- a/include/net/af_unix.h
> > +++ b/include/net/af_unix.h
> > @@ -33,12 +33,14 @@ struct unix_vertex {
> >  	struct list_head edges;
> >  	struct list_head entry;
> >  	unsigned long out_degree;
> > +	unsigned long index;
> >  };
> >  
> >  struct unix_edge {
> >  	struct unix_sock *predecessor;
> >  	struct unix_sock *successor;
> >  	struct list_head vertex_entry;
> > +	struct list_head stack_entry;
> >  };
> >  
> >  struct sock *unix_peer_get(struct sock *sk);
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 84c8ea524a98..8b16ab9e240e 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -103,6 +103,11 @@ struct unix_sock *unix_get_socket(struct file *filp)
> >  
> >  static LIST_HEAD(unix_unvisited_vertices);
> >  
> > +enum unix_vertex_index {
> > +	UNIX_VERTEX_INDEX_UNVISITED,
> > +	UNIX_VERTEX_INDEX_START,
> > +};
> > +
> >  static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
> >  {
> >  	struct unix_vertex *vertex = edge->predecessor->vertex;
> > @@ -241,6 +246,73 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
> >  	unix_free_vertices(fpl);
> >  }
> >  
> > +static LIST_HEAD(unix_visited_vertices);
> > +
> > +static void __unix_walk_scc(struct unix_vertex *vertex)
> > +{
> > +	unsigned long index = UNIX_VERTEX_INDEX_START;
> > +	struct unix_edge *edge;
> > +	LIST_HEAD(edge_stack);
> > +
> > +next_vertex:
> > +	vertex->index = index;
> > +	index++;
> > +
> > +	/* Explore neighbour vertices (receivers of the current vertex's fd). */
> > +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> > +		struct unix_vertex *next_vertex = edge->successor->vertex;
> > +
> > +		if (!next_vertex)
> > +			continue;
> > +
> > +		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
> > +			/* Iterative deepening depth first search
> > +			 *
> > +			 *   1. Push a forward edge to edge_stack and set
> > +			 *      the successor to vertex for the next iteration.
> > +			 */
> > +			list_add(&edge->stack_entry, &edge_stack);
> > +
> > +			vertex = next_vertex;
> > +			goto next_vertex;
> > +
> > +			/*   2. Pop the edge directed to the current vertex
> > +			 *      and restore the ancestor for backtracking.
> > +			 */
> > +prev_vertex:
> > +			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
> > +			list_del_init(&edge->stack_entry);
> > +
> > +			vertex = edge->predecessor->vertex;
> > +		}
> > +	}
> > +
> > +	/* Don't restart DFS from this vertex in unix_walk_scc(). */
> > +	list_move_tail(&vertex->entry, &unix_visited_vertices);
> > +
> > +	/* Need backtracking ? */
> > +	if (!list_empty(&edge_stack))
> > +		goto prev_vertex;
> > +}
> > +
> > +static void unix_walk_scc(void)
> > +{
> > +	struct unix_vertex *vertex;
> > +
> > +	list_for_each_entry(vertex, &unix_unvisited_vertices, entry)
> > +		vertex->index = UNIX_VERTEX_INDEX_UNVISITED;
> > +
> > +	/* Visit every vertex exactly once.
> > +	 * __unix_walk_scc() moves visited vertices to unix_visited_vertices.
> > +	 */
> > +	while (!list_empty(&unix_unvisited_vertices)) {
> > +		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
> > +		__unix_walk_scc(vertex);
> 
> If I read correctly this is the single loop that could use the CPU for
> a possibly long time. I'm wondering if adding a cond_resched_lock()
> here (and possibly move the gc to a specific thread instead of a
> workqueue) would make sense.
> 
> It could avoid possibly very long starvation of the system wq and the
> used CPU.

TIL cond_resched_lock() :)

The loop is executed only when we add/del/update edges whose successor
is alredy inflight, and the loop takes O(|Vertex| + |Edges|).

If the loop takes so long, there would be too many inflight skbs, but
the current GC does not care such a situation.

The suggestion still makes sense, and then we could use mutex instead of
spinlock after kthread conversion.

Given the series has already 15 patches, I can add cond_resched_lock()
only in the next version and include the conversion in the follow-up
patches or include both changes as followup or do the conversion as prep.

What's the preferred option ?

Thanks!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 12/15] af_unix: Assign a unique index to SCC.
  2024-03-05  8:44   ` Paolo Abeni
@ 2024-03-06 20:59     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2024-03-06 20:59 UTC (permalink / raw)
  To: pabeni; +Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 05 Mar 2024 09:44:00 +0100
> On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> > The definition of the lowlink in Tarjan's algorithm is the
> > smallest index of a vertex that is reachable with at most one
> > back-edge in SCC.  This is not useful for a cross-edge.
> > 
> > If we start traversing from A in the following graph, the final
> > lowlink of D is 3.  The cross-edge here is one between D and C.
> > 
> >   A -> B -> D   D = (4, 3)  (index, lowlink)
> >   ^    |    |   C = (3, 1)
> >   |    V    |   B = (2, 1)
> >   `--- C <--'   A = (1, 1)
> > 
> > This is because the lowlink of D is updated with the index of C.
> > 
> > In the following patch, we detect a dead SCC by checking two
> > conditions for each vertex.
> > 
> >   1) vertex has no edge directed to another SCC (no bridge)
> >   2) vertex's out_degree is the same as the refcount of its file
> > 
> > If 1) is false, there is a receiver of all fds of the SCC and
> > its ancestor SCC.
> > 
> > To evaluate 1), we need to assign a unique index to each SCC and
> > assign it to all vertices in the SCC.
> > 
> > This patch changes the lowlink update logic for cross-edge so
> > that in the example above, the lowlink of D is updated with the
> > lowlink of C.
> > 
> >   A -> B -> D   D = (4, 1)  (index, lowlink)
> >   ^    |    |   C = (3, 1)
> >   |    V    |   B = (2, 1)
> >   `--- C <--'   A = (1, 1)
> > 
> > Then, all vertices in the same SCC have the same lowlink, and we
> > can quickly find the bridge connecting to different SCC if exists.
> > 
> > However, it is no longer called lowlink, so we rename it to
> > scc_index.  (It's sometimes called lowpoint.)
> 
> I'm wondering if there is any reference to this variation of Tarjan's
> algorithm you can point, to help understanding, future memory,
> reviewing.

I don't have any reference... perhaps we can add comment like
/* why ? git-blame me. */ or .rst file under Documentation/ about
why GC is needed, how GC works / what algorithm is used, etc.

When I was wondering the same thing, I googled and found someone
who had the same question, but there was no reference.

  https://stackoverflow.com/questions/23213993/what-is-the-lowelink-mean-of-tarjans-algorithm

There might be a text book but I couldn't find online resources.
Even wiki says it looks odd.

  > // The next line may look odd - but is correct.
  > // It says w.index not w.lowlink; that is deliberate and from the original paper
  > v.lowlink := min(v.lowlink, w.index)
  https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm

Regarding "lowpoint", I saw it in the wiki for the first time.

  > The lowlink is different from the lowpoint, which is the smallest
  > index reachable from v through any part of the graph.[1]: 156 [2]

In a pdf linked from the wiki:

  > lowpoint(v) = The lowest numbered vertex reachable from v using
  > zero or more tree edges followed by at most one back or cross edge.
  https://www.cs.cmu.edu/~15451-f18/lectures/lec19-DFS-strong-components.pdf

But I've just found that the original paper used LOWPT, which
is called lowlink now... :S

  > LOWPT(v) :=min(LOWPT(v) ,NUMBER(w)) ;
  https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4569669

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS.
  2024-03-06 20:14     ` Kuniyuki Iwashima
@ 2024-03-07  9:08       ` Paolo Abeni
  0 siblings, 0 replies; 25+ messages in thread
From: Paolo Abeni @ 2024-03-07  9:08 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, kuni1840, netdev

On Wed, 2024-03-06 at 12:14 -0800, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 05 Mar 2024 09:53:17 +0100
> > On Thu, 2024-02-29 at 18:22 -0800, Kuniyuki Iwashima wrote:
> > > The new GC will use a depth first search graph algorithm to find
> > > cyclic references.  The algorithm visits every vertex exactly once.
> > > 
> > > Here, we implement its DFS part without recursion so that no one
> > > can abuse it.
> > > 
> > > unix_walk_scc() marks every vertex unvisited by initialising index
> > > as UNIX_VERTEX_INDEX_UNVISITED, iterates inflight vertices in
> > > unix_unvisited_vertices, and call __unix_walk_scc() to start DFS
> > > from an arbitrary vertex.
> > > 
> > > __unix_walk_scc() iterates all edges starting from the vertex and
> > > explores the neighbour vertices with DFS using edge_stack.
> > > 
> > > After visiting all neighbours, __unix_walk_scc() moves the visited
> > > vertex to unix_visited_vertices so that unix_walk_scc() will not
> > > restart DFS from the visited vertex.
> > > 
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  include/net/af_unix.h |  2 ++
> > >  net/unix/garbage.c    | 74 +++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 76 insertions(+)
> > > 
> > > diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> > > index f31ad1166346..970a91da2239 100644
> > > --- a/include/net/af_unix.h
> > > +++ b/include/net/af_unix.h
> > > @@ -33,12 +33,14 @@ struct unix_vertex {
> > >  	struct list_head edges;
> > >  	struct list_head entry;
> > >  	unsigned long out_degree;
> > > +	unsigned long index;
> > >  };
> > >  
> > >  struct unix_edge {
> > >  	struct unix_sock *predecessor;
> > >  	struct unix_sock *successor;
> > >  	struct list_head vertex_entry;
> > > +	struct list_head stack_entry;
> > >  };
> > >  
> > >  struct sock *unix_peer_get(struct sock *sk);
> > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > index 84c8ea524a98..8b16ab9e240e 100644
> > > --- a/net/unix/garbage.c
> > > +++ b/net/unix/garbage.c
> > > @@ -103,6 +103,11 @@ struct unix_sock *unix_get_socket(struct file *filp)
> > >  
> > >  static LIST_HEAD(unix_unvisited_vertices);
> > >  
> > > +enum unix_vertex_index {
> > > +	UNIX_VERTEX_INDEX_UNVISITED,
> > > +	UNIX_VERTEX_INDEX_START,
> > > +};
> > > +
> > >  static void unix_add_edge(struct scm_fp_list *fpl, struct unix_edge *edge)
> > >  {
> > >  	struct unix_vertex *vertex = edge->predecessor->vertex;
> > > @@ -241,6 +246,73 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
> > >  	unix_free_vertices(fpl);
> > >  }
> > >  
> > > +static LIST_HEAD(unix_visited_vertices);
> > > +
> > > +static void __unix_walk_scc(struct unix_vertex *vertex)
> > > +{
> > > +	unsigned long index = UNIX_VERTEX_INDEX_START;
> > > +	struct unix_edge *edge;
> > > +	LIST_HEAD(edge_stack);
> > > +
> > > +next_vertex:
> > > +	vertex->index = index;
> > > +	index++;
> > > +
> > > +	/* Explore neighbour vertices (receivers of the current vertex's fd). */
> > > +	list_for_each_entry(edge, &vertex->edges, vertex_entry) {
> > > +		struct unix_vertex *next_vertex = edge->successor->vertex;
> > > +
> > > +		if (!next_vertex)
> > > +			continue;
> > > +
> > > +		if (next_vertex->index == UNIX_VERTEX_INDEX_UNVISITED) {
> > > +			/* Iterative deepening depth first search
> > > +			 *
> > > +			 *   1. Push a forward edge to edge_stack and set
> > > +			 *      the successor to vertex for the next iteration.
> > > +			 */
> > > +			list_add(&edge->stack_entry, &edge_stack);
> > > +
> > > +			vertex = next_vertex;
> > > +			goto next_vertex;
> > > +
> > > +			/*   2. Pop the edge directed to the current vertex
> > > +			 *      and restore the ancestor for backtracking.
> > > +			 */
> > > +prev_vertex:
> > > +			edge = list_first_entry(&edge_stack, typeof(*edge), stack_entry);
> > > +			list_del_init(&edge->stack_entry);
> > > +
> > > +			vertex = edge->predecessor->vertex;
> > > +		}
> > > +	}
> > > +
> > > +	/* Don't restart DFS from this vertex in unix_walk_scc(). */
> > > +	list_move_tail(&vertex->entry, &unix_visited_vertices);
> > > +
> > > +	/* Need backtracking ? */
> > > +	if (!list_empty(&edge_stack))
> > > +		goto prev_vertex;
> > > +}
> > > +
> > > +static void unix_walk_scc(void)
> > > +{
> > > +	struct unix_vertex *vertex;
> > > +
> > > +	list_for_each_entry(vertex, &unix_unvisited_vertices, entry)
> > > +		vertex->index = UNIX_VERTEX_INDEX_UNVISITED;
> > > +
> > > +	/* Visit every vertex exactly once.
> > > +	 * __unix_walk_scc() moves visited vertices to unix_visited_vertices.
> > > +	 */
> > > +	while (!list_empty(&unix_unvisited_vertices)) {
> > > +		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
> > > +		__unix_walk_scc(vertex);
> > 
> > If I read correctly this is the single loop that could use the CPU for
> > a possibly long time. I'm wondering if adding a cond_resched_lock()
> > here (and possibly move the gc to a specific thread instead of a
> > workqueue) would make sense.
> > 
> > It could avoid possibly very long starvation of the system wq and the
> > used CPU.
> 
> TIL cond_resched_lock() :)
> 
> The loop is executed only when we add/del/update edges whose successor
> is alredy inflight, and the loop takes O(|Vertex| + |Edges|).
> 
> If the loop takes so long, there would be too many inflight skbs, but
> the current GC does not care such a situation.

Yes, the whole idea was about a possible additional improvement WRT the
current status.

> The suggestion still makes sense, and then we could use mutex instead of
> spinlock after kthread conversion.
> 
> Given the series has already 15 patches, I can add cond_resched_lock()
> only in the next version and include the conversion in the follow-up
> patches or include both changes as followup or do the conversion as prep.
> 
> What's the preferred option ?

Keep all the 'new stuff' in follow-up series would be the best option
IMHO.

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-03-07  9:08 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-01  2:22 [PATCH v4 net-next 00/15] af_unix: Rework GC Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 01/15] af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 02/15] af_unix: Allocate struct unix_edge " Kuniyuki Iwashima
2024-03-05  8:47   ` Paolo Abeni
2024-03-06 19:44     ` Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 03/15] af_unix: Link struct unix_edge when queuing skb Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 04/15] af_unix: Bulk update unix_tot_inflight/unix_inflight " Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 05/15] af_unix: Iterate all vertices by DFS Kuniyuki Iwashima
2024-03-05  8:53   ` Paolo Abeni
2024-03-06 20:14     ` Kuniyuki Iwashima
2024-03-07  9:08       ` Paolo Abeni
2024-03-01  2:22 ` [PATCH v4 net-next 06/15] af_unix: Detect Strongly Connected Components Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 07/15] af_unix: Save listener for embryo socket Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 08/15] af_unix: Fix up unix_edge.successor " Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 09/15] af_unix: Save O(n) setup of Tarjan's algo Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 10/15] af_unix: Skip GC if no cycle exists Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 11/15] af_unix: Avoid Tarjan's algorithm if unnecessary Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 12/15] af_unix: Assign a unique index to SCC Kuniyuki Iwashima
2024-03-05  8:44   ` Paolo Abeni
2024-03-06 20:59     ` Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 13/15] af_unix: Detect dead SCC Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 14/15] af_unix: Replace garbage collection algorithm Kuniyuki Iwashima
2024-03-01  2:22 ` [PATCH v4 net-next 15/15] selftest: af_unix: Test GC for SCM_RIGHTS Kuniyuki Iwashima
2024-03-04 16:18 ` [PATCH v4 net-next 00/15] af_unix: Rework GC Paolo Abeni
2024-03-04 17:31   ` Kuniyuki Iwashima

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.