wait_for_unix_gc can cause CPU overload for well behaved programs

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* wait_for_unix_gc can cause CPU overload for well behaved programs
@ 2023-10-19 22:35 Ivan Babrou
       [not found] ` <20231020104728.2060-1-hdanton@sina.com>
  2023-10-20 22:05 ` Kuniyuki Iwashima
  0 siblings, 2 replies; 9+ messages in thread
From: Ivan Babrou @ 2023-10-19 22:35 UTC (permalink / raw)
  To: Linux Kernel Network Developers
  Cc: kernel-team, Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel

Hello,

We have observed this issue twice (2019 and 2023): a well behaved
service that doesn't pass any file descriptors around starts to spend
a ton of CPU time in wait_for_unix_gc.

The cause of this is that the unix send path unconditionally calls
wait_for_unix_gc, which is a global garbage collection. If any
misbehaved program exists on a system, it can force extra work for
well behaved programs.

This behavior is not new: 9915672d4127 ("af_unix: limit
unix_tot_inflight") is from 2010.

I managed to come up with a repro for this behavior:

* https://gist.github.com/bobrik/82e5722261920c9f23d9402b88a0bb27

It also includes a flamegraph illustrating the issue. It's all in one
program for convenience, but in reality the offender not picking up
SCM_RIGHTS messages and the suffering program just minding its own
business are separate.

It is also non-trivial to find the offender when this happens as it
can be completely idle while wrecking havoc for the rest of the
system.

I don't think it's fair to penalize every unix_stream_sendmsg like
this. The 16k threshold also doesn't feel very flexible, surely
computers are bigger these days and can handle more.

^ permalink raw reply	[flat|nested] 9+ messages in thread

[parent not found: <20231020104728.2060-1-hdanton@sina.com>]

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
       [not found] ` <20231020104728.2060-1-hdanton@sina.com>
@ 2023-10-20 17:25   ` Ivan Babrou
       [not found]     ` <20231021012322.1799-1-hdanton@sina.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Babrou @ 2023-10-20 17:25 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Linux Kernel Network Developers, kernel-team, Eric Dumazet,
	Paolo Abeni, linux-kernel

On Fri, Oct 20, 2023 at 3:47 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Thu, 19 Oct 2023 15:35:01 -0700 Ivan Babrou <ivan@cloudflare.com>
> > Hello,
> >
> > We have observed this issue twice (2019 and 2023): a well behaved
> > service that doesn't pass any file descriptors around starts to spend
> > a ton of CPU time in wait_for_unix_gc.
>
> See if the diff below works for you, which prevents concurrent spinning
> of unix_gc_lock, a variant of spin_trylock().
>
> Hillf
> --- x/net/unix/garbage.c
> +++ y/net/unix/garbage.c
> @@ -211,15 +211,10 @@ void unix_gc(void)
>         struct list_head cursor;
>         LIST_HEAD(not_cycle_list);
>
> +       if (test_and_set_bit(0, &gc_in_progress))
> +               return;
>         spin_lock(&unix_gc_lock);
>
> -       /* Avoid a recursive GC. */
> -       if (gc_in_progress)
> -               goto out;
> -
> -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> -       WRITE_ONCE(gc_in_progress, true);
> -
>         /* First, select candidates for garbage collection.  Only
>          * in-flight sockets are considered, and from those only ones
>          * which don't have any external reference.
> --

This could solve wait_for_unix_gc spinning, but it wouldn't affect
unix_gc itself, from what I understand. There would always be one
socket writer or destroyer punished by running the gc still. My linked
repro code exercises that path rather than the waiting spinlock
(there's a single writer thread), so it's something you can see for
yourself.

Your patch doesn't build, so I wasn't able to try it out:

#26 154.3 /build/linux-source/net/unix/garbage.c: In function 'unix_gc':
#26 154.3 /build/linux-source/net/unix/garbage.c:214:33: error:
passing argument 2 of 'test_and_set_bit' from incompatible pointer
type [-Werror=incompatible-pointer-types]
#26 154.3   214 |         if (test_and_set_bit(0, &gc_in_progress))
#26 154.3       |                                 ^~~~~~~~~~~~~~~
#26 154.3       |                                 |
#26 154.3       |                                 bool * {aka _Bool *}
#26 154.3 In file included from
/build/linux-source/include/asm-generic/bitops/atomic.h:68,
#26 154.3                  from
/build/linux-source/arch/arm64/include/asm/bitops.h:25,
#26 154.3                  from /build/linux-source/include/linux/bitops.h:68,
#26 154.3                  from /build/linux-source/include/linux/kernel.h:22,
#26 154.3                  from /build/linux-source/net/unix/garbage.c:66:
#26 154.3 /build/linux-source/include/asm-generic/bitops/instrumented-atomic.h:68:79:
note: expected 'volatile long unsigned int *' but argument is of type
'bool *' {aka '_Bool *'}
#26 154.3    68 | static __always_inline bool test_and_set_bit(long
nr, volatile unsigned long *addr)
#26 154.3       |
 ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
#26 154.3 /build/linux-source/net/unix/garbage.c:328:2: error: label
'out' defined but not used [-Werror=unused-label]
#26 154.3   328 |  out:
#26 154.3       |  ^~~
#26 154.3 cc1: all warnings being treated as errors

^ permalink raw reply	[flat|nested] 9+ messages in thread

[parent not found: <20231021012322.1799-1-hdanton@sina.com>]

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
       [not found]     ` <20231021012322.1799-1-hdanton@sina.com>
@ 2023-10-23 23:22       ` Ivan Babrou
  2023-10-23 23:45         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Babrou @ 2023-10-23 23:22 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Linux Kernel Network Developers, kernel-team, Eric Dumazet,
	Paolo Abeni, Kuniyuki Iwashima, linux-kernel

On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com>
> >
> > This could solve wait_for_unix_gc spinning, but it wouldn't affect
> > unix_gc itself, from what I understand. There would always be one
> > socket writer or destroyer punished by running the gc still.
>
> See what you want. The innocents are rescued by kicking a worker off.
> Only for thoughts.
>
> --- x/net/unix/garbage.c
> +++ y/net/unix/garbage.c
> @@ -86,7 +86,6 @@
>  /* Internal data structures and random procedures: */
>
>  static LIST_HEAD(gc_candidates);
> -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
>
>  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
>                           struct sk_buff_head *hitlist)
> @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc
>                 list_move_tail(&u->link, &gc_candidates);
>  }
>
> -static bool gc_in_progress;
> +static void __unix_gc(struct work_struct *w);
> +static DECLARE_WORK(unix_gc_work, __unix_gc);
> +
>  #define UNIX_INFLIGHT_TRIGGER_GC 16000
>
>  void wait_for_unix_gc(void)
>  {
>         /* If number of inflight sockets is insane,
> -        * force a garbage collect right now.
> -        * Paired with the WRITE_ONCE() in unix_inflight(),
> -        * unix_notinflight() and gc_in_progress().
> -        */
> -       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
> -           !READ_ONCE(gc_in_progress))
> -               unix_gc();
> -       wait_event(unix_gc_wait, gc_in_progress == false);
> +        * kick a garbage collect right now.
> +        *
> +        * todo s/wait_for_unix_gc/kick_unix_gc/
> +        */
> +       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2)
> +               queue_work(system_unbound_wq, &unix_gc_work);
>  }
>
> -/* The external entry point: unix_gc() */
> -void unix_gc(void)
> +static DEFINE_MUTEX(unix_gc_mutex);
> +
> +static void __unix_gc(struct work_struct *w)
>  {
>         struct sk_buff *next_skb, *skb;
>         struct unix_sock *u;
> @@ -211,15 +211,10 @@ void unix_gc(void)
>         struct list_head cursor;
>         LIST_HEAD(not_cycle_list);
>
> +       if (!mutex_trylock(&unix_gc_mutex))
> +               return;
>         spin_lock(&unix_gc_lock);
>
> -       /* Avoid a recursive GC. */
> -       if (gc_in_progress)
> -               goto out;
> -
> -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> -       WRITE_ONCE(gc_in_progress, true);
> -
>         /* First, select candidates for garbage collection.  Only
>          * in-flight sockets are considered, and from those only ones
>          * which don't have any external reference.
> @@ -325,11 +320,12 @@ void unix_gc(void)
>         /* All candidates should have been detached by now. */
>         BUG_ON(!list_empty(&gc_candidates));
>
> -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> -       WRITE_ONCE(gc_in_progress, false);
> -
> -       wake_up(&unix_gc_wait);
> -
> - out:
>         spin_unlock(&unix_gc_lock);
> +       mutex_unlock(&unix_gc_mutex);
> +}
> +
> +/* The external entry point: unix_gc() */
> +void unix_gc(void)
> +{
> +       __unix_gc(NULL);
>  }
> --

This one results in less overall load than Kuniyuki's proposed patch
with my repro:

* https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/

My guess is that's because my repro is the one that is getting penalized there.

There's still a lot work done in unix_release_sock here, where GC runs
as long as you have any fds inflight:

* https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670

Perhaps it can be improved.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-10-23 23:22       ` Ivan Babrou
@ 2023-10-23 23:45         ` Kuniyuki Iwashima
  2023-11-17 23:38           ` Ivan Babrou
  0 siblings, 1 reply; 9+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-23 23:45 UTC (permalink / raw)
  To: ivan; +Cc: edumazet, hdanton, kernel-team, kuniyu, linux-kernel, netdev, pabeni

From: Ivan Babrou <ivan@cloudflare.com>
Date: Mon, 23 Oct 2023 16:22:35 -0700
> On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com>
> > >
> > > This could solve wait_for_unix_gc spinning, but it wouldn't affect
> > > unix_gc itself, from what I understand. There would always be one
> > > socket writer or destroyer punished by running the gc still.
> >
> > See what you want. The innocents are rescued by kicking a worker off.
> > Only for thoughts.
> >
> > --- x/net/unix/garbage.c
> > +++ y/net/unix/garbage.c
> > @@ -86,7 +86,6 @@
> >  /* Internal data structures and random procedures: */
> >
> >  static LIST_HEAD(gc_candidates);
> > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
> >
> >  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
> >                           struct sk_buff_head *hitlist)
> > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc
> >                 list_move_tail(&u->link, &gc_candidates);
> >  }
> >
> > -static bool gc_in_progress;
> > +static void __unix_gc(struct work_struct *w);
> > +static DECLARE_WORK(unix_gc_work, __unix_gc);
> > +
> >  #define UNIX_INFLIGHT_TRIGGER_GC 16000
> >
> >  void wait_for_unix_gc(void)
> >  {
> >         /* If number of inflight sockets is insane,
> > -        * force a garbage collect right now.
> > -        * Paired with the WRITE_ONCE() in unix_inflight(),
> > -        * unix_notinflight() and gc_in_progress().
> > -        */
> > -       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
> > -           !READ_ONCE(gc_in_progress))
> > -               unix_gc();
> > -       wait_event(unix_gc_wait, gc_in_progress == false);
> > +        * kick a garbage collect right now.
> > +        *
> > +        * todo s/wait_for_unix_gc/kick_unix_gc/
> > +        */
> > +       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2)
> > +               queue_work(system_unbound_wq, &unix_gc_work);
> >  }
> >
> > -/* The external entry point: unix_gc() */
> > -void unix_gc(void)
> > +static DEFINE_MUTEX(unix_gc_mutex);
> > +
> > +static void __unix_gc(struct work_struct *w)
> >  {
> >         struct sk_buff *next_skb, *skb;
> >         struct unix_sock *u;
> > @@ -211,15 +211,10 @@ void unix_gc(void)
> >         struct list_head cursor;
> >         LIST_HEAD(not_cycle_list);
> >
> > +       if (!mutex_trylock(&unix_gc_mutex))
> > +               return;
> >         spin_lock(&unix_gc_lock);
> >
> > -       /* Avoid a recursive GC. */
> > -       if (gc_in_progress)
> > -               goto out;
> > -
> > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > -       WRITE_ONCE(gc_in_progress, true);
> > -
> >         /* First, select candidates for garbage collection.  Only
> >          * in-flight sockets are considered, and from those only ones
> >          * which don't have any external reference.
> > @@ -325,11 +320,12 @@ void unix_gc(void)
> >         /* All candidates should have been detached by now. */
> >         BUG_ON(!list_empty(&gc_candidates));
> >
> > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > -       WRITE_ONCE(gc_in_progress, false);
> > -
> > -       wake_up(&unix_gc_wait);
> > -
> > - out:
> >         spin_unlock(&unix_gc_lock);
> > +       mutex_unlock(&unix_gc_mutex);
> > +}
> > +
> > +/* The external entry point: unix_gc() */
> > +void unix_gc(void)
> > +{
> > +       __unix_gc(NULL);
> >  }
> > --
> 
> This one results in less overall load than Kuniyuki's proposed patch
> with my repro:
> 
> * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/
> 
> My guess is that's because my repro is the one that is getting penalized there.

Thanks for testing, and yes.

It would be good to split the repro to one offender and one normal
process, run them on different users, and measure load on the normal
process.


> There's still a lot work done in unix_release_sock here, where GC runs
> as long as you have any fds inflight:
> 
> * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670
> 
> Perhaps it can be improved.

Yes, it also can be done async by worker as done in my first patch.
I replaced schedule_work() with queue_work() to avoid using system_wq
as gc could take long.

Could you try this ?

---8<---
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 824c258143a3..3b38e21116f1 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -13,6 +13,7 @@ void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_destruct_scm(struct sk_buff *skb);
 void io_uring_destruct_scm(struct sk_buff *skb);
 void unix_gc(void);
+void unix_gc_flush(void);
 void wait_for_unix_gc(void);
 struct sock *unix_get_socket(struct file *filp);
 struct sock *unix_peer_get(struct sock *sk);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..ed3251753417 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3683,6 +3683,7 @@ static int __init af_unix_init(void)
 
 static void __exit af_unix_exit(void)
 {
+	unix_gc_flush();
 	sock_unregister(PF_UNIX);
 	proto_unregister(&unix_dgram_proto);
 	proto_unregister(&unix_stream_proto);
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 2405f0f9af31..51f30f89bacb 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -86,7 +86,9 @@
 /* Internal data structures and random procedures: */
 
 static LIST_HEAD(gc_candidates);
-static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
+
+static void __unix_gc(struct work_struct *work);
+static DECLARE_WORK(unix_gc_work, __unix_gc);
 
 static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			  struct sk_buff_head *hitlist)
@@ -185,24 +187,26 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 		list_move_tail(&u->link, &gc_candidates);
 }
 
-static bool gc_in_progress;
-#define UNIX_INFLIGHT_TRIGGER_GC 16000
+#define UNIX_INFLIGHT_TRIGGER_GC 16
 
 void wait_for_unix_gc(void)
 {
+	struct user_struct *user = get_uid(current_user());
+
 	/* If number of inflight sockets is insane,
-	 * force a garbage collect right now.
+	 * kick a garbage collect right now.
 	 * Paired with the WRITE_ONCE() in unix_inflight(),
-	 * unix_notinflight() and gc_in_progress().
+	 * unix_notinflight().
 	 */
-	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
-	    !READ_ONCE(gc_in_progress))
-		unix_gc();
-	wait_event(unix_gc_wait, gc_in_progress == false);
+	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
+		queue_work(system_unbound_wq, &unix_gc_work);
+
+	/* Penalise senders of not-yet-received-fd */
+	if (READ_ONCE(user->unix_inflight))
+		flush_work(&unix_gc_work);
 }
 
-/* The external entry point: unix_gc() */
-void unix_gc(void)
+static void __unix_gc(struct work_struct *work)
 {
 	struct sk_buff *next_skb, *skb;
 	struct unix_sock *u;
@@ -213,13 +217,6 @@ void unix_gc(void)
 
 	spin_lock(&unix_gc_lock);
 
-	/* Avoid a recursive GC. */
-	if (gc_in_progress)
-		goto out;
-
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
-	WRITE_ONCE(gc_in_progress, true);
-
 	/* First, select candidates for garbage collection.  Only
 	 * in-flight sockets are considered, and from those only ones
 	 * which don't have any external reference.
@@ -325,11 +322,15 @@ void unix_gc(void)
 	/* All candidates should have been detached by now. */
 	BUG_ON(!list_empty(&gc_candidates));
 
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
-	WRITE_ONCE(gc_in_progress, false);
+	spin_unlock(&unix_gc_lock);
+}
 
-	wake_up(&unix_gc_wait);
+void unix_gc(void)
+{
+	queue_work(system_unbound_wq, &unix_gc_work);
+}
 
- out:
-	spin_unlock(&unix_gc_lock);
+void __exit unix_gc_flush(void)
+{
+	cancel_work_sync(&unix_gc_work);
 }
---8<---

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-10-23 23:45         ` Kuniyuki Iwashima
@ 2023-11-17 23:38           ` Ivan Babrou
  2023-11-20 19:29             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Babrou @ 2023-11-17 23:38 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: edumazet, hdanton, kernel-team, linux-kernel, netdev, pabeni

On Mon, Oct 23, 2023 at 4:46 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Ivan Babrou <ivan@cloudflare.com>
> Date: Mon, 23 Oct 2023 16:22:35 -0700
> > On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com>
> > > >
> > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect
> > > > unix_gc itself, from what I understand. There would always be one
> > > > socket writer or destroyer punished by running the gc still.
> > >
> > > See what you want. The innocents are rescued by kicking a worker off.
> > > Only for thoughts.
> > >
> > > --- x/net/unix/garbage.c
> > > +++ y/net/unix/garbage.c
> > > @@ -86,7 +86,6 @@
> > >  /* Internal data structures and random procedures: */
> > >
> > >  static LIST_HEAD(gc_candidates);
> > > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
> > >
> > >  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
> > >                           struct sk_buff_head *hitlist)
> > > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc
> > >                 list_move_tail(&u->link, &gc_candidates);
> > >  }
> > >
> > > -static bool gc_in_progress;
> > > +static void __unix_gc(struct work_struct *w);
> > > +static DECLARE_WORK(unix_gc_work, __unix_gc);
> > > +
> > >  #define UNIX_INFLIGHT_TRIGGER_GC 16000
> > >
> > >  void wait_for_unix_gc(void)
> > >  {
> > >         /* If number of inflight sockets is insane,
> > > -        * force a garbage collect right now.
> > > -        * Paired with the WRITE_ONCE() in unix_inflight(),
> > > -        * unix_notinflight() and gc_in_progress().
> > > -        */
> > > -       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
> > > -           !READ_ONCE(gc_in_progress))
> > > -               unix_gc();
> > > -       wait_event(unix_gc_wait, gc_in_progress == false);
> > > +        * kick a garbage collect right now.
> > > +        *
> > > +        * todo s/wait_for_unix_gc/kick_unix_gc/
> > > +        */
> > > +       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2)
> > > +               queue_work(system_unbound_wq, &unix_gc_work);
> > >  }
> > >
> > > -/* The external entry point: unix_gc() */
> > > -void unix_gc(void)
> > > +static DEFINE_MUTEX(unix_gc_mutex);
> > > +
> > > +static void __unix_gc(struct work_struct *w)
> > >  {
> > >         struct sk_buff *next_skb, *skb;
> > >         struct unix_sock *u;
> > > @@ -211,15 +211,10 @@ void unix_gc(void)
> > >         struct list_head cursor;
> > >         LIST_HEAD(not_cycle_list);
> > >
> > > +       if (!mutex_trylock(&unix_gc_mutex))
> > > +               return;
> > >         spin_lock(&unix_gc_lock);
> > >
> > > -       /* Avoid a recursive GC. */
> > > -       if (gc_in_progress)
> > > -               goto out;
> > > -
> > > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > -       WRITE_ONCE(gc_in_progress, true);
> > > -
> > >         /* First, select candidates for garbage collection.  Only
> > >          * in-flight sockets are considered, and from those only ones
> > >          * which don't have any external reference.
> > > @@ -325,11 +320,12 @@ void unix_gc(void)
> > >         /* All candidates should have been detached by now. */
> > >         BUG_ON(!list_empty(&gc_candidates));
> > >
> > > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > -       WRITE_ONCE(gc_in_progress, false);
> > > -
> > > -       wake_up(&unix_gc_wait);
> > > -
> > > - out:
> > >         spin_unlock(&unix_gc_lock);
> > > +       mutex_unlock(&unix_gc_mutex);
> > > +}
> > > +
> > > +/* The external entry point: unix_gc() */
> > > +void unix_gc(void)
> > > +{
> > > +       __unix_gc(NULL);
> > >  }
> > > --
> >
> > This one results in less overall load than Kuniyuki's proposed patch
> > with my repro:
> >
> > * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/
> >
> > My guess is that's because my repro is the one that is getting penalized there.
>
> Thanks for testing, and yes.
>
> It would be good to split the repro to one offender and one normal
> process, run them on different users, and measure load on the normal
> process.
>
>
> > There's still a lot work done in unix_release_sock here, where GC runs
> > as long as you have any fds inflight:
> >
> > * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670
> >
> > Perhaps it can be improved.
>
> Yes, it also can be done async by worker as done in my first patch.
> I replaced schedule_work() with queue_work() to avoid using system_wq
> as gc could take long.
>
> Could you try this ?

Apologies for the long wait, I was OOO.

A bit of a problem here is that unix_gc is called unconditionally in
unix_release_sock. It's done asynchronously now and it can only
consume a single CPU with your changes, which is a lot better than
before. I am wondering if we can still do better to only trigger gc
when it touches unix sockets that have inflight fds.

Commit 3c32da19a858 ("unix: Show number of pending scm files of
receive queue in fdinfo") added "struct scm_stat" to "struct
unix_sock", which can be used to check for the number of inflight fds
in the unix socket. Can we use that to drive the GC? I think we can:

* Trigger unix_gc from unix_release_sock if there's a non-zero number
of inflight fds in the socket being destroyed.
* Trigger wait_for_unix_gc from the write path only if the write
contains fds or if the socket contains inflight fds. Alternatively, we
can run gc at the end of the write path and only check for inflight
fds on the socket there, which is probably simpler.

GC only applies to unix sockets inflight of other unix sockets, so GC
can still affect sockets passing regular fds around, but it wouldn't
affect non-fd-passing unix sockets, which are usually in the data
path.

This way we don't have to check for per-user inflight fds, which means
that services running as "nobody" wouldn't be punished for other
services running as "nobody" screwing up.

Does this sound like a reasonable approach?

> ---8<---
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index 824c258143a3..3b38e21116f1 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -13,6 +13,7 @@ void unix_notinflight(struct user_struct *user, struct file *fp);
>  void unix_destruct_scm(struct sk_buff *skb);
>  void io_uring_destruct_scm(struct sk_buff *skb);
>  void unix_gc(void);
> +void unix_gc_flush(void);
>  void wait_for_unix_gc(void);
>  struct sock *unix_get_socket(struct file *filp);
>  struct sock *unix_peer_get(struct sock *sk);
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 3e8a04a13668..ed3251753417 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -3683,6 +3683,7 @@ static int __init af_unix_init(void)
>
>  static void __exit af_unix_exit(void)
>  {
> +       unix_gc_flush();
>         sock_unregister(PF_UNIX);
>         proto_unregister(&unix_dgram_proto);
>         proto_unregister(&unix_stream_proto);
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 2405f0f9af31..51f30f89bacb 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -86,7 +86,9 @@
>  /* Internal data structures and random procedures: */
>
>  static LIST_HEAD(gc_candidates);
> -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
> +
> +static void __unix_gc(struct work_struct *work);
> +static DECLARE_WORK(unix_gc_work, __unix_gc);
>
>  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
>                           struct sk_buff_head *hitlist)
> @@ -185,24 +187,26 @@ static void inc_inflight_move_tail(struct unix_sock *u)
>                 list_move_tail(&u->link, &gc_candidates);
>  }
>
> -static bool gc_in_progress;
> -#define UNIX_INFLIGHT_TRIGGER_GC 16000
> +#define UNIX_INFLIGHT_TRIGGER_GC 16

It's probably best to keep it at 16k.

>  void wait_for_unix_gc(void)
>  {
> +       struct user_struct *user = get_uid(current_user());
> +
>         /* If number of inflight sockets is insane,
> -        * force a garbage collect right now.
> +        * kick a garbage collect right now.
>          * Paired with the WRITE_ONCE() in unix_inflight(),
> -        * unix_notinflight() and gc_in_progress().
> +        * unix_notinflight().
>          */
> -       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
> -           !READ_ONCE(gc_in_progress))
> -               unix_gc();
> -       wait_event(unix_gc_wait, gc_in_progress == false);
> +       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
> +               queue_work(system_unbound_wq, &unix_gc_work);
> +
> +       /* Penalise senders of not-yet-received-fd */
> +       if (READ_ONCE(user->unix_inflight))
> +               flush_work(&unix_gc_work);
>  }
>
> -/* The external entry point: unix_gc() */
> -void unix_gc(void)
> +static void __unix_gc(struct work_struct *work)
>  {
>         struct sk_buff *next_skb, *skb;
>         struct unix_sock *u;
> @@ -213,13 +217,6 @@ void unix_gc(void)
>
>         spin_lock(&unix_gc_lock);
>
> -       /* Avoid a recursive GC. */
> -       if (gc_in_progress)
> -               goto out;
> -
> -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> -       WRITE_ONCE(gc_in_progress, true);
> -
>         /* First, select candidates for garbage collection.  Only
>          * in-flight sockets are considered, and from those only ones
>          * which don't have any external reference.
> @@ -325,11 +322,15 @@ void unix_gc(void)
>         /* All candidates should have been detached by now. */
>         BUG_ON(!list_empty(&gc_candidates));
>
> -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> -       WRITE_ONCE(gc_in_progress, false);
> +       spin_unlock(&unix_gc_lock);
> +}
>
> -       wake_up(&unix_gc_wait);
> +void unix_gc(void)
> +{
> +       queue_work(system_unbound_wq, &unix_gc_work);
> +}
>
> - out:
> -       spin_unlock(&unix_gc_lock);
> +void __exit unix_gc_flush(void)
> +{
> +       cancel_work_sync(&unix_gc_work);
>  }
> ---8<---

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-11-17 23:38           ` Ivan Babrou
@ 2023-11-20 19:29             ` Kuniyuki Iwashima
  2023-11-20 22:30               ` Ivan Babrou
  0 siblings, 1 reply; 9+ messages in thread
From: Kuniyuki Iwashima @ 2023-11-20 19:29 UTC (permalink / raw)
  To: ivan; +Cc: edumazet, hdanton, kernel-team, kuniyu, linux-kernel, netdev, pabeni

From: Ivan Babrou <ivan@cloudflare.com>
Date: Fri, 17 Nov 2023 15:38:42 -0800
> On Mon, Oct 23, 2023 at 4:46 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Ivan Babrou <ivan@cloudflare.com>
> > Date: Mon, 23 Oct 2023 16:22:35 -0700
> > > On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote:
> > > >
> > > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com>
> > > > >
> > > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect
> > > > > unix_gc itself, from what I understand. There would always be one
> > > > > socket writer or destroyer punished by running the gc still.
> > > >
> > > > See what you want. The innocents are rescued by kicking a worker off.
> > > > Only for thoughts.
> > > >
> > > > --- x/net/unix/garbage.c
> > > > +++ y/net/unix/garbage.c
> > > > @@ -86,7 +86,6 @@
> > > >  /* Internal data structures and random procedures: */
> > > >
> > > >  static LIST_HEAD(gc_candidates);
> > > > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
> > > >
> > > >  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
> > > >                           struct sk_buff_head *hitlist)
> > > > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc
> > > >                 list_move_tail(&u->link, &gc_candidates);
> > > >  }
> > > >
> > > > -static bool gc_in_progress;
> > > > +static void __unix_gc(struct work_struct *w);
> > > > +static DECLARE_WORK(unix_gc_work, __unix_gc);
> > > > +
> > > >  #define UNIX_INFLIGHT_TRIGGER_GC 16000
> > > >
> > > >  void wait_for_unix_gc(void)
> > > >  {
> > > >         /* If number of inflight sockets is insane,
> > > > -        * force a garbage collect right now.
> > > > -        * Paired with the WRITE_ONCE() in unix_inflight(),
> > > > -        * unix_notinflight() and gc_in_progress().
> > > > -        */
> > > > -       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
> > > > -           !READ_ONCE(gc_in_progress))
> > > > -               unix_gc();
> > > > -       wait_event(unix_gc_wait, gc_in_progress == false);
> > > > +        * kick a garbage collect right now.
> > > > +        *
> > > > +        * todo s/wait_for_unix_gc/kick_unix_gc/
> > > > +        */
> > > > +       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2)
> > > > +               queue_work(system_unbound_wq, &unix_gc_work);
> > > >  }
> > > >
> > > > -/* The external entry point: unix_gc() */
> > > > -void unix_gc(void)
> > > > +static DEFINE_MUTEX(unix_gc_mutex);
> > > > +
> > > > +static void __unix_gc(struct work_struct *w)
> > > >  {
> > > >         struct sk_buff *next_skb, *skb;
> > > >         struct unix_sock *u;
> > > > @@ -211,15 +211,10 @@ void unix_gc(void)
> > > >         struct list_head cursor;
> > > >         LIST_HEAD(not_cycle_list);
> > > >
> > > > +       if (!mutex_trylock(&unix_gc_mutex))
> > > > +               return;
> > > >         spin_lock(&unix_gc_lock);
> > > >
> > > > -       /* Avoid a recursive GC. */
> > > > -       if (gc_in_progress)
> > > > -               goto out;
> > > > -
> > > > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > > -       WRITE_ONCE(gc_in_progress, true);
> > > > -
> > > >         /* First, select candidates for garbage collection.  Only
> > > >          * in-flight sockets are considered, and from those only ones
> > > >          * which don't have any external reference.
> > > > @@ -325,11 +320,12 @@ void unix_gc(void)
> > > >         /* All candidates should have been detached by now. */
> > > >         BUG_ON(!list_empty(&gc_candidates));
> > > >
> > > > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > > -       WRITE_ONCE(gc_in_progress, false);
> > > > -
> > > > -       wake_up(&unix_gc_wait);
> > > > -
> > > > - out:
> > > >         spin_unlock(&unix_gc_lock);
> > > > +       mutex_unlock(&unix_gc_mutex);
> > > > +}
> > > > +
> > > > +/* The external entry point: unix_gc() */
> > > > +void unix_gc(void)
> > > > +{
> > > > +       __unix_gc(NULL);
> > > >  }
> > > > --
> > >
> > > This one results in less overall load than Kuniyuki's proposed patch
> > > with my repro:
> > >
> > > * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/
> > >
> > > My guess is that's because my repro is the one that is getting penalized there.
> >
> > Thanks for testing, and yes.
> >
> > It would be good to split the repro to one offender and one normal
> > process, run them on different users, and measure load on the normal
> > process.
> >
> >
> > > There's still a lot work done in unix_release_sock here, where GC runs
> > > as long as you have any fds inflight:
> > >
> > > * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670
> > >
> > > Perhaps it can be improved.
> >
> > Yes, it also can be done async by worker as done in my first patch.
> > I replaced schedule_work() with queue_work() to avoid using system_wq
> > as gc could take long.
> >
> > Could you try this ?
> 
> Apologies for the long wait, I was OOO.
> 
> A bit of a problem here is that unix_gc is called unconditionally in
> unix_release_sock.

unix_release_sock() calls gc only when there is a inflight socket.


> It's done asynchronously now and it can only
> consume a single CPU with your changes, which is a lot better than
> before. I am wondering if we can still do better to only trigger gc
> when it touches unix sockets that have inflight fds.
> 
> Commit 3c32da19a858 ("unix: Show number of pending scm files of
> receive queue in fdinfo") added "struct scm_stat" to "struct
> unix_sock", which can be used to check for the number of inflight fds
> in the unix socket. Can we use that to drive the GC?

I don't think the stat is useful to trigger gc.  Unless the receiver
is accessible via sendmsg(), it's not a gc candidate and we need not
care about its stats about valid FDs that are ready to be processed
and never cleaned up by gc until close().


> I think we can:
> 
> * Trigger unix_gc from unix_release_sock if there's a non-zero number
> of inflight fds in the socket being destroyed.

This is the case of now.


> * Trigger wait_for_unix_gc from the write path only if the write
> contains fds or if the socket contains inflight fds. Alternatively, we
> can run gc at the end of the write path and only check for inflight
> fds on the socket there, which is probably simpler.

I don't think it's better to call gc at the end of sendmsg() as there
would be small chance to consume memory compared to running gc in the
beginning of sendmsg().


> GC only applies to unix sockets inflight of other unix sockets, so GC
> can still affect sockets passing regular fds around, but it wouldn't
> affect non-fd-passing unix sockets, which are usually in the data
> path.

I think we can run gc only when scm contains AF_UNIX sockets by tweaking
a little bit scm processing.


> This way we don't have to check for per-user inflight fds, which means
> that services running as "nobody" wouldn't be punished for other
> services running as "nobody" screwing up.

If we do not check user, a user could send 16000 AF_UNIX fds and
other innocent users sending fds must wait gc.

I think isolating users would make more sense so you can sandbox
a suspicious process.

Probably we can move flush_work() in the preceding if.  Then, the
number of gc invocation in the "nobody" case is the same as before.

---8<---
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 51f30f89bacb..74fc208c8858 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -198,12 +198,13 @@ void wait_for_unix_gc(void)
 	 * Paired with the WRITE_ONCE() in unix_inflight(),
 	 * unix_notinflight().
 	 */
-	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
+	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC) {
 		queue_work(system_unbound_wq, &unix_gc_work);
 
-	/* Penalise senders of not-yet-received-fd */
-	if (READ_ONCE(user->unix_inflight))
-		flush_work(&unix_gc_work);
+		/* Penalise senders of not-yet-received-fd */
+		if (READ_ONCE(user->unix_inflight))
+			flush_work(&unix_gc_work);
+	}
 }
 
 static void __unix_gc(struct work_struct *work)
---8<---


> 
> Does this sound like a reasonable approach?
> 
[...]
> > -static bool gc_in_progress;
> > -#define UNIX_INFLIGHT_TRIGGER_GC 16000
> > +#define UNIX_INFLIGHT_TRIGGER_GC 16
> 
> It's probably best to keep it at 16k.

Oops, this is just for testing on my local machine easily :p

Anyway, I'll post a formal patch this week.

Thanks!

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-11-20 19:29             ` Kuniyuki Iwashima
@ 2023-11-20 22:30               ` Ivan Babrou
  2023-11-20 23:53                 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Babrou @ 2023-11-20 22:30 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: edumazet, hdanton, kernel-team, linux-kernel, netdev, pabeni

On Mon, Nov 20, 2023 at 11:29 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Ivan Babrou <ivan@cloudflare.com>
> Date: Fri, 17 Nov 2023 15:38:42 -0800
> > On Mon, Oct 23, 2023 at 4:46 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Ivan Babrou <ivan@cloudflare.com>
> > > Date: Mon, 23 Oct 2023 16:22:35 -0700
> > > > On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote:
> > > > >
> > > > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com>
> > > > > >
> > > > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect
> > > > > > unix_gc itself, from what I understand. There would always be one
> > > > > > socket writer or destroyer punished by running the gc still.
> > > > >
> > > > > See what you want. The innocents are rescued by kicking a worker off.
> > > > > Only for thoughts.
> > > > >
> > > > > --- x/net/unix/garbage.c
> > > > > +++ y/net/unix/garbage.c
> > > > > @@ -86,7 +86,6 @@
> > > > >  /* Internal data structures and random procedures: */
> > > > >
> > > > >  static LIST_HEAD(gc_candidates);
> > > > > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
> > > > >
> > > > >  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
> > > > >                           struct sk_buff_head *hitlist)
> > > > > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc
> > > > >                 list_move_tail(&u->link, &gc_candidates);
> > > > >  }
> > > > >
> > > > > -static bool gc_in_progress;
> > > > > +static void __unix_gc(struct work_struct *w);
> > > > > +static DECLARE_WORK(unix_gc_work, __unix_gc);
> > > > > +
> > > > >  #define UNIX_INFLIGHT_TRIGGER_GC 16000
> > > > >
> > > > >  void wait_for_unix_gc(void)
> > > > >  {
> > > > >         /* If number of inflight sockets is insane,
> > > > > -        * force a garbage collect right now.
> > > > > -        * Paired with the WRITE_ONCE() in unix_inflight(),
> > > > > -        * unix_notinflight() and gc_in_progress().
> > > > > -        */
> > > > > -       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
> > > > > -           !READ_ONCE(gc_in_progress))
> > > > > -               unix_gc();
> > > > > -       wait_event(unix_gc_wait, gc_in_progress == false);
> > > > > +        * kick a garbage collect right now.
> > > > > +        *
> > > > > +        * todo s/wait_for_unix_gc/kick_unix_gc/
> > > > > +        */
> > > > > +       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2)
> > > > > +               queue_work(system_unbound_wq, &unix_gc_work);
> > > > >  }
> > > > >
> > > > > -/* The external entry point: unix_gc() */
> > > > > -void unix_gc(void)
> > > > > +static DEFINE_MUTEX(unix_gc_mutex);
> > > > > +
> > > > > +static void __unix_gc(struct work_struct *w)
> > > > >  {
> > > > >         struct sk_buff *next_skb, *skb;
> > > > >         struct unix_sock *u;
> > > > > @@ -211,15 +211,10 @@ void unix_gc(void)
> > > > >         struct list_head cursor;
> > > > >         LIST_HEAD(not_cycle_list);
> > > > >
> > > > > +       if (!mutex_trylock(&unix_gc_mutex))
> > > > > +               return;
> > > > >         spin_lock(&unix_gc_lock);
> > > > >
> > > > > -       /* Avoid a recursive GC. */
> > > > > -       if (gc_in_progress)
> > > > > -               goto out;
> > > > > -
> > > > > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > > > -       WRITE_ONCE(gc_in_progress, true);
> > > > > -
> > > > >         /* First, select candidates for garbage collection.  Only
> > > > >          * in-flight sockets are considered, and from those only ones
> > > > >          * which don't have any external reference.
> > > > > @@ -325,11 +320,12 @@ void unix_gc(void)
> > > > >         /* All candidates should have been detached by now. */
> > > > >         BUG_ON(!list_empty(&gc_candidates));
> > > > >
> > > > > -       /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > > > -       WRITE_ONCE(gc_in_progress, false);
> > > > > -
> > > > > -       wake_up(&unix_gc_wait);
> > > > > -
> > > > > - out:
> > > > >         spin_unlock(&unix_gc_lock);
> > > > > +       mutex_unlock(&unix_gc_mutex);
> > > > > +}
> > > > > +
> > > > > +/* The external entry point: unix_gc() */
> > > > > +void unix_gc(void)
> > > > > +{
> > > > > +       __unix_gc(NULL);
> > > > >  }
> > > > > --
> > > >
> > > > This one results in less overall load than Kuniyuki's proposed patch
> > > > with my repro:
> > > >
> > > > * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/
> > > >
> > > > My guess is that's because my repro is the one that is getting penalized there.
> > >
> > > Thanks for testing, and yes.
> > >
> > > It would be good to split the repro to one offender and one normal
> > > process, run them on different users, and measure load on the normal
> > > process.
> > >
> > >
> > > > There's still a lot work done in unix_release_sock here, where GC runs
> > > > as long as you have any fds inflight:
> > > >
> > > > * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670
> > > >
> > > > Perhaps it can be improved.
> > >
> > > Yes, it also can be done async by worker as done in my first patch.
> > > I replaced schedule_work() with queue_work() to avoid using system_wq
> > > as gc could take long.
> > >
> > > Could you try this ?
> >
> > Apologies for the long wait, I was OOO.
> >
> > A bit of a problem here is that unix_gc is called unconditionally in
> > unix_release_sock.
>
> unix_release_sock() calls gc only when there is a inflight socket.
>
>
> > It's done asynchronously now and it can only
> > consume a single CPU with your changes, which is a lot better than
> > before. I am wondering if we can still do better to only trigger gc
> > when it touches unix sockets that have inflight fds.
> >
> > Commit 3c32da19a858 ("unix: Show number of pending scm files of
> > receive queue in fdinfo") added "struct scm_stat" to "struct
> > unix_sock", which can be used to check for the number of inflight fds
> > in the unix socket. Can we use that to drive the GC?
>
> I don't think the stat is useful to trigger gc.  Unless the receiver
> is accessible via sendmsg(), it's not a gc candidate and we need not
> care about its stats about valid FDs that are ready to be processed
> and never cleaned up by gc until close().

I'm not quite following why it's not useful, could you elaborate? The
GC call today is conditioned by 16k system-wide inflight unix fds:

* https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L2204
* https://github.com/torvalds/linux/blob/master/net/unix/garbage.c#L191

The only way checking for inflight fds is worse is if there non-unix
fds inflight. This can be easily alleviated by checking for non-zero
inflight fds on the socket + system-wide unix fds check, making it
better than just checking the system-wide gauge. What am I missing?

> > I think we can:
> >
> > * Trigger unix_gc from unix_release_sock if there's a non-zero number
> > of inflight fds in the socket being destroyed.
>
> This is the case of now.

It's not, unless I'm misunderstanding something. The check today is
for any inflight fds _system-wide_:

* https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L684-L685

My suggestion is to check this _per socket_, making sockets that do
not pass any fds (likely an overwhelming majority) unaffected.

> > * Trigger wait_for_unix_gc from the write path only if the write
> > contains fds or if the socket contains inflight fds. Alternatively, we
> > can run gc at the end of the write path and only check for inflight
> > fds on the socket there, which is probably simpler.
>
> I don't think it's better to call gc at the end of sendmsg() as there
> would be small chance to consume memory compared to running gc in the
> beginning of sendmsg().

There is no guarantee that GC frees any memory either way.

> > GC only applies to unix sockets inflight of other unix sockets, so GC
> > can still affect sockets passing regular fds around, but it wouldn't
> > affect non-fd-passing unix sockets, which are usually in the data
> > path.
>
> I think we can run gc only when scm contains AF_UNIX sockets by tweaking
> a little bit scm processing.

That's even better than what I'm proposing (all inflight fds -> just
unix inflight fds being the difference), but probably requires a bit
more code changes. I'm happy to test a patch when you have it.

> > This way we don't have to check for per-user inflight fds, which means
> > that services running as "nobody" wouldn't be punished for other
> > services running as "nobody" screwing up.
>
> If we do not check user, a user could send 16000 AF_UNIX fds and
> other innocent users sending fds must wait gc.

In my proposal we check for infight fds per socket and if there are
none, there is no GC at all.

> I think isolating users would make more sense so you can sandbox
> a suspicious process.

Sure, but that's somewhat beside the point. Ideally things should work
well outside the box.

> Probably we can move flush_work() in the preceding if.  Then, the
> number of gc invocation in the "nobody" case is the same as before.

Just to make sure I get my point across: I want to have no GC if a
socket does not possibly contribute any garbage to be collected. If my
program just passes bytes back and forth and churns through unix
sockets, it shouldn't be penalized with GC and it shouldn't try to
schedule GC, no matter what else is going on in the system

Hope this makes sense.

> ---8<---
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 51f30f89bacb..74fc208c8858 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -198,12 +198,13 @@ void wait_for_unix_gc(void)
>          * Paired with the WRITE_ONCE() in unix_inflight(),
>          * unix_notinflight().
>          */
> -       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
> +       if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC) {
>                 queue_work(system_unbound_wq, &unix_gc_work);
>
> -       /* Penalise senders of not-yet-received-fd */
> -       if (READ_ONCE(user->unix_inflight))
> -               flush_work(&unix_gc_work);
> +               /* Penalise senders of not-yet-received-fd */
> +               if (READ_ONCE(user->unix_inflight))
> +                       flush_work(&unix_gc_work);
> +       }
>  }
>
>  static void __unix_gc(struct work_struct *work)
> ---8<---
>
>
> >
> > Does this sound like a reasonable approach?
> >
> [...]
> > > -static bool gc_in_progress;
> > > -#define UNIX_INFLIGHT_TRIGGER_GC 16000
> > > +#define UNIX_INFLIGHT_TRIGGER_GC 16
> >
> > It's probably best to keep it at 16k.
>
> Oops, this is just for testing on my local machine easily :p
>
> Anyway, I'll post a formal patch this week.
>
> Thanks!

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-11-20 22:30               ` Ivan Babrou
@ 2023-11-20 23:53                 ` Kuniyuki Iwashima
  0 siblings, 0 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2023-11-20 23:53 UTC (permalink / raw)
  To: ivan; +Cc: edumazet, hdanton, kernel-team, kuniyu, linux-kernel, netdev, pabeni

From: Ivan Babrou <ivan@cloudflare.com>
Date: Mon, 20 Nov 2023 14:30:52 -0800
> > > A bit of a problem here is that unix_gc is called unconditionally in
> > > unix_release_sock.
> >
> > unix_release_sock() calls gc only when there is a inflight socket.
> >
> >
> > > It's done asynchronously now and it can only
> > > consume a single CPU with your changes, which is a lot better than
> > > before. I am wondering if we can still do better to only trigger gc
> > > when it touches unix sockets that have inflight fds.
> > >
> > > Commit 3c32da19a858 ("unix: Show number of pending scm files of
> > > receive queue in fdinfo") added "struct scm_stat" to "struct
> > > unix_sock", which can be used to check for the number of inflight fds
> > > in the unix socket. Can we use that to drive the GC?
> >
> > I don't think the stat is useful to trigger gc.  Unless the receiver
> > is accessible via sendmsg(), it's not a gc candidate and we need not
> > care about its stats about valid FDs that are ready to be processed
> > and never cleaned up by gc until close().
> 
> I'm not quite following why it's not useful, could you elaborate? The
> GC call today is conditioned by 16k system-wide inflight unix fds:
> 
> * https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L2204
> * https://github.com/torvalds/linux/blob/master/net/unix/garbage.c#L191
> 
> The only way checking for inflight fds is worse is if there non-unix
> fds inflight. This can be easily alleviated by checking for non-zero
> inflight fds on the socket + system-wide unix fds check, making it
> better than just checking the system-wide gauge. What am I missing?

unix_tot_inflight is the number of inflight AF_UNIX socket, excluding
non-unix fds.  (See unix_inflight())

Let's say sk-A passes its fd to sk-B, and vice versa, and then, we
close sk-A and sk-B.  Each fd passing increments the refcnt of both
"file", so unix_release_sock() is not called for _both_ sockets.

And if we check each socket's inflight fd to trigger gc, and no one
sends fd, we never trigger gc for sk-A and sk-B.  In other words,
unix_gc() call in unix_release_sock() is basically for _other_ sockets,
not for the socket being destroyed there.

So, here we need to check the number of inflight fds in other sockets.
We needed to clean up all inflight dead fd when we close() all AF_UNIX
sockets (because AF_UNIX could be modulised and unloaded).

It can no longer be a module, so technically, we need not run gc at
close() now, but it would be better to not leave the dead fd so long.
We need not wait gc though.

> 
> > > I think we can:
> > >
> > > * Trigger unix_gc from unix_release_sock if there's a non-zero number
> > > of inflight fds in the socket being destroyed.
> >
> > This is the case of now.
> 
> It's not, unless I'm misunderstanding something.

Sorry, I just skipped "in the socket being destroyed", but as mentioned
above, we need to check the system-wide unix_tot_inflight anyway.


> The check today is
> for any inflight fds _system-wide_:
> 
> * https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L684-L685
> 
> My suggestion is to check this _per socket_, making sockets that do
> not pass any fds (likely an overwhelming majority) unaffected.
> 
> > > * Trigger wait_for_unix_gc from the write path only if the write
> > > contains fds or if the socket contains inflight fds. Alternatively, we
> > > can run gc at the end of the write path and only check for inflight
> > > fds on the socket there, which is probably simpler.
> >
> > I don't think it's better to call gc at the end of sendmsg() as there
> > would be small chance to consume memory compared to running gc in the
> > beginning of sendmsg().
> 
> There is no guarantee that GC frees any memory either way.

Right.

> 
> > > GC only applies to unix sockets inflight of other unix sockets, so GC
> > > can still affect sockets passing regular fds around, but it wouldn't
> > > affect non-fd-passing unix sockets, which are usually in the data
> > > path.
> >
> > I think we can run gc only when scm contains AF_UNIX sockets by tweaking
> > a little bit scm processing.
> 
> That's even better than what I'm proposing (all inflight fds -> just
> unix inflight fds being the difference), but probably requires a bit
> more code changes. I'm happy to test a patch when you have it.
> 
> > > This way we don't have to check for per-user inflight fds, which means
> > > that services running as "nobody" wouldn't be punished for other
> > > services running as "nobody" screwing up.
> >
> > If we do not check user, a user could send 16000 AF_UNIX fds and
> > other innocent users sending fds must wait gc.
> 
> In my proposal we check for infight fds per socket and if there are
> none, there is no GC at all.

Again, we need to check the system wide unix_tot_inflight at least
for other sockets that is already close()d and not accessible from
user.  (If we can check the per-socket stat in sendmsg(), the receiver
socket must have its valid fd and is not a gc candidate.)

So, we need to schedule gc when unix_tot_inflight > 16K anyway, and
we need to decide who should wait.  If user has not-yet-received fd,
it could be in the dead fd's recv queue, so the user should wait.

OTOH, the fds that you are suggesting to check are still valid and
not dead at least.

  * if the process is sending AF_UNIX fds
  * if the receiver has inflight fds in recv queue

Botth cases could contribute to trigger gc, but there is less certainty
than user's inflight count.


> 
> > I think isolating users would make more sense so you can sandbox
> > a suspicious process.
> 
> Sure, but that's somewhat beside the point. Ideally things should work
> well outside the box.
> 
> > Probably we can move flush_work() in the preceding if.  Then, the
> > number of gc invocation in the "nobody" case is the same as before.
> 
> Just to make sure I get my point across: I want to have no GC if a
> socket does not possibly contribute any garbage to be collected. If my
> program just passes bytes back and forth and churns through unix
> sockets, it shouldn't be penalized with GC and it shouldn't try to
> schedule GC, no matter what else is going on in the system
> 
> Hope this makes sense.

Yes, that makes sense, and I understnad that sendmsg() with no fds need
not wait, but the point is who is responsible to wait for gc to finish.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-10-19 22:35 wait_for_unix_gc can cause CPU overload for well behaved programs Ivan Babrou
       [not found] ` <20231020104728.2060-1-hdanton@sina.com>
@ 2023-10-20 22:05 ` Kuniyuki Iwashima
  1 sibling, 0 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-20 22:05 UTC (permalink / raw)
  To: ivan; +Cc: edumazet, kernel-team, kuba, linux-kernel, netdev, pabeni, kuniyu

From: Ivan Babrou <ivan@cloudflare.com>
Date: Thu, 19 Oct 2023 15:35:01 -0700
> Hello,
> 
> We have observed this issue twice (2019 and 2023): a well behaved
> service that doesn't pass any file descriptors around starts to spend
> a ton of CPU time in wait_for_unix_gc.
> 
> The cause of this is that the unix send path unconditionally calls
> wait_for_unix_gc, which is a global garbage collection. If any
> misbehaved program exists on a system, it can force extra work for
> well behaved programs.
> 
> This behavior is not new: 9915672d4127 ("af_unix: limit
> unix_tot_inflight") is from 2010.
> 
> I managed to come up with a repro for this behavior:
> 
> * https://gist.github.com/bobrik/82e5722261920c9f23d9402b88a0bb27
> 
> It also includes a flamegraph illustrating the issue. It's all in one
> program for convenience, but in reality the offender not picking up
> SCM_RIGHTS messages and the suffering program just minding its own
> business are separate.
> 
> It is also non-trivial to find the offender when this happens as it
> can be completely idle while wrecking havoc for the rest of the
> system.
> 
> I don't think it's fair to penalize every unix_stream_sendmsg like
> this. The 16k threshold also doesn't feel very flexible, surely
> computers are bigger these days and can handle more.

Probably we could do the gc async and enforce the penalty only on
the offender by checking user->unix_inflight.

compile test only:

---8<---
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 824c258143a3..a119f37953cc 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -12,8 +12,9 @@ void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_destruct_scm(struct sk_buff *skb);
 void io_uring_destruct_scm(struct sk_buff *skb);
-void unix_gc(void);
 void wait_for_unix_gc(void);
+void unix_gc_start(void);
+void unix_gc_stop(void);
 struct sock *unix_get_socket(struct file *filp);
 struct sock *unix_peer_get(struct sock *sk);
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..56db096b13f1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -681,7 +681,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	 */
 
 	if (READ_ONCE(unix_tot_inflight))
-		unix_gc();		/* Garbage collect fds */
+		unix_gc_start(); /* Garbage collect fds */
 }
 
 static void init_peercred(struct sock *sk)
@@ -3683,6 +3683,7 @@ static int __init af_unix_init(void)
 
 static void __exit af_unix_exit(void)
 {
+	unix_gc_stop();
 	sock_unregister(PF_UNIX);
 	proto_unregister(&unix_dgram_proto);
 	proto_unregister(&unix_stream_proto);
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 2405f0f9af31..fb24d62fe34a 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -185,24 +185,26 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 		list_move_tail(&u->link, &gc_candidates);
 }
 
-static bool gc_in_progress;
 #define UNIX_INFLIGHT_TRIGGER_GC 16000
 
+static void unix_gc(struct work_struct *work);
+static DECLARE_WORK(unix_gc_work, unix_gc);
+
 void wait_for_unix_gc(void)
 {
-	/* If number of inflight sockets is insane,
-	 * force a garbage collect right now.
-	 * Paired with the WRITE_ONCE() in unix_inflight(),
-	 * unix_notinflight() and gc_in_progress().
-	 */
-	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
-	    !READ_ONCE(gc_in_progress))
-		unix_gc();
-	wait_event(unix_gc_wait, gc_in_progress == false);
+	struct user_struct *user = get_uid(current_user());
+
+	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
+		schedule_work(&unix_gc_work);
+
+	if (!READ_ONCE(user->unix_inflight))
+		return;
+
+	flush_work(&unix_gc_work);
 }
 
 /* The external entry point: unix_gc() */
-void unix_gc(void)
+static void unix_gc(struct work_struct *work)
 {
 	struct sk_buff *next_skb, *skb;
 	struct unix_sock *u;
@@ -213,13 +215,6 @@ void unix_gc(void)
 
 	spin_lock(&unix_gc_lock);
 
-	/* Avoid a recursive GC. */
-	if (gc_in_progress)
-		goto out;
-
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
-	WRITE_ONCE(gc_in_progress, true);
-
 	/* First, select candidates for garbage collection.  Only
 	 * in-flight sockets are considered, and from those only ones
 	 * which don't have any external reference.
@@ -325,11 +320,15 @@ void unix_gc(void)
 	/* All candidates should have been detached by now. */
 	BUG_ON(!list_empty(&gc_candidates));
 
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
-	WRITE_ONCE(gc_in_progress, false);
+	spin_unlock(&unix_gc_lock);
+}
 
-	wake_up(&unix_gc_wait);
+void unix_gc_start(void)
+{
+	schedule_work(&unix_gc_work);
+}
 
- out:
-	spin_unlock(&unix_gc_lock);
+void __exit unix_gc_stop(void)
+{
+	flush_work(&unix_gc_work);
 }
---8<---

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-11-20 23:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-19 22:35 wait_for_unix_gc can cause CPU overload for well behaved programs Ivan Babrou
     [not found] ` <20231020104728.2060-1-hdanton@sina.com>
2023-10-20 17:25   ` Ivan Babrou
     [not found]     ` <20231021012322.1799-1-hdanton@sina.com>
2023-10-23 23:22       ` Ivan Babrou
2023-10-23 23:45         ` Kuniyuki Iwashima
2023-11-17 23:38           ` Ivan Babrou
2023-11-20 19:29             ` Kuniyuki Iwashima
2023-11-20 22:30               ` Ivan Babrou
2023-11-20 23:53                 ` Kuniyuki Iwashima
2023-10-20 22:05 ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).