* wait_for_unix_gc can cause CPU overload for well behaved programs
@ 2023-10-19 22:35 Ivan Babrou
  [not found] ` <20231020104728.2060-1-hdanton@sina.com>
  2023-10-20 22:05 ` Kuniyuki Iwashima
  0 siblings, 2 replies; 9+ messages in thread

From: Ivan Babrou @ 2023-10-19 22:35 UTC (permalink / raw)
To: Linux Kernel Network Developers
Cc: kernel-team, Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel

Hello,

We have observed this issue twice (2019 and 2023): a well behaved service that doesn't pass any file descriptors around starts to spend a ton of CPU time in wait_for_unix_gc.

The cause of this is that the unix send path unconditionally calls wait_for_unix_gc, which is a global garbage collection. If any misbehaved program exists on a system, it can force extra work for well behaved programs. This behavior is not new: 9915672d4127 ("af_unix: limit unix_tot_inflight") is from 2010.

I managed to come up with a repro for this behavior:

* https://gist.github.com/bobrik/82e5722261920c9f23d9402b88a0bb27

It also includes a flamegraph illustrating the issue. It's all in one program for convenience, but in reality the offender not picking up SCM_RIGHTS messages and the suffering program just minding its own business are separate. It is also non-trivial to find the offender when this happens, as it can be completely idle while wreaking havoc for the rest of the system.

I don't think it's fair to penalize every unix_stream_sendmsg like this. The 16k threshold also doesn't feel very flexible; surely computers are bigger these days and can handle more.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs [not found] ` <20231020104728.2060-1-hdanton@sina.com> @ 2023-10-20 17:25 ` Ivan Babrou [not found] ` <20231021012322.1799-1-hdanton@sina.com> 0 siblings, 1 reply; 9+ messages in thread From: Ivan Babrou @ 2023-10-20 17:25 UTC (permalink / raw) To: Hillf Danton Cc: Linux Kernel Network Developers, kernel-team, Eric Dumazet, Paolo Abeni, linux-kernel On Fri, Oct 20, 2023 at 3:47 AM Hillf Danton <hdanton@sina.com> wrote: > > On Thu, 19 Oct 2023 15:35:01 -0700 Ivan Babrou <ivan@cloudflare.com> > > Hello, > > > > We have observed this issue twice (2019 and 2023): a well behaved > > service that doesn't pass any file descriptors around starts to spend > > a ton of CPU time in wait_for_unix_gc. > > See if the diff below works for you, which prevents concurrent spinning > of unix_gc_lock, a variant of spin_trylock(). > > Hillf > --- x/net/unix/garbage.c > +++ y/net/unix/garbage.c > @@ -211,15 +211,10 @@ void unix_gc(void) > struct list_head cursor; > LIST_HEAD(not_cycle_list); > > + if (test_and_set_bit(0, &gc_in_progress)) > + return; > spin_lock(&unix_gc_lock); > > - /* Avoid a recursive GC. */ > - if (gc_in_progress) > - goto out; > - > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > - WRITE_ONCE(gc_in_progress, true); > - > /* First, select candidates for garbage collection. Only > * in-flight sockets are considered, and from those only ones > * which don't have any external reference. > -- This could solve wait_for_unix_gc spinning, but it wouldn't affect unix_gc itself, from what I understand. There would always be one socket writer or destroyer punished by running the gc still. My linked repro code exercises that path rather than the waiting spinlock (there's a single writer thread), so it's something you can see for yourself. 
Your patch doesn't build, so I wasn't able to try it out:

/build/linux-source/net/unix/garbage.c: In function 'unix_gc':
/build/linux-source/net/unix/garbage.c:214:33: error: passing argument 2 of 'test_and_set_bit' from incompatible pointer type [-Werror=incompatible-pointer-types]
  214 |         if (test_and_set_bit(0, &gc_in_progress))
      |                                 ^~~~~~~~~~~~~~~
      |                                 |
      |                                 bool * {aka _Bool *}
In file included from /build/linux-source/include/asm-generic/bitops/atomic.h:68,
                 from /build/linux-source/arch/arm64/include/asm/bitops.h:25,
                 from /build/linux-source/include/linux/bitops.h:68,
                 from /build/linux-source/include/linux/kernel.h:22,
                 from /build/linux-source/net/unix/garbage.c:66:
/build/linux-source/include/asm-generic/bitops/instrumented-atomic.h:68:79: note: expected 'volatile long unsigned int *' but argument is of type 'bool *' {aka '_Bool *'}
   68 | static __always_inline bool test_and_set_bit(long nr, volatile unsigned long *addr)
      |                                                       ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
/build/linux-source/net/unix/garbage.c:328:2: error: label 'out' defined but not used [-Werror=unused-label]
  328 |  out:
      |  ^~~
cc1: all warnings being treated as errors
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs [not found] ` <20231021012322.1799-1-hdanton@sina.com> @ 2023-10-23 23:22 ` Ivan Babrou 2023-10-23 23:45 ` Kuniyuki Iwashima 0 siblings, 1 reply; 9+ messages in thread From: Ivan Babrou @ 2023-10-23 23:22 UTC (permalink / raw) To: Hillf Danton Cc: Linux Kernel Network Developers, kernel-team, Eric Dumazet, Paolo Abeni, Kuniyuki Iwashima, linux-kernel On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote: > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com> > > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect > > unix_gc itself, from what I understand. There would always be one > > socket writer or destroyer punished by running the gc still. > > See what you want. The innocents are rescued by kicking a worker off. > Only for thoughts. > > --- x/net/unix/garbage.c > +++ y/net/unix/garbage.c > @@ -86,7 +86,6 @@ > /* Internal data structures and random procedures: */ > > static LIST_HEAD(gc_candidates); > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait); > > static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), > struct sk_buff_head *hitlist) > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc > list_move_tail(&u->link, &gc_candidates); > } > > -static bool gc_in_progress; > +static void __unix_gc(struct work_struct *w); > +static DECLARE_WORK(unix_gc_work, __unix_gc); > + > #define UNIX_INFLIGHT_TRIGGER_GC 16000 > > void wait_for_unix_gc(void) > { > /* If number of inflight sockets is insane, > - * force a garbage collect right now. > - * Paired with the WRITE_ONCE() in unix_inflight(), > - * unix_notinflight() and gc_in_progress(). > - */ > - if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC && > - !READ_ONCE(gc_in_progress)) > - unix_gc(); > - wait_event(unix_gc_wait, gc_in_progress == false); > + * kick a garbage collect right now. 
> + * > + * todo s/wait_for_unix_gc/kick_unix_gc/ > + */ > + if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2) > + queue_work(system_unbound_wq, &unix_gc_work); > } > > -/* The external entry point: unix_gc() */ > -void unix_gc(void) > +static DEFINE_MUTEX(unix_gc_mutex); > + > +static void __unix_gc(struct work_struct *w) > { > struct sk_buff *next_skb, *skb; > struct unix_sock *u; > @@ -211,15 +211,10 @@ void unix_gc(void) > struct list_head cursor; > LIST_HEAD(not_cycle_list); > > + if (!mutex_trylock(&unix_gc_mutex)) > + return; > spin_lock(&unix_gc_lock); > > - /* Avoid a recursive GC. */ > - if (gc_in_progress) > - goto out; > - > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > - WRITE_ONCE(gc_in_progress, true); > - > /* First, select candidates for garbage collection. Only > * in-flight sockets are considered, and from those only ones > * which don't have any external reference. > @@ -325,11 +320,12 @@ void unix_gc(void) > /* All candidates should have been detached by now. */ > BUG_ON(!list_empty(&gc_candidates)); > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > - WRITE_ONCE(gc_in_progress, false); > - > - wake_up(&unix_gc_wait); > - > - out: > spin_unlock(&unix_gc_lock); > + mutex_unlock(&unix_gc_mutex); > +} > + > +/* The external entry point: unix_gc() */ > +void unix_gc(void) > +{ > + __unix_gc(NULL); > } > -- This one results in less overall load than Kuniyuki's proposed patch with my repro: * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/ My guess is that's because my repro is the one that is getting penalized there. There's still a lot work done in unix_release_sock here, where GC runs as long as you have any fds inflight: * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670 Perhaps it can be improved. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs 2023-10-23 23:22 ` Ivan Babrou @ 2023-10-23 23:45 ` Kuniyuki Iwashima 2023-11-17 23:38 ` Ivan Babrou 0 siblings, 1 reply; 9+ messages in thread From: Kuniyuki Iwashima @ 2023-10-23 23:45 UTC (permalink / raw) To: ivan; +Cc: edumazet, hdanton, kernel-team, kuniyu, linux-kernel, netdev, pabeni From: Ivan Babrou <ivan@cloudflare.com> Date: Mon, 23 Oct 2023 16:22:35 -0700 > On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote: > > > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com> > > > > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect > > > unix_gc itself, from what I understand. There would always be one > > > socket writer or destroyer punished by running the gc still. > > > > See what you want. The innocents are rescued by kicking a worker off. > > Only for thoughts. > > > > --- x/net/unix/garbage.c > > +++ y/net/unix/garbage.c > > @@ -86,7 +86,6 @@ > > /* Internal data structures and random procedures: */ > > > > static LIST_HEAD(gc_candidates); > > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait); > > > > static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), > > struct sk_buff_head *hitlist) > > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc > > list_move_tail(&u->link, &gc_candidates); > > } > > > > -static bool gc_in_progress; > > +static void __unix_gc(struct work_struct *w); > > +static DECLARE_WORK(unix_gc_work, __unix_gc); > > + > > #define UNIX_INFLIGHT_TRIGGER_GC 16000 > > > > void wait_for_unix_gc(void) > > { > > /* If number of inflight sockets is insane, > > - * force a garbage collect right now. > > - * Paired with the WRITE_ONCE() in unix_inflight(), > > - * unix_notinflight() and gc_in_progress(). 
> > - */ > > - if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC && > > - !READ_ONCE(gc_in_progress)) > > - unix_gc(); > > - wait_event(unix_gc_wait, gc_in_progress == false); > > + * kick a garbage collect right now. > > + * > > + * todo s/wait_for_unix_gc/kick_unix_gc/ > > + */ > > + if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2) > > + queue_work(system_unbound_wq, &unix_gc_work); > > } > > > > -/* The external entry point: unix_gc() */ > > -void unix_gc(void) > > +static DEFINE_MUTEX(unix_gc_mutex); > > + > > +static void __unix_gc(struct work_struct *w) > > { > > struct sk_buff *next_skb, *skb; > > struct unix_sock *u; > > @@ -211,15 +211,10 @@ void unix_gc(void) > > struct list_head cursor; > > LIST_HEAD(not_cycle_list); > > > > + if (!mutex_trylock(&unix_gc_mutex)) > > + return; > > spin_lock(&unix_gc_lock); > > > > - /* Avoid a recursive GC. */ > > - if (gc_in_progress) > > - goto out; > > - > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > > - WRITE_ONCE(gc_in_progress, true); > > - > > /* First, select candidates for garbage collection. Only > > * in-flight sockets are considered, and from those only ones > > * which don't have any external reference. > > @@ -325,11 +320,12 @@ void unix_gc(void) > > /* All candidates should have been detached by now. */ > > BUG_ON(!list_empty(&gc_candidates)); > > > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > > - WRITE_ONCE(gc_in_progress, false); > > - > > - wake_up(&unix_gc_wait); > > - > > - out: > > spin_unlock(&unix_gc_lock); > > + mutex_unlock(&unix_gc_mutex); > > +} > > + > > +/* The external entry point: unix_gc() */ > > +void unix_gc(void) > > +{ > > + __unix_gc(NULL); > > } > > -- > > This one results in less overall load than Kuniyuki's proposed patch > with my repro: > > * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/ > > My guess is that's because my repro is the one that is getting penalized there. Thanks for testing, and yes. 
It would be good to split the repro to one offender and one normal process, run them on different users, and measure load on the normal process. > There's still a lot work done in unix_release_sock here, where GC runs > as long as you have any fds inflight: > > * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670 > > Perhaps it can be improved. Yes, it also can be done async by worker as done in my first patch. I replaced schedule_work() with queue_work() to avoid using system_wq as gc could take long. Could you try this ? ---8<--- diff --git a/include/net/af_unix.h b/include/net/af_unix.h index 824c258143a3..3b38e21116f1 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -13,6 +13,7 @@ void unix_notinflight(struct user_struct *user, struct file *fp); void unix_destruct_scm(struct sk_buff *skb); void io_uring_destruct_scm(struct sk_buff *skb); void unix_gc(void); +void unix_gc_flush(void); void wait_for_unix_gc(void); struct sock *unix_get_socket(struct file *filp); struct sock *unix_peer_get(struct sock *sk); diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 3e8a04a13668..ed3251753417 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -3683,6 +3683,7 @@ static int __init af_unix_init(void) static void __exit af_unix_exit(void) { + unix_gc_flush(); sock_unregister(PF_UNIX); proto_unregister(&unix_dgram_proto); proto_unregister(&unix_stream_proto); diff --git a/net/unix/garbage.c b/net/unix/garbage.c index 2405f0f9af31..51f30f89bacb 100644 --- a/net/unix/garbage.c +++ b/net/unix/garbage.c @@ -86,7 +86,9 @@ /* Internal data structures and random procedures: */ static LIST_HEAD(gc_candidates); -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait); + +static void __unix_gc(struct work_struct *work); +static DECLARE_WORK(unix_gc_work, __unix_gc); static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), struct sk_buff_head *hitlist) @@ -185,24 +187,26 @@ static void inc_inflight_move_tail(struct unix_sock *u) 
list_move_tail(&u->link, &gc_candidates); } -static bool gc_in_progress; -#define UNIX_INFLIGHT_TRIGGER_GC 16000 +#define UNIX_INFLIGHT_TRIGGER_GC 16 void wait_for_unix_gc(void) { + struct user_struct *user = get_uid(current_user()); + /* If number of inflight sockets is insane, - * force a garbage collect right now. + * kick a garbage collect right now. * Paired with the WRITE_ONCE() in unix_inflight(), - * unix_notinflight() and gc_in_progress(). + * unix_notinflight(). */ - if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC && - !READ_ONCE(gc_in_progress)) - unix_gc(); - wait_event(unix_gc_wait, gc_in_progress == false); + if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC) + queue_work(system_unbound_wq, &unix_gc_work); + + /* Penalise senders of not-yet-received-fd */ + if (READ_ONCE(user->unix_inflight)) + flush_work(&unix_gc_work); } -/* The external entry point: unix_gc() */ -void unix_gc(void) +static void __unix_gc(struct work_struct *work) { struct sk_buff *next_skb, *skb; struct unix_sock *u; @@ -213,13 +217,6 @@ void unix_gc(void) spin_lock(&unix_gc_lock); - /* Avoid a recursive GC. */ - if (gc_in_progress) - goto out; - - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ - WRITE_ONCE(gc_in_progress, true); - /* First, select candidates for garbage collection. Only * in-flight sockets are considered, and from those only ones * which don't have any external reference. @@ -325,11 +322,15 @@ void unix_gc(void) /* All candidates should have been detached by now. */ BUG_ON(!list_empty(&gc_candidates)); - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ - WRITE_ONCE(gc_in_progress, false); + spin_unlock(&unix_gc_lock); +} - wake_up(&unix_gc_wait); +void unix_gc(void) +{ + queue_work(system_unbound_wq, &unix_gc_work); +} - out: - spin_unlock(&unix_gc_lock); +void __exit unix_gc_flush(void) +{ + cancel_work_sync(&unix_gc_work); } ---8<--- ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs 2023-10-23 23:45 ` Kuniyuki Iwashima @ 2023-11-17 23:38 ` Ivan Babrou 2023-11-20 19:29 ` Kuniyuki Iwashima 0 siblings, 1 reply; 9+ messages in thread From: Ivan Babrou @ 2023-11-17 23:38 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: edumazet, hdanton, kernel-team, linux-kernel, netdev, pabeni On Mon, Oct 23, 2023 at 4:46 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Ivan Babrou <ivan@cloudflare.com> > Date: Mon, 23 Oct 2023 16:22:35 -0700 > > On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote: > > > > > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com> > > > > > > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect > > > > unix_gc itself, from what I understand. There would always be one > > > > socket writer or destroyer punished by running the gc still. > > > > > > See what you want. The innocents are rescued by kicking a worker off. > > > Only for thoughts. > > > > > > --- x/net/unix/garbage.c > > > +++ y/net/unix/garbage.c > > > @@ -86,7 +86,6 @@ > > > /* Internal data structures and random procedures: */ > > > > > > static LIST_HEAD(gc_candidates); > > > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait); > > > > > > static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), > > > struct sk_buff_head *hitlist) > > > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc > > > list_move_tail(&u->link, &gc_candidates); > > > } > > > > > > -static bool gc_in_progress; > > > +static void __unix_gc(struct work_struct *w); > > > +static DECLARE_WORK(unix_gc_work, __unix_gc); > > > + > > > #define UNIX_INFLIGHT_TRIGGER_GC 16000 > > > > > > void wait_for_unix_gc(void) > > > { > > > /* If number of inflight sockets is insane, > > > - * force a garbage collect right now. > > > - * Paired with the WRITE_ONCE() in unix_inflight(), > > > - * unix_notinflight() and gc_in_progress(). 
> > > - */ > > > - if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC && > > > - !READ_ONCE(gc_in_progress)) > > > - unix_gc(); > > > - wait_event(unix_gc_wait, gc_in_progress == false); > > > + * kick a garbage collect right now. > > > + * > > > + * todo s/wait_for_unix_gc/kick_unix_gc/ > > > + */ > > > + if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2) > > > + queue_work(system_unbound_wq, &unix_gc_work); > > > } > > > > > > -/* The external entry point: unix_gc() */ > > > -void unix_gc(void) > > > +static DEFINE_MUTEX(unix_gc_mutex); > > > + > > > +static void __unix_gc(struct work_struct *w) > > > { > > > struct sk_buff *next_skb, *skb; > > > struct unix_sock *u; > > > @@ -211,15 +211,10 @@ void unix_gc(void) > > > struct list_head cursor; > > > LIST_HEAD(not_cycle_list); > > > > > > + if (!mutex_trylock(&unix_gc_mutex)) > > > + return; > > > spin_lock(&unix_gc_lock); > > > > > > - /* Avoid a recursive GC. */ > > > - if (gc_in_progress) > > > - goto out; > > > - > > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > > > - WRITE_ONCE(gc_in_progress, true); > > > - > > > /* First, select candidates for garbage collection. Only > > > * in-flight sockets are considered, and from those only ones > > > * which don't have any external reference. > > > @@ -325,11 +320,12 @@ void unix_gc(void) > > > /* All candidates should have been detached by now. */ > > > BUG_ON(!list_empty(&gc_candidates)); > > > > > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). 
*/ > > > - WRITE_ONCE(gc_in_progress, false); > > > - > > > - wake_up(&unix_gc_wait); > > > - > > > - out: > > > spin_unlock(&unix_gc_lock); > > > + mutex_unlock(&unix_gc_mutex); > > > +} > > > + > > > +/* The external entry point: unix_gc() */ > > > +void unix_gc(void) > > > +{ > > > + __unix_gc(NULL); > > > } > > > -- > > > > This one results in less overall load than Kuniyuki's proposed patch > > with my repro: > > > > * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/ > > > > My guess is that's because my repro is the one that is getting penalized there. > > Thanks for testing, and yes. > > It would be good to split the repro to one offender and one normal > process, run them on different users, and measure load on the normal > process. > > > > There's still a lot work done in unix_release_sock here, where GC runs > > as long as you have any fds inflight: > > > > * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670 > > > > Perhaps it can be improved. > > Yes, it also can be done async by worker as done in my first patch. > I replaced schedule_work() with queue_work() to avoid using system_wq > as gc could take long. > > Could you try this ? Apologies for the long wait, I was OOO. A bit of a problem here is that unix_gc is called unconditionally in unix_release_sock. It's done asynchronously now and it can only consume a single CPU with your changes, which is a lot better than before. I am wondering if we can still do better to only trigger gc when it touches unix sockets that have inflight fds. Commit 3c32da19a858 ("unix: Show number of pending scm files of receive queue in fdinfo") added "struct scm_stat" to "struct unix_sock", which can be used to check for the number of inflight fds in the unix socket. Can we use that to drive the GC? I think we can: * Trigger unix_gc from unix_release_sock if there's a non-zero number of inflight fds in the socket being destroyed. 
* Trigger wait_for_unix_gc from the write path only if the write contains fds or if the socket contains inflight fds. Alternatively, we can run gc at the end of the write path and only check for inflight fds on the socket there, which is probably simpler. GC only applies to unix sockets inflight of other unix sockets, so GC can still affect sockets passing regular fds around, but it wouldn't affect non-fd-passing unix sockets, which are usually in the data path. This way we don't have to check for per-user inflight fds, which means that services running as "nobody" wouldn't be punished for other services running as "nobody" screwing up. Does this sound like a reasonable approach? > ---8<--- > diff --git a/include/net/af_unix.h b/include/net/af_unix.h > index 824c258143a3..3b38e21116f1 100644 > --- a/include/net/af_unix.h > +++ b/include/net/af_unix.h > @@ -13,6 +13,7 @@ void unix_notinflight(struct user_struct *user, struct file *fp); > void unix_destruct_scm(struct sk_buff *skb); > void io_uring_destruct_scm(struct sk_buff *skb); > void unix_gc(void); > +void unix_gc_flush(void); > void wait_for_unix_gc(void); > struct sock *unix_get_socket(struct file *filp); > struct sock *unix_peer_get(struct sock *sk); > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > index 3e8a04a13668..ed3251753417 100644 > --- a/net/unix/af_unix.c > +++ b/net/unix/af_unix.c > @@ -3683,6 +3683,7 @@ static int __init af_unix_init(void) > > static void __exit af_unix_exit(void) > { > + unix_gc_flush(); > sock_unregister(PF_UNIX); > proto_unregister(&unix_dgram_proto); > proto_unregister(&unix_stream_proto); > diff --git a/net/unix/garbage.c b/net/unix/garbage.c > index 2405f0f9af31..51f30f89bacb 100644 > --- a/net/unix/garbage.c > +++ b/net/unix/garbage.c > @@ -86,7 +86,9 @@ > /* Internal data structures and random procedures: */ > > static LIST_HEAD(gc_candidates); > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait); > + > +static void __unix_gc(struct work_struct *work); > +static 
DECLARE_WORK(unix_gc_work, __unix_gc); > > static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), > struct sk_buff_head *hitlist) > @@ -185,24 +187,26 @@ static void inc_inflight_move_tail(struct unix_sock *u) > list_move_tail(&u->link, &gc_candidates); > } > > -static bool gc_in_progress; > -#define UNIX_INFLIGHT_TRIGGER_GC 16000 > +#define UNIX_INFLIGHT_TRIGGER_GC 16 It's probably best to keep it at 16k. > void wait_for_unix_gc(void) > { > + struct user_struct *user = get_uid(current_user()); > + > /* If number of inflight sockets is insane, > - * force a garbage collect right now. > + * kick a garbage collect right now. > * Paired with the WRITE_ONCE() in unix_inflight(), > - * unix_notinflight() and gc_in_progress(). > + * unix_notinflight(). > */ > - if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC && > - !READ_ONCE(gc_in_progress)) > - unix_gc(); > - wait_event(unix_gc_wait, gc_in_progress == false); > + if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC) > + queue_work(system_unbound_wq, &unix_gc_work); > + > + /* Penalise senders of not-yet-received-fd */ > + if (READ_ONCE(user->unix_inflight)) > + flush_work(&unix_gc_work); > } > > -/* The external entry point: unix_gc() */ > -void unix_gc(void) > +static void __unix_gc(struct work_struct *work) > { > struct sk_buff *next_skb, *skb; > struct unix_sock *u; > @@ -213,13 +217,6 @@ void unix_gc(void) > > spin_lock(&unix_gc_lock); > > - /* Avoid a recursive GC. */ > - if (gc_in_progress) > - goto out; > - > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > - WRITE_ONCE(gc_in_progress, true); > - > /* First, select candidates for garbage collection. Only > * in-flight sockets are considered, and from those only ones > * which don't have any external reference. > @@ -325,11 +322,15 @@ void unix_gc(void) > /* All candidates should have been detached by now. */ > BUG_ON(!list_empty(&gc_candidates)); > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). 
*/ > - WRITE_ONCE(gc_in_progress, false); > + spin_unlock(&unix_gc_lock); > +} > > - wake_up(&unix_gc_wait); > +void unix_gc(void) > +{ > + queue_work(system_unbound_wq, &unix_gc_work); > +} > > - out: > - spin_unlock(&unix_gc_lock); > +void __exit unix_gc_flush(void) > +{ > + cancel_work_sync(&unix_gc_work); > } > ---8<--- ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs 2023-11-17 23:38 ` Ivan Babrou @ 2023-11-20 19:29 ` Kuniyuki Iwashima 2023-11-20 22:30 ` Ivan Babrou 0 siblings, 1 reply; 9+ messages in thread From: Kuniyuki Iwashima @ 2023-11-20 19:29 UTC (permalink / raw) To: ivan; +Cc: edumazet, hdanton, kernel-team, kuniyu, linux-kernel, netdev, pabeni From: Ivan Babrou <ivan@cloudflare.com> Date: Fri, 17 Nov 2023 15:38:42 -0800 > On Mon, Oct 23, 2023 at 4:46 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Ivan Babrou <ivan@cloudflare.com> > > Date: Mon, 23 Oct 2023 16:22:35 -0700 > > > On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote: > > > > > > > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com> > > > > > > > > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect > > > > > unix_gc itself, from what I understand. There would always be one > > > > > socket writer or destroyer punished by running the gc still. > > > > > > > > See what you want. The innocents are rescued by kicking a worker off. > > > > Only for thoughts. 
> > > > > > > > --- x/net/unix/garbage.c > > > > +++ y/net/unix/garbage.c > > > > @@ -86,7 +86,6 @@ > > > > /* Internal data structures and random procedures: */ > > > > > > > > static LIST_HEAD(gc_candidates); > > > > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait); > > > > > > > > static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), > > > > struct sk_buff_head *hitlist) > > > > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc > > > > list_move_tail(&u->link, &gc_candidates); > > > > } > > > > > > > > -static bool gc_in_progress; > > > > +static void __unix_gc(struct work_struct *w); > > > > +static DECLARE_WORK(unix_gc_work, __unix_gc); > > > > + > > > > #define UNIX_INFLIGHT_TRIGGER_GC 16000 > > > > > > > > void wait_for_unix_gc(void) > > > > { > > > > /* If number of inflight sockets is insane, > > > > - * force a garbage collect right now. > > > > - * Paired with the WRITE_ONCE() in unix_inflight(), > > > > - * unix_notinflight() and gc_in_progress(). > > > > - */ > > > > - if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC && > > > > - !READ_ONCE(gc_in_progress)) > > > > - unix_gc(); > > > > - wait_event(unix_gc_wait, gc_in_progress == false); > > > > + * kick a garbage collect right now. 
> > > > + * > > > > + * todo s/wait_for_unix_gc/kick_unix_gc/ > > > > + */ > > > > + if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2) > > > > + queue_work(system_unbound_wq, &unix_gc_work); > > > > } > > > > > > > > -/* The external entry point: unix_gc() */ > > > > -void unix_gc(void) > > > > +static DEFINE_MUTEX(unix_gc_mutex); > > > > + > > > > +static void __unix_gc(struct work_struct *w) > > > > { > > > > struct sk_buff *next_skb, *skb; > > > > struct unix_sock *u; > > > > @@ -211,15 +211,10 @@ void unix_gc(void) > > > > struct list_head cursor; > > > > LIST_HEAD(not_cycle_list); > > > > > > > > + if (!mutex_trylock(&unix_gc_mutex)) > > > > + return; > > > > spin_lock(&unix_gc_lock); > > > > > > > > - /* Avoid a recursive GC. */ > > > > - if (gc_in_progress) > > > > - goto out; > > > > - > > > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > > > > - WRITE_ONCE(gc_in_progress, true); > > > > - > > > > /* First, select candidates for garbage collection. Only > > > > * in-flight sockets are considered, and from those only ones > > > > * which don't have any external reference. > > > > @@ -325,11 +320,12 @@ void unix_gc(void) > > > > /* All candidates should have been detached by now. */ > > > > BUG_ON(!list_empty(&gc_candidates)); > > > > > > > > - /* Paired with READ_ONCE() in wait_for_unix_gc(). */ > > > > - WRITE_ONCE(gc_in_progress, false); > > > > - > > > > - wake_up(&unix_gc_wait); > > > > - > > > > - out: > > > > spin_unlock(&unix_gc_lock); > > > > + mutex_unlock(&unix_gc_mutex); > > > > +} > > > > + > > > > +/* The external entry point: unix_gc() */ > > > > +void unix_gc(void) > > > > +{ > > > > + __unix_gc(NULL); > > > > } > > > > -- > > > > > > This one results in less overall load than Kuniyuki's proposed patch > > > with my repro: > > > > > > * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/ > > > > > > My guess is that's because my repro is the one that is getting penalized there. 
> > > > Thanks for testing, and yes. > > > > It would be good to split the repro to one offender and one normal > > process, run them on different users, and measure load on the normal > > process. > > > > > > > There's still a lot work done in unix_release_sock here, where GC runs > > > as long as you have any fds inflight: > > > > > > * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670 > > > > > > Perhaps it can be improved. > > > > Yes, it also can be done async by worker as done in my first patch. > > I replaced schedule_work() with queue_work() to avoid using system_wq > > as gc could take long. > > > > Could you try this ? > > Apologies for the long wait, I was OOO. > > A bit of a problem here is that unix_gc is called unconditionally in > unix_release_sock. unix_release_sock() calls gc only when there is a inflight socket. > It's done asynchronously now and it can only > consume a single CPU with your changes, which is a lot better than > before. I am wondering if we can still do better to only trigger gc > when it touches unix sockets that have inflight fds. > > Commit 3c32da19a858 ("unix: Show number of pending scm files of > receive queue in fdinfo") added "struct scm_stat" to "struct > unix_sock", which can be used to check for the number of inflight fds > in the unix socket. Can we use that to drive the GC? I don't think the stat is useful to trigger gc. Unless the receiver is accessible via sendmsg(), it's not a gc candidate and we need not care about its stats about valid FDs that are ready to be processed and never cleaned up by gc until close(). > I think we can: > > * Trigger unix_gc from unix_release_sock if there's a non-zero number > of inflight fds in the socket being destroyed. This is the case of now. > * Trigger wait_for_unix_gc from the write path only if the write > contains fds or if the socket contains inflight fds. 
> Alternatively, we
> can run gc at the end of the write path and only check for inflight
> fds on the socket there, which is probably simpler.

I don't think it's better to call gc at the end of sendmsg(), as there
would be a smaller chance to free memory compared to running gc at the
beginning of sendmsg().

> GC only applies to unix sockets inflight of other unix sockets, so GC
> can still affect sockets passing regular fds around, but it wouldn't
> affect non-fd-passing unix sockets, which are usually in the data
> path.

I think we can run gc only when the scm contains AF_UNIX sockets by
tweaking the scm processing a little bit.

> This way we don't have to check for per-user inflight fds, which means
> that services running as "nobody" wouldn't be punished for other
> services running as "nobody" screwing up.

If we do not check the user, a user could send 16000 AF_UNIX fds, and
other innocent users sending fds must wait for gc.

I think isolating users would make more sense so you can sandbox
a suspicious process.

Probably we can move flush_work() into the preceding if. Then, the
number of gc invocations in the "nobody" case is the same as before.

---8<---
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 51f30f89bacb..74fc208c8858 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -198,12 +198,13 @@ void wait_for_unix_gc(void)
 	 * Paired with the WRITE_ONCE() in unix_inflight(),
 	 * unix_notinflight().
 	 */
-	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
+	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC) {
 		queue_work(system_unbound_wq, &unix_gc_work);
 
-	/* Penalise senders of not-yet-received-fd */
-	if (READ_ONCE(user->unix_inflight))
-		flush_work(&unix_gc_work);
+		/* Penalise senders of not-yet-received-fd */
+		if (READ_ONCE(user->unix_inflight))
+			flush_work(&unix_gc_work);
+	}
 }
 
 static void __unix_gc(struct work_struct *work)
---8<---

> 
> Does this sound like a reasonable approach?
> [...]
> > -static bool gc_in_progress;
> > -#define UNIX_INFLIGHT_TRIGGER_GC 16000
> > +#define UNIX_INFLIGHT_TRIGGER_GC 16
> 
> It's probably best to keep it at 16k.

Oops, this is just for testing on my local machine easily :p

Anyway, I'll post a formal patch this week.

Thanks!

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-11-20 19:29             ` Kuniyuki Iwashima
@ 2023-11-20 22:30               ` Ivan Babrou
  2023-11-20 23:53                 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Babrou @ 2023-11-20 22:30 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: edumazet, hdanton, kernel-team, linux-kernel, netdev, pabeni

On Mon, Nov 20, 2023 at 11:29 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Ivan Babrou <ivan@cloudflare.com>
> Date: Fri, 17 Nov 2023 15:38:42 -0800
> > On Mon, Oct 23, 2023 at 4:46 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Ivan Babrou <ivan@cloudflare.com>
> > > Date: Mon, 23 Oct 2023 16:22:35 -0700
> > > > On Fri, Oct 20, 2023 at 6:23 PM Hillf Danton <hdanton@sina.com> wrote:
> > > > >
> > > > > On Fri, 20 Oct 2023 10:25:25 -0700 Ivan Babrou <ivan@cloudflare.com>
> > > > > >
> > > > > > This could solve wait_for_unix_gc spinning, but it wouldn't affect
> > > > > > unix_gc itself, from what I understand. There would always be one
> > > > > > socket writer or destroyer punished by running the gc still.
> > > > >
> > > > > See what you want. The innocents are rescued by kicking a worker off.
> > > > > Only for thoughts.
> > > > > 
> > > > > --- x/net/unix/garbage.c
> > > > > +++ y/net/unix/garbage.c
> > > > > @@ -86,7 +86,6 @@
> > > > >  /* Internal data structures and random procedures: */
> > > > > 
> > > > >  static LIST_HEAD(gc_candidates);
> > > > > -static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
> > > > > 
> > > > >  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
> > > > > 			  struct sk_buff_head *hitlist)
> > > > > @@ -185,24 +184,25 @@ static void inc_inflight_move_tail(struc
> > > > >  	list_move_tail(&u->link, &gc_candidates);
> > > > >  }
> > > > > 
> > > > > -static bool gc_in_progress;
> > > > > +static void __unix_gc(struct work_struct *w);
> > > > > +static DECLARE_WORK(unix_gc_work, __unix_gc);
> > > > > +
> > > > >  #define UNIX_INFLIGHT_TRIGGER_GC 16000
> > > > > 
> > > > >  void wait_for_unix_gc(void)
> > > > >  {
> > > > >  	/* If number of inflight sockets is insane,
> > > > > -	 * force a garbage collect right now.
> > > > > -	 * Paired with the WRITE_ONCE() in unix_inflight(),
> > > > > -	 * unix_notinflight() and gc_in_progress().
> > > > > -	 */
> > > > > -	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
> > > > > -	    !READ_ONCE(gc_in_progress))
> > > > > -		unix_gc();
> > > > > -	wait_event(unix_gc_wait, gc_in_progress == false);
> > > > > +	 * kick a garbage collect right now.
> > > > > +	 *
> > > > > +	 * todo s/wait_for_unix_gc/kick_unix_gc/
> > > > > +	 */
> > > > > +	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC /2)
> > > > > +		queue_work(system_unbound_wq, &unix_gc_work);
> > > > >  }
> > > > > 
> > > > > -/* The external entry point: unix_gc() */
> > > > > -void unix_gc(void)
> > > > > +static DEFINE_MUTEX(unix_gc_mutex);
> > > > > +
> > > > > +static void __unix_gc(struct work_struct *w)
> > > > >  {
> > > > >  	struct sk_buff *next_skb, *skb;
> > > > >  	struct unix_sock *u;
> > > > > @@ -211,15 +211,10 @@ void unix_gc(void)
> > > > >  	struct list_head cursor;
> > > > >  	LIST_HEAD(not_cycle_list);
> > > > > 
> > > > > +	if (!mutex_trylock(&unix_gc_mutex))
> > > > > +		return;
> > > > >  	spin_lock(&unix_gc_lock);
> > > > > 
> > > > > -	/* Avoid a recursive GC. */
> > > > > -	if (gc_in_progress)
> > > > > -		goto out;
> > > > > -
> > > > > -	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > > > -	WRITE_ONCE(gc_in_progress, true);
> > > > > -
> > > > >  	/* First, select candidates for garbage collection. Only
> > > > >  	 * in-flight sockets are considered, and from those only ones
> > > > >  	 * which don't have any external reference.
> > > > > @@ -325,11 +320,12 @@ void unix_gc(void)
> > > > >  	/* All candidates should have been detached by now. */
> > > > >  	BUG_ON(!list_empty(&gc_candidates));
> > > > > 
> > > > > -	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
> > > > > -	WRITE_ONCE(gc_in_progress, false);
> > > > > -
> > > > > -	wake_up(&unix_gc_wait);
> > > > > -
> > > > > - out:
> > > > >  	spin_unlock(&unix_gc_lock);
> > > > > +	mutex_unlock(&unix_gc_mutex);
> > > > > +}
> > > > > +
> > > > > +/* The external entry point: unix_gc() */
> > > > > +void unix_gc(void)
> > > > > +{
> > > > > +	__unix_gc(NULL);
> > > > >  }
> > > > > --
> > > > 
> > > > This one results in less overall load than Kuniyuki's proposed patch
> > > > with my repro:
> > > > 
> > > > * https://lore.kernel.org/netdev/20231020220511.45854-1-kuniyu@amazon.com/
> > > > 
> > > > My guess is that's because my repro is the one that is getting penalized there.
> > > 
> > > Thanks for testing, and yes.
> > > 
> > > It would be good to split the repro to one offender and one normal
> > > process, run them on different users, and measure load on the normal
> > > process.
> > > 
> > > > There's still a lot of work done in unix_release_sock here, where GC runs
> > > > as long as you have any fds inflight:
> > > > 
> > > > * https://elixir.bootlin.com/linux/v6.1/source/net/unix/af_unix.c#L670
> > > > 
> > > > Perhaps it can be improved.
> > > 
> > > Yes, it also can be done async by a worker, as done in my first patch.
> > > I replaced schedule_work() with queue_work() to avoid using system_wq,
> > > as gc could take long.
> > > 
> > > Could you try this?
> > 
> > Apologies for the long wait, I was OOO.
> > 
> > A bit of a problem here is that unix_gc is called unconditionally in
> > unix_release_sock.
> 
> unix_release_sock() calls gc only when there is an inflight socket.
> 
> > It's done asynchronously now and it can only
> > consume a single CPU with your changes, which is a lot better than
> > before. I am wondering if we can still do better to only trigger gc
> > when it touches unix sockets that have inflight fds.
> > 
> > Commit 3c32da19a858 ("unix: Show number of pending scm files of
> > receive queue in fdinfo") added "struct scm_stat" to "struct
> > unix_sock", which can be used to check for the number of inflight fds
> > in the unix socket. Can we use that to drive the GC?
> 
> I don't think the stat is useful to trigger gc. Unless the receiver
> is accessible via sendmsg(), it's not a gc candidate, and we need not
> care about its stats about valid FDs that are ready to be processed
> and never cleaned up by gc until close().

I'm not quite following why it's not useful, could you elaborate? The
GC call today is conditioned on 16k system-wide inflight unix fds:

* https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L2204
* https://github.com/torvalds/linux/blob/master/net/unix/garbage.c#L191

The only way checking for inflight fds is worse is if there are non-unix
fds inflight. This can be easily alleviated by checking for non-zero
inflight fds on the socket plus the system-wide unix fds check, making it
better than just checking the system-wide gauge. What am I missing?

> > I think we can:
> > 
> > * Trigger unix_gc from unix_release_sock if there's a non-zero number
> >   of inflight fds in the socket being destroyed.
> 
> This is the case now.

It's not, unless I'm misunderstanding something. The check today is
for any inflight fds _system-wide_:

* https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L684-L685

My suggestion is to check this _per socket_, making sockets that do
not pass any fds (likely an overwhelming majority) unaffected.

> > * Trigger wait_for_unix_gc from the write path only if the write
> >   contains fds or if the socket contains inflight fds. Alternatively, we
> >   can run gc at the end of the write path and only check for inflight
> >   fds on the socket there, which is probably simpler.
> 
> I don't think it's better to call gc at the end of sendmsg(), as there
> would be a smaller chance to free memory compared to running gc at the
> beginning of sendmsg().

There is no guarantee that GC frees any memory either way.

> > GC only applies to unix sockets inflight of other unix sockets, so GC
> > can still affect sockets passing regular fds around, but it wouldn't
> > affect non-fd-passing unix sockets, which are usually in the data
> > path.
> 
> I think we can run gc only when the scm contains AF_UNIX sockets by
> tweaking the scm processing a little bit.

That's even better than what I'm proposing (all inflight fds -> just
unix inflight fds being the difference), but probably requires a bit
more code changes. I'm happy to test a patch when you have it.

> > This way we don't have to check for per-user inflight fds, which means
> > that services running as "nobody" wouldn't be punished for other
> > services running as "nobody" screwing up.
> 
> If we do not check the user, a user could send 16000 AF_UNIX fds, and
> other innocent users sending fds must wait for gc.

In my proposal we check for inflight fds per socket, and if there are
none, there is no GC at all.

> I think isolating users would make more sense so you can sandbox
> a suspicious process.

Sure, but that's somewhat beside the point. Ideally things should work
well out of the box.

> Probably we can move flush_work() into the preceding if. Then, the
> number of gc invocations in the "nobody" case is the same as before.

Just to make sure I get my point across: I want to have no GC if a
socket does not possibly contribute any garbage to be collected. If my
program just passes bytes back and forth and churns through unix
sockets, it shouldn't be penalized with GC and it shouldn't try to
schedule GC, no matter what else is going on in the system.

Hope this makes sense.
> ---8<---
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 51f30f89bacb..74fc208c8858 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -198,12 +198,13 @@ void wait_for_unix_gc(void)
>  	 * Paired with the WRITE_ONCE() in unix_inflight(),
>  	 * unix_notinflight().
>  	 */
> -	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
> +	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC) {
>  		queue_work(system_unbound_wq, &unix_gc_work);
> 
> -	/* Penalise senders of not-yet-received-fd */
> -	if (READ_ONCE(user->unix_inflight))
> -		flush_work(&unix_gc_work);
> +		/* Penalise senders of not-yet-received-fd */
> +		if (READ_ONCE(user->unix_inflight))
> +			flush_work(&unix_gc_work);
> +	}
>  }
> 
>  static void __unix_gc(struct work_struct *work)
> ---8<---
> 
> > 
> > Does this sound like a reasonable approach?
> > [...]
> > > -static bool gc_in_progress;
> > > -#define UNIX_INFLIGHT_TRIGGER_GC 16000
> > > +#define UNIX_INFLIGHT_TRIGGER_GC 16
> > 
> > It's probably best to keep it at 16k.
> 
> Oops, this is just for testing on my local machine easily :p
> 
> Anyway, I'll post a formal patch this week.
> 
> Thanks!

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-11-20 22:30               ` Ivan Babrou
@ 2023-11-20 23:53                 ` Kuniyuki Iwashima
  0 siblings, 0 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2023-11-20 23:53 UTC (permalink / raw)
To: ivan; +Cc: edumazet, hdanton, kernel-team, kuniyu, linux-kernel, netdev, pabeni

From: Ivan Babrou <ivan@cloudflare.com>
Date: Mon, 20 Nov 2023 14:30:52 -0800
> > > A bit of a problem here is that unix_gc is called unconditionally in
> > > unix_release_sock.
> > 
> > unix_release_sock() calls gc only when there is an inflight socket.
> > 
> > > It's done asynchronously now and it can only
> > > consume a single CPU with your changes, which is a lot better than
> > > before. I am wondering if we can still do better to only trigger gc
> > > when it touches unix sockets that have inflight fds.
> > > 
> > > Commit 3c32da19a858 ("unix: Show number of pending scm files of
> > > receive queue in fdinfo") added "struct scm_stat" to "struct
> > > unix_sock", which can be used to check for the number of inflight fds
> > > in the unix socket. Can we use that to drive the GC?
> > 
> > I don't think the stat is useful to trigger gc. Unless the receiver
> > is accessible via sendmsg(), it's not a gc candidate, and we need not
> > care about its stats about valid FDs that are ready to be processed
> > and never cleaned up by gc until close().
> 
> I'm not quite following why it's not useful, could you elaborate? The
> GC call today is conditioned on 16k system-wide inflight unix fds:
> 
> * https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L2204
> * https://github.com/torvalds/linux/blob/master/net/unix/garbage.c#L191
> 
> The only way checking for inflight fds is worse is if there are non-unix
> fds inflight. This can be easily alleviated by checking for non-zero
> inflight fds on the socket plus the system-wide unix fds check, making it
> better than just checking the system-wide gauge. What am I missing?
unix_tot_inflight is the number of inflight AF_UNIX sockets, excluding
non-unix fds. (See unix_inflight().)

Let's say sk-A passes its fd to sk-B, and vice versa, and then we
close sk-A and sk-B. Each fd passing increments the refcnt of the
corresponding "file", so unix_release_sock() is not called for either
socket. And if we check each socket's inflight fds to trigger gc, and
no one sends an fd, we never trigger gc for sk-A and sk-B.

In other words, the unix_gc() call in unix_release_sock() is basically
for _other_ sockets, not for the socket being destroyed there. So,
here we need to check the number of inflight fds in other sockets.

We needed to clean up all inflight dead fds when we close() all AF_UNIX
sockets (because AF_UNIX could be modularised and unloaded). It can no
longer be a module, so technically, we need not run gc at close() now,
but it would be better not to leave the dead fds around for so long.
We need not wait for gc, though.

> > > I think we can:
> > > 
> > > * Trigger unix_gc from unix_release_sock if there's a non-zero number
> > >   of inflight fds in the socket being destroyed.
> > 
> > This is the case now.
> 
> It's not, unless I'm misunderstanding something.

Sorry, I just skipped "in the socket being destroyed", but as mentioned
above, we need to check the system-wide unix_tot_inflight anyway.

> The check today is
> for any inflight fds _system-wide_:
> 
> * https://github.com/torvalds/linux/blob/v6.7-rc2/net/unix/af_unix.c#L684-L685
> 
> My suggestion is to check this _per socket_, making sockets that do
> not pass any fds (likely an overwhelming majority) unaffected.
> 
> > > * Trigger wait_for_unix_gc from the write path only if the write
> > >   contains fds or if the socket contains inflight fds. Alternatively, we
> > >   can run gc at the end of the write path and only check for inflight
> > >   fds on the socket there, which is probably simpler.
> > I don't think it's better to call gc at the end of sendmsg(), as there
> > would be a smaller chance to free memory compared to running gc at the
> > beginning of sendmsg().
> 
> There is no guarantee that GC frees any memory either way.

Right.

> > > GC only applies to unix sockets inflight of other unix sockets, so GC
> > > can still affect sockets passing regular fds around, but it wouldn't
> > > affect non-fd-passing unix sockets, which are usually in the data
> > > path.
> > 
> > I think we can run gc only when the scm contains AF_UNIX sockets by
> > tweaking the scm processing a little bit.
> 
> That's even better than what I'm proposing (all inflight fds -> just
> unix inflight fds being the difference), but probably requires a bit
> more code changes. I'm happy to test a patch when you have it.
> 
> > > This way we don't have to check for per-user inflight fds, which means
> > > that services running as "nobody" wouldn't be punished for other
> > > services running as "nobody" screwing up.
> > 
> > If we do not check the user, a user could send 16000 AF_UNIX fds, and
> > other innocent users sending fds must wait for gc.
> 
> In my proposal we check for inflight fds per socket, and if there are
> none, there is no GC at all.

Again, we need to check the system-wide unix_tot_inflight at least for
other sockets that are already close()d and not accessible from
userspace. (If we can check the per-socket stat in sendmsg(), the
receiver socket must have a valid fd and is not a gc candidate.)

So, we need to schedule gc when unix_tot_inflight > 16K anyway, and we
need to decide who should wait.

If a user has a not-yet-received fd, it could be in a dead socket's
recv queue, so the user should wait. OTOH, the fds that you are
suggesting to check are still valid and not dead, at least:

* if the process is sending AF_UNIX fds
* if the receiver has inflight fds in its recv queue

Both cases could contribute to triggering gc, but there is less
certainty there than with the user's inflight count.
> > I think isolating users would make more sense so you can sandbox
> > a suspicious process.
> 
> Sure, but that's somewhat beside the point. Ideally things should work
> well out of the box.
> 
> > Probably we can move flush_work() into the preceding if. Then, the
> > number of gc invocations in the "nobody" case is the same as before.
> 
> Just to make sure I get my point across: I want to have no GC if a
> socket does not possibly contribute any garbage to be collected. If my
> program just passes bytes back and forth and churns through unix
> sockets, it shouldn't be penalized with GC and it shouldn't try to
> schedule GC, no matter what else is going on in the system.
> 
> Hope this makes sense.

Yes, that makes sense, and I understand that sendmsg() with no fds need
not wait, but the point is who is responsible for waiting for gc to
finish.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: wait_for_unix_gc can cause CPU overload for well behaved programs
  2023-10-19 22:35 wait_for_unix_gc can cause CPU overload for well behaved programs Ivan Babrou
       [not found] ` <20231020104728.2060-1-hdanton@sina.com>
@ 2023-10-20 22:05 ` Kuniyuki Iwashima
  1 sibling, 0 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-20 22:05 UTC (permalink / raw)
To: ivan; +Cc: edumazet, kernel-team, kuba, linux-kernel, netdev, pabeni, kuniyu

From: Ivan Babrou <ivan@cloudflare.com>
Date: Thu, 19 Oct 2023 15:35:01 -0700
> Hello,
> 
> We have observed this issue twice (2019 and 2023): a well behaved
> service that doesn't pass any file descriptors around starts to spend
> a ton of CPU time in wait_for_unix_gc.
> 
> The cause of this is that the unix send path unconditionally calls
> wait_for_unix_gc, which is a global garbage collection. If any
> misbehaved program exists on a system, it can force extra work for
> well behaved programs.
> 
> This behavior is not new: 9915672d4127 ("af_unix: limit
> unix_tot_inflight") is from 2010.
> 
> I managed to come up with a repro for this behavior:
> 
> * https://gist.github.com/bobrik/82e5722261920c9f23d9402b88a0bb27
> 
> It also includes a flamegraph illustrating the issue. It's all in one
> program for convenience, but in reality the offender not picking up
> SCM_RIGHTS messages and the suffering program just minding its own
> business are separate.
> 
> It is also non-trivial to find the offender when this happens, as it
> can be completely idle while wreaking havoc for the rest of the
> system.
> 
> I don't think it's fair to penalize every unix_stream_sendmsg like
> this. The 16k threshold also doesn't feel very flexible; surely
> computers are bigger these days and can handle more.

Probably we could do the gc async and enforce the penalty only on the
offender by checking user->unix_inflight.
compile test only:

---8<---
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 824c258143a3..a119f37953cc 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -12,8 +12,9 @@ void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_destruct_scm(struct sk_buff *skb);
 void io_uring_destruct_scm(struct sk_buff *skb);
-void unix_gc(void);
 void wait_for_unix_gc(void);
+void unix_gc_start(void);
+void unix_gc_stop(void);
 struct sock *unix_get_socket(struct file *filp);
 struct sock *unix_peer_get(struct sock *sk);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..56db096b13f1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -681,7 +681,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	 */
 
 	if (READ_ONCE(unix_tot_inflight))
-		unix_gc();		/* Garbage collect fds */
+		unix_gc_start();	/* Garbage collect fds */
 }
 
 static void init_peercred(struct sock *sk)
@@ -3683,6 +3683,7 @@ static int __init af_unix_init(void)
 
 static void __exit af_unix_exit(void)
 {
+	unix_gc_stop();
 	sock_unregister(PF_UNIX);
 	proto_unregister(&unix_dgram_proto);
 	proto_unregister(&unix_stream_proto);
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 2405f0f9af31..fb24d62fe34a 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -185,24 +185,26 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 	list_move_tail(&u->link, &gc_candidates);
 }
 
-static bool gc_in_progress;
 #define UNIX_INFLIGHT_TRIGGER_GC 16000
 
+static void unix_gc(struct work_struct *work);
+static DECLARE_WORK(unix_gc_work, unix_gc);
+
 void wait_for_unix_gc(void)
 {
-	/* If number of inflight sockets is insane,
-	 * force a garbage collect right now.
-	 * Paired with the WRITE_ONCE() in unix_inflight(),
-	 * unix_notinflight() and gc_in_progress().
-	 */
-	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
-	    !READ_ONCE(gc_in_progress))
-		unix_gc();
-	wait_event(unix_gc_wait, gc_in_progress == false);
+	struct user_struct *user = get_uid(current_user());
+
+	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC)
+		schedule_work(&unix_gc_work);
+
+	if (!READ_ONCE(user->unix_inflight))
+		return;
+
+	flush_work(&unix_gc_work);
 }
 
 /* The external entry point: unix_gc() */
-void unix_gc(void)
+static void unix_gc(struct work_struct *work)
 {
 	struct sk_buff *next_skb, *skb;
 	struct unix_sock *u;
@@ -213,13 +215,6 @@ void unix_gc(void)
 
 	spin_lock(&unix_gc_lock);
 
-	/* Avoid a recursive GC. */
-	if (gc_in_progress)
-		goto out;
-
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
-	WRITE_ONCE(gc_in_progress, true);
-
 	/* First, select candidates for garbage collection. Only
 	 * in-flight sockets are considered, and from those only ones
 	 * which don't have any external reference.
@@ -325,11 +320,15 @@ void unix_gc(void)
 	/* All candidates should have been detached by now. */
 	BUG_ON(!list_empty(&gc_candidates));
 
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
-	WRITE_ONCE(gc_in_progress, false);
+	spin_unlock(&unix_gc_lock);
+}
 
-	wake_up(&unix_gc_wait);
+void unix_gc_start(void)
+{
+	schedule_work(&unix_gc_work);
+}
 
- out:
-	spin_unlock(&unix_gc_lock);
+void __exit unix_gc_stop(void)
+{
+	flush_work(&unix_gc_work);
 }
---8<---

^ permalink raw reply related	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-11-20 23:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-19 22:35 wait_for_unix_gc can cause CPU overload for well behaved programs Ivan Babrou
     [not found] ` <20231020104728.2060-1-hdanton@sina.com>
2023-10-20 17:25   ` Ivan Babrou
     [not found]     ` <20231021012322.1799-1-hdanton@sina.com>
2023-10-23 23:22       ` Ivan Babrou
2023-10-23 23:45         ` Kuniyuki Iwashima
2023-11-17 23:38           ` Ivan Babrou
2023-11-20 19:29             ` Kuniyuki Iwashima
2023-11-20 22:30               ` Ivan Babrou
2023-11-20 23:53                 ` Kuniyuki Iwashima
2023-10-20 22:05 ` Kuniyuki Iwashima