From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1052C433F5 for ; Sun, 5 Sep 2021 23:37:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BAD4A60F92 for ; Sun, 5 Sep 2021 23:37:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238255AbhIEXiE (ORCPT ); Sun, 5 Sep 2021 19:38:04 -0400 Received: from Galois.linutronix.de ([193.142.43.55]:33686 "EHLO galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229666AbhIEXiC (ORCPT ); Sun, 5 Sep 2021 19:38:02 -0400 From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1630885017; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to; bh=XRDBwmPSq0OtExDtrXwn2TezsW0Z/MUnaN5kkmANKDA=; b=Tq23GPbTpOfmoAuO1nJdHdzP5Ssdddhg48q3UTqIpcbzKlydMr15p7s58F4AmhFuhS5vcy UX+w5VNwG5BN3BYpZTlHpHZyVq4x8UvnGXsP0gkI9bUujM3+RGP7tVm/sxb5oA6TpKJ+79 l1xlfrFpSkdfnFw/0CGV7aATjn7dMu+Jn5glKBNupuCYhphlORdHnIhuK5AGeap+gvBrQX nT0kmaZDcB2TrxznfVzm8FXJ2VAmQO/io251fc7H7w8duD7uXjcoWMdL1slcF5phxT8qto Ju/LAeUhZsedjuCgnuM66kNhhsvkKSJAg7or526uy3lrzZfzCPvbgsDQiQ2BbA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1630885017; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to; bh=XRDBwmPSq0OtExDtrXwn2TezsW0Z/MUnaN5kkmANKDA=; b=AkyuB/Ns+u8qG59Tjk4QQpec7V5GBsNxQdnPNioHu8BzdnEWKGmWuzpHxd5EScozlVJyWD izyUAw1mDYSxQbBw== To: Hillf Danton , syzbot Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, syzkaller-bugs@googlegroups.com, eric.dumazet@gmail.com Subject: Re: [syzbot] INFO: task hung in __lru_add_drain_all In-Reply-To: <20210903111011.2811-1-hdanton@sina.com> Date: Mon, 06 Sep 2021 01:36:56 +0200 Message-ID: <87k0jua92f.ffs@tglx> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hillf, On Fri, Sep 03 2021 at 19:10, Hillf Danton wrote: > > See if ksoftirqd is preventing bound workqueue work from running. What? > --- a/kernel/softirq.c > +++ b/kernel/softirq.c > @@ -521,6 +521,7 @@ asmlinkage __visible void __softirq_entr > bool in_hardirq; > __u32 pending; > int softirq_bit; > + bool is_ksoftirqd = __this_cpu_read(ksoftirqd) == current; > > /* > * Mask out PF_MEMALLOC as the current task context is borrowed for the > @@ -565,6 +566,8 @@ restart: > } > h++; > pending >>= softirq_bit; > + if (is_ksoftirqd && in_task()) Can you please explain how this would ever be true? #define in_task() (!(in_nmi() | in_hardirq() | in_serving_softirq())) in_task() is guaranteed to be false here, because in_serving_softirq() is guaranteed to be true simply because this is the softirq processing context. > + cond_resched(); ___do_softirq() returns after 2 msec of softirq processing whether it is invoked on return from interrupt or in ksoftirqd context. On return from interrupt this wakes ksoftirqd and returns. In ksoftirqd this is a rescheduling point. But that only works when the action handlers, e.g. net_rx_action(), behave well and respect that limit as well. net_rx_action() has it's own time limit: netdev_budget_usecs That defaults to: 2 * USEC_PER_SEC / HZ The config has HZ=100, so this loop should terminate after 2 * 1e6 / 100 = 20000us = 20ms The provided C-reproducer does not change that default. But again this loop can only terminate if napi_poll() and the subsequently invoked callchain behaves well. So instead of sending obviously useless "debug" patches, why are you not grabbing the kernel config and the reproducer and figure out what the root cause is? Enable tracing, add some trace_printks and let ftrace_dump_on_oops spill it out when the problem triggers. That will pinpoint the issue. Thanks, tglx