From: Tejun Heo <tj@kernel.org>
To: Joe Lawrence <joe.lawrence@stratus.com>
Cc: netdev@vger.kernel.org, Jiri Pirko <jiri@resnulli.us>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
Date: Mon, 29 Sep 2014 12:06:01 -0400	[thread overview]
Message-ID: <20140929160601.GD15925@htj.dyndns.org> (raw)
In-Reply-To: <20140929115445.40221d8e@jlaw-desktop.mno.stratus.com>

(cc'ing Paul and quoting the whole body)

Paul, this is a fix for an RCU sched stall observed with a work item
that requeues itself while waiting for the RCU grace period.  As the
self-requeueing work item ends up being executed by the same kworker,
the worker task never stops running in the absence of a higher
priority task, and it seems to delay the RCU grace period for a very
long time on !PREEMPT kernels.  As each work item denotes a boundary
which no synchronization construct stretches across, I wonder whether
it'd be a good idea to add a notification for the end of the RCU
critical section between executions of work items.

Thanks.

On Mon, Sep 29, 2014 at 11:54:45AM -0400, Joe Lawrence wrote:
> Hello Jiri,
> 
> I've been debugging a hang on RHEL7 that seems to originate in the
> teaming driver and the team_notify_peers_work/team_mcast_rejoin_work
> rtnl_trylock rescheduling logic.  Running a stand-alone minimal driver
> mimicking the same schedule_delayed_work(.., 0) pattern reproduces the
> problem on RHEL7 and upstream kernels [1].
> 
> A quick summary of the hang:
> 
> 1 - systemd-udevd issues an ioctl that heads down dev_ioctl (grabbing
>     the rtnl_mutex), dev_ifsioc, dev_change_name and finally
>     synchronize_sched.  In every vmcore I've taken of the hang, this
>     thread is waiting on RCU.
> 
> 2 - A kworker thread goes to 100% CPU.
> 
> 3 - Inspecting the running thread on the CPU that rcusched reported as
>     holding up the RCU grace period usually shows it in either
>     team_notify_peers_work, team_mcast_rejoin_work, or somewhere in the
>     workqueue code (process_one_work).  This is the same CPU/thread as
>     #2.
> 
> 4 - team_notify_peers_work and team_mcast_rejoin_work want the rtnl_lock
>     that systemd-udevd in #1 has, so they try to play nice by calling
>     rtnl_trylock and rescheduling on failure.  Unfortunately with 0
>     jiffy delay, process_one_work will "execute immediately" (i.e.,
>     after
>     others already in queue, but before the next tick).  With the stock
>     RHEL7 !CONFIG_PREEMPT at least, this creates a tight loop on
>     process_one_work + rtnl_trylock that spins the CPU in #2.
> 
> 5 - Some minutes later, RCU seems to be kicked by a side effect of
>     smp_apic_timer_interrupt.  (This was the only other interesting
>     function reported by the ftrace function tracer.)
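
The tight loop in #4 can be sketched like this (kernel-style sketch of
my own, not the actual reproducer module from [1]; spin_work and
spin_work_fn are made-up names):

```c
static struct delayed_work spin_work;

static void spin_work_fn(struct work_struct *work)
{
	if (!rtnl_trylock()) {
		/* 0 jiffies: the same kworker re-executes this work
		 * item immediately, so on !PREEMPT the CPU never
		 * schedules or passes through a quiescent state while
		 * rtnl_mutex is held elsewhere. */
		schedule_delayed_work(&spin_work, 0);
		return;
	}
	rtnl_unlock();
}
```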
> 
> See the patch below for a potential workaround.  A delay of at least
> 1 jiffy should give process_one_work some breathing room before it
> calls back into team_notify_peers_work/team_mcast_rejoin_work and
> attempts to acquire the rtnl_lock mutex again.
> 
> Regards,
> 
> -- Joe
> 
> [1] http://marc.info/?l=linux-kernel&m=141192244232345&w=2
> 
> -->8--- -->8--- -->8--- -->8---
> 
> From fc5bbf5771b5732f7479ac6e84bbfdde05710023 Mon Sep 17 00:00:00 2001
> From: Joe Lawrence <joe.lawrence@stratus.com>
> Date: Mon, 29 Sep 2014 11:09:05 -0400
> Subject: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
> 
> Give the CPU running the kworker handling team_notify_peers_work and
> team_mcast_rejoin_work functions some scheduling air by specifying a
> non-zero delay.
> 
> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> ---
>  drivers/net/team/team.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> index ef10302..d46df38 100644
> --- a/drivers/net/team/team.c
> +++ b/drivers/net/team/team.c
> @@ -633,7 +633,7 @@ static void team_notify_peers_work(struct work_struct *work)
>  	team = container_of(work, struct team, notify_peers.dw.work);
>  
>  	if (!rtnl_trylock()) {
> -		schedule_delayed_work(&team->notify_peers.dw, 0);
> +		schedule_delayed_work(&team->notify_peers.dw, 1);
>  		return;
>  	}
>  	call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, team->dev);
> @@ -673,7 +673,7 @@ static void team_mcast_rejoin_work(struct work_struct *work)
>  	team = container_of(work, struct team, mcast_rejoin.dw.work);
>  
>  	if (!rtnl_trylock()) {
> -		schedule_delayed_work(&team->mcast_rejoin.dw, 0);
> +		schedule_delayed_work(&team->mcast_rejoin.dw, 1);
>  		return;
>  	}
>  	call_netdevice_notifiers(NETDEV_RESEND_IGMP, team->dev);
> -- 
> 1.7.10.4
> 

-- 
tejun


Thread overview: 9+ messages
2014-09-29 15:54 [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock Joe Lawrence
2014-09-29 16:06 ` Tejun Heo [this message]
2014-10-02  6:43   ` Paul E. McKenney
2014-10-03 19:37     ` Joe Lawrence
2014-10-04  8:37       ` Paul E. McKenney
2014-10-05  2:13         ` Tejun Heo
2014-10-05 12:53           ` Joe Lawrence
2014-10-05 14:08             ` Paul E. McKenney
2014-10-05 16:11               ` Tejun Heo
