From: Tejun Heo <tj@kernel.org>
To: Joe Lawrence <joe.lawrence@stratus.com>
Cc: netdev@vger.kernel.org, Jiri Pirko <jiri@resnulli.us>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
Date: Mon, 29 Sep 2014 12:06:01 -0400	[thread overview]
Message-ID: <20140929160601.GD15925@htj.dyndns.org> (raw)
In-Reply-To: <20140929115445.40221d8e@jlaw-desktop.mno.stratus.com>

(cc'ing Paul and quoting the whole body)

Paul, this is a fix for an RCU sched stall observed with a work item
that requeues itself while waiting for the RCU grace period.  As the
self-requeueing work item ends up being executed by the same kworker,
the worker task never stops running in the absence of a higher
priority task, and it seems to delay the RCU grace period for a very
long time on !PREEMPT kernels.  As each work item denotes a boundary
which no synchronization construct stretches across, I wonder whether
it'd be a good idea to add a notification for the end of the RCU
critical section between executions of work items.

Thanks.

On Mon, Sep 29, 2014 at 11:54:45AM -0400, Joe Lawrence wrote:
> Hello Jiri,
> 
> I've been debugging a hang on RHEL7 that seems to originate in the
> teaming driver and the team_notify_peers_work/team_mcast_rejoin_work
> rtnl_trylock rescheduling logic.  Running a stand-alone minimal driver
> mimicking the same schedule_delayed_work(.., 0) pattern reproduces the
> problem on RHEL7 and upstream kernels [1].
> 
> A quick summary of the hang:
> 
> 1 - systemd-udevd issues an ioctl that heads down dev_ioctl (grabbing
>     the rtnl_mutex), dev_ifsioc, dev_change_name and finally
>     synchronize_sched.  In every vmcore I've taken of the hang, this
>     thread is waiting on RCU.
> 
> 2 - A kworker thread goes to 100% CPU.
> 
> 3 - Inspecting the running thread on the CPU that rcusched reported as
>     holding up the RCU grace period usually shows it in either
>     team_notify_peers_work, team_mcast_rejoin_work, or somewhere in the
>     workqueue code (process_one_work).  This is the same CPU/thread as
>     #2.
> 
> 4 - team_notify_peers_work and team_mcast_rejoin_work want the rtnl_lock
>     that systemd-udevd in #1 has, so they try to play nice by calling
>     rtnl_trylock and rescheduling on failure.  Unfortunately with 0
>     jiffy delay, process_one_work will "execute immediately" (i.e.,
>     after
>     others already in queue, but before the next tick).  With the stock
>     RHEL7 !CONFIG_PREEMPT at least, this creates a tight loop on
>     process_one_work + rtnl_trylock that spins the CPU in #2.
> 
> 5 - Some minutes later, RCU seems to be kicked by a side effect of
>     smp_apic_timer_interrupt.  (This was the only other interesting
>     function reported by the ftrace function tracer.)
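
The tight loop in #4 can be sketched like this (kernel-style sketch of
my own, not the actual reproducer module from [1]; spin_work and
spin_work_fn are made-up names):

```c
static struct delayed_work spin_work;

static void spin_work_fn(struct work_struct *work)
{
	if (!rtnl_trylock()) {
		/* 0 jiffies: the same kworker re-executes this work
		 * item immediately, so on !PREEMPT the CPU never
		 * schedules or passes through a quiescent state while
		 * rtnl_mutex is held elsewhere. */
		schedule_delayed_work(&spin_work, 0);
		return;
	}
	rtnl_unlock();
}
```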
> 
> See the patch below for a potential workaround.  A delay of at least
> 1 jiffy should give process_one_work some breathing room before it
> calls back into team_notify_peers_work/team_mcast_rejoin_work and
> attempts to acquire the rtnl_lock mutex again.
> 
> Regards,
> 
> -- Joe
> 
> [1] http://marc.info/?l=linux-kernel&m=141192244232345&w=2
> 
> -->8--- -->8--- -->8--- -->8---
> 
> From fc5bbf5771b5732f7479ac6e84bbfdde05710023 Mon Sep 17 00:00:00 2001
> From: Joe Lawrence <joe.lawrence@stratus.com>
> Date: Mon, 29 Sep 2014 11:09:05 -0400
> Subject: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
> 
> Give the CPU running the kworker handling team_notify_peers_work and
> team_mcast_rejoin_work functions some scheduling air by specifying a
> non-zero delay.
> 
> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> ---
>  drivers/net/team/team.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> index ef10302..d46df38 100644
> --- a/drivers/net/team/team.c
> +++ b/drivers/net/team/team.c
> @@ -633,7 +633,7 @@ static void team_notify_peers_work(struct work_struct *work)
>  	team = container_of(work, struct team, notify_peers.dw.work);
>  
>  	if (!rtnl_trylock()) {
> -		schedule_delayed_work(&team->notify_peers.dw, 0);
> +		schedule_delayed_work(&team->notify_peers.dw, 1);
>  		return;
>  	}
>  	call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, team->dev);
> @@ -673,7 +673,7 @@ static void team_mcast_rejoin_work(struct work_struct *work)
>  	team = container_of(work, struct team, mcast_rejoin.dw.work);
>  
>  	if (!rtnl_trylock()) {
> -		schedule_delayed_work(&team->mcast_rejoin.dw, 0);
> +		schedule_delayed_work(&team->mcast_rejoin.dw, 1);
>  		return;
>  	}
>  	call_netdevice_notifiers(NETDEV_RESEND_IGMP, team->dev);
> -- 
> 1.7.10.4
> 

-- 
tejun


Thread overview: 9+ messages
2014-09-29 15:54 [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock Joe Lawrence
2014-09-29 16:06 ` Tejun Heo [this message]
2014-10-02  6:43   ` Paul E. McKenney
2014-10-03 19:37     ` Joe Lawrence
2014-10-04  8:37       ` Paul E. McKenney
2014-10-05  2:13         ` Tejun Heo
2014-10-05 12:53           ` Joe Lawrence
2014-10-05 14:08             ` Paul E. McKenney
2014-10-05 16:11               ` Tejun Heo
