All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Wan, Kaike" <kaike.wan@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: "Marciniszyn, Mike" <mike.marciniszyn@intel.com>,
	"Dalessandro, Dennis" <dennis.dalessandro@intel.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	"Fleck, John" <john.fleck@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: RE: hrtimer (rdmavt RNR timer) was lost
Date: Mon, 23 Apr 2018 13:22:11 +0000	[thread overview]
Message-ID: <3F128C9216C9B84BB6ED23EF16290AFB634C9420@CRSMSX101.amr.corp.intel.com> (raw)
In-Reply-To: <alpine.DEB.2.21.1804231355570.2077@nanos.tec.linutronix.de>



> -----Original Message-----
> From: Thomas Gleixner [mailto:tglx@linutronix.de]
> Sent: Monday, April 23, 2018 8:53 AM
> To: Wan, Kaike <kaike.wan@intel.com>
> Cc: Marciniszyn, Mike <mike.marciniszyn@intel.com>; Dalessandro, Dennis
> <dennis.dalessandro@intel.com>; Weiny, Ira <ira.weiny@intel.com>; Fleck,
> John <john.fleck@intel.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org
> Subject: Re: hrtimer (rdmavt RNR timer) was lost
> 
> On Fri, 20 Apr 2018, Wan, Kaike wrote:
> 
> Please do not send html mail. It's rejected by the list server.
> 
> >           <idle>-0     [001] d.h. 3598473310671: hrtimer_start:
> hrtimer=00000000708914d7 function=rvt_rc_rnr_retry [rdmavt]
> expires=712172295802 softexpires=712172295802mode=REL
> base=0000000066dda1ea next=0000000039cae92b
> >           <idle>-0     [001] d.h. 3598473311561: hrtimer_post_add:
> hrtimer=00000000708914d7 function=rvt_rc_rnr_retry [rdmavt]
> base=0000000066dda1ea next=0000000039cae92b
> >
> 
> > --> The target was first added. The expires was 712172295802. The call
> >     was made from cpu 1 while the timer base was on cpu 3.
> 
> >           <idle>-0     [003] dn.. 3598605305357: hrtimer_start:
> hrtimer=0000000004346740 function=tick_sched_timer
> expires=716767000000 softexpires=716767000000mode=ABS|PINNED
> base=0000000066dda1ea next=00000000708914d7
> >           <idle>-0     [003] dn.. 3598605305781: hrtimer_post_add:
> hrtimer=0000000004346740 function=tick_sched_timer
> base=0000000066dda1ea next=00000000708914d7
> >
> > --> the target timer was the next one to trigger. The call was from cpu 3.
> >
> >           <idle>-0     [003] dN.. 3598605306604: hrtimer_pre_remove:
> hrtimer=0000000004346740 function=tick_sched_timer
> base=0000000066dda1ea next=00000000708914d7
> >           <idle>-0     [003] dN.. 3598605306802: hrtimer_post_remove:
> hrtimer=0000000004346740 function=tick_sched_timer
> base=0000000066dda1ea next=00000000708914d7
> >
> > --> The tick_sched_timer was removed
> >
> >           <idle>-0     [003] dN.. 3598605307236: hrtimer_start:
> hrtimer=0000000004346740 function=tick_sched_timer
> expires=712171000000 softexpires=712171000000mode=ABS|PINNED
> base=0000000066dda1ea next=00000000708914d7
> >           <idle>-0     [003] dN.. 3598605307601: hrtimer_post_add:
> hrtimer=0000000004346740 function=tick_sched_timer
> base=0000000066dda1ea next=0000000004346740
> >
> > --> The tick_sched_timer was added again with a new expires
> > --> (712171000000
> >     < 716767000000), which replaced the target (712172295802) as the
> >     head->next.
> >
> >           <idle>-0     [002] d.h. 3598605313255: hrtimer_start:
> hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt]
> expires=712172915662 softexpires=712172915662mode=REL
> base=0000000066dda1ea next=0000000004346740
> >           <idle>-0     [002] d.h. 3599538740786: hrtimer_post_add:
> hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt]
> base=0000000066dda1ea next=00000000662d2dd8
> >
> >
> > --> A new hrtimer (00000000662d2dd8) was added (from cpu 2). Its
> > --> expires
> >     (712172915662) was larger than the target's (712172295802). How could
> >     it become the next hrtimer to trigger??? The next pointer was not set
> >     properly here.
> 
> I have no idea how that can happen assumed that you did not strip anything
> from the trace. If you stripped stuff, please send me the full trace or upload
> it somewhere.
> 
> Can you apply the following debug patch and enable the hrtimer_start trace
> point and send me the full trace or upload it somewhere?

The original trace was about 29GB and I filtered it with  "0000000066dda1ea" (the offending base) to generate a 1.4GB  file that I could open and investigate.  I am not sure how I can send them to you. Do you have somewhere I can upload to?

I can try your debug patch and again I am anticipating a big trace file.

Kaike

> 
> Thanks,
> 
> 	tglx
> 
> 8<-------------------------
> --- a/lib/timerqueue.c
> +++ b/lib/timerqueue.c
> @@ -57,6 +57,10 @@ bool timerqueue_add(struct timerqueue_he
>  	rb_link_node(&node->node, parent, p);
>  	rb_insert_color(&node->node, &head->head);
> 
> +	trace_printk("head %p next %p next->exp %llu node %p node-
> >exp %llu\n",
> +		     head, head->next, head->next ? head->next->expires :
> 0ULL,
> +		     node, node->expires);
> +
>  	if (!head->next || node->expires < head->next->expires) {
>  		head->next = node;
>  		return true;
> @@ -84,6 +88,11 @@ bool timerqueue_del(struct timerqueue_he
> 
>  		head->next = rb_entry_safe(rbn, struct timerqueue_node,
> node);
>  	}
> +
> +	trace_printk("head %p next %p next->exp %llu node %p node-
> >exp %llu\n",
> +		     head, head->next, head->next ? head->next->expires :
> 0ULL,
> +		     node, node->expires);
> +
>  	rb_erase(&node->node, &head->head);
>  	RB_CLEAR_NODE(&node->node);
>  	return head->next != NULL;

  reply	other threads:[~2018-04-23 13:22 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <3F128C9216C9B84BB6ED23EF16290AFB634C8C87@CRSMSX101.amr.corp.intel.com>
2018-04-23 12:53 ` hrtimer (rdmavt RNR timer) was lost Thomas Gleixner
2018-04-23 13:22   ` Wan, Kaike [this message]
2018-04-23 13:35     ` Thomas Gleixner
2018-04-24 14:54       ` Thomas Gleixner
2018-04-24 16:48         ` Frederic Weisbecker
2018-04-24 19:22         ` [PATCH] tick/sched: Do not mess with an enqueued hrtimer Thomas Gleixner
2018-04-25 13:22           ` Frederic Weisbecker
2018-04-25 14:21           ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
2018-04-26 12:56           ` [PATCH] " Wan, Kaike
2018-04-26 12:57           ` [tip:timers/urgent] " tip-bot for Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3F128C9216C9B84BB6ED23EF16290AFB634C9420@CRSMSX101.amr.corp.intel.com \
    --to=kaike.wan@intel.com \
    --cc=dennis.dalessandro@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=john.fleck@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mike.marciniszyn@intel.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.