* Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board
@ 2011-08-29 18:58 Sankara Muthukrishnan
  2011-08-29 20:13 ` Sankara Muthukrishnan
  2011-09-01 19:33 ` Thomas Gleixner
  0 siblings, 2 replies; 7+ messages in thread
From: Sankara Muthukrishnan @ 2011-08-29 18:58 UTC (permalink / raw)
  To: linux-rt-users

****************** Note **************************
This is not an RT patch issue. I have posted to linux-omap and am
cross-posting to this list, since cyclictest is from this community
and RT folks may run into this issue on OMAP targets (unless it is
specific to the panda board). That said, replies that help narrow
this issue down are most welcome. Apologies to those who are on both
mailing lists. I am not sure what the convention about cross-posting is.
***************************************************
Hello everyone,

Greetings. I have tried the following kernels and found "the problem"
to occur on all of them with the high resolution timer enabled:

(1) mainline stable 3.0.1 kernel with Hemant Pedanekar's patch
(http://www.spinics.net/lists/linux-omap/msg50742.html) and with the
32KHz timer disabled ("System Type -> TI OMAP Common Features ->  Use
32KHz timer")
(2) Same as (1) with RT patch (3.0.1-rt11)
(3) OMAP kernel version v3.1-rc2 (
http://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6.git
)

Problem:
*********
On the panda board, I ran v0.74 of cyclictest
(git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git)
to measure the latency (./cyclictest -l4000000000 -m -S -p99 -i70 -h60
-q -n). These arguments make the test use TIMER_ABSTIME for
clock_nanosleep and pin each of the 2 threads to its own CPU (using
sched_setaffinity). Large latencies are expected without the RT
patch. However, when I ran the tests overnight, I observed a maximum
latency of 4294967103 us (weird, but close to unsigned int max). So I
instrumented the test to print some additional information and exit
as soon as it finds such a weird latency. I was also stressing the
ethernet/network/interrupts of the system with SFTP, but I think (not
very sure) I could reproduce the issue without that. clock_nanosleep
was called to sleep until 27608:739311172 (sec:nsec), but after
clock_nanosleep returned, clock_gettime returned the time as
27608:739117429 (sec:nsec), which is roughly 193 usec earlier than the
value passed to clock_nanosleep, and that is the bug. I also ran the
test with just one thread (remove "-S" and add "-t1 -a1 -n") and saw
a weird latency of 4294967294 usec.
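
For what it is worth, the odd number is consistent with a small
negative latency wrapping around. A minimal sketch, assuming the
latency is computed in microseconds and ends up in a 32-bit unsigned
value (diff_us and as_u32 are hypothetical helpers, not cyclictest
code):

```c
#include <stdint.h>
#include <time.h>

/* Signed difference a - b in microseconds (hypothetical helper). */
static int64_t diff_us(struct timespec a, struct timespec b)
{
	int64_t d = (int64_t)(a.tv_sec - b.tv_sec) * 1000000;

	/* Integer division truncates toward zero: -193743 / 1000 == -193. */
	d += (a.tv_nsec - b.tv_nsec) / 1000;
	return d;
}

/* The same difference once it is squeezed into 32 unsigned bits. */
static uint32_t as_u32(int64_t us)
{
	/* Conversion to an unsigned type is defined as reduction modulo 2^32. */
	return (uint32_t)us;
}
```

With the two timestamps above, diff_us gives -193 and as_u32 turns
that into 4294967103. A latency of -2 us would likewise show up as
4294967294.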

Questions
*************
(1) Is this a known bug? If so, do we already have a fix?
(2) Does anyone have suggestions for narrowing this down further
(timer driver issue vs scheduler/kernel issue)?
(3) I am not too familiar with OMAP and the Linux kernel. Which timer
gets used when I enable the high-resolution timer and disable the
32 KHz timer? Is it part of the "MP core"? Is this timer per CPU? Are
there pointers to the source code for the high-res timer driver?
(4) If the timer is per CPU, are the timers synchronized in hardware?
(5) Within the same process, if a thread (created with
pthread_create) is assigned CPU affinity to a particular core
(sched_setaffinity), is that a soft request to the scheduler, or is it
guaranteed that the thread will not be scheduled on other CPUs at
all? I am asking in order to rule out the possibility that the thread
jumps to a different CPU whose timer is off by quite a bit.
(6) Is it OK to call sched_setaffinity with 0 as the first argument to
set the affinity of a particular pthread in the process? Or should
the value returned by gettid() be passed instead?
(7) Should I post this to any other mailing list as well?
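
For reference on questions (5) and (6): sched_setaffinity(2) documents
that a pid of 0 refers to the calling thread, and that the mask is the
set of CPUs on which the thread is eligible to run. A minimal sketch
of pinning and then reading the mask back, assuming glibc on Linux
(pin_self_to_cpu and pinned_only_to are hypothetical helper names):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* needed for cpu_set_t macros and sched_setaffinity */
#endif
#include <sched.h>

/* Pin the calling thread to one CPU; returns 0 on success, -1 on error. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means "the calling thread", per sched_setaffinity(2). */
	return sched_setaffinity(0, sizeof(set), &set);
}

/* 1 if the calling thread's mask holds only `cpu`, 0 if not, -1 on error. */
static int pinned_only_to(int cpu)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	return CPU_ISSET(cpu, &set) && CPU_COUNT(&set) == 1;
}
```

Called from inside the thread itself, this only changes that thread.
Setting the mask of a different thread would need its tid (from
gettid()) or pthread_setaffinity_np() instead.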

Thanks,
Sankara
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board
  2011-08-29 18:58 Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board Sankara Muthukrishnan
@ 2011-08-29 20:13 ` Sankara Muthukrishnan
  2011-09-01 19:33 ` Thomas Gleixner
  1 sibling, 0 replies; 7+ messages in thread
From: Sankara Muthukrishnan @ 2011-08-29 20:13 UTC (permalink / raw)
  To: linux-rt-users

One thing I forgot to mention: for my tests, I modified the original
cyclictest to check the return codes of the APIs (recording errno on
failure and printing it at the end). Both clock_gettime and
clock_nanosleep returned 0 (success) when the actual failure
happened. So the APIs did not fail, and in particular clock_nanosleep
was not interrupted by a signal.

* Re: Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board
  2011-08-29 18:58 Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board Sankara Muthukrishnan
  2011-08-29 20:13 ` Sankara Muthukrishnan
@ 2011-09-01 19:33 ` Thomas Gleixner
       [not found]   ` <CA+zZMHFhJ81EjM09jRkfyGXtucASf3Ac1KS65TuYw4VDaJ5jFQ@mail.gmail.com>
  1 sibling, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2011-09-01 19:33 UTC (permalink / raw)
  To: Sankara Muthukrishnan; +Cc: linux-rt-users

On Mon, 29 Aug 2011, Sankara Muthukrishnan wrote:
> On panda board, I ran the v0.74 of cyclictest
> (git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git)
> to measure the latency (./cyclictest -l4000000000 -m -S -p99 -i70 -h60

 -i 70 ??? That's a 70us interval. Pretty damned close to what that CPU
 can handle. :)

> -q -n). These arguments make the test to use TIMER_ABSTIME for
> clock_nanosleep and CPU affinity (using sched_setaffinity) for the 2
> threads to be set to each CPU.  It is expected to see large latencies
> without the RT patch. However, when I ran the tests overnight, I
> observed maximum latency of 4294967103 us (weird but it is close to
> unsigned int max). So, I instrumented the test to print some
> additional information and exit as soon as it finds such a weird
> latency. I was also trying to stress ethernet/network/interrupts of
> the system with SFTP but I think (not very sure) I could reproduce the
> issue without that. clock_nanosleep was called to sleep until
> 27608:739311172 (sec:nsec), but after clock_nanosleep returned,
> clock_gettime returned the time as 27608:739117429 (sec:nsec) which is
> roughly 193 usec earlier than the value passed to clock_nanosleep and
> that is the bug. I ran the test with just one thread ( remove "-S" and
> add "-t1 -a1 -n" ) and saw the weird latency of 4294967294 usec.

That looks like a problem in the clocksource, i.e. time is going
backward or having weird momentary jumps. You could verify that by
running one or more tight loops which do

	clock_gettime(CLOCK_MONOTONIC, &prev);
	while (1) {
		clock_gettime(CLOCK_MONOTONIC, &curr);
		if (curr < prev) /* Use a proper compare function for timespec! */
			printf(.....);
		prev = curr;
	}

If that triggers on RT or on a vanilla kernel, then the problem is
definitely somewhere in the timekeeping/clocksource area.

Thanks,

	tglx


* Re: Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board
       [not found]   ` <CA+zZMHFhJ81EjM09jRkfyGXtucASf3Ac1KS65TuYw4VDaJ5jFQ@mail.gmail.com>
@ 2011-09-02  7:37     ` Thomas Gleixner
       [not found]       ` <CA+zZMHGg=Q3rxW9j5iW0CqNNjKbrbSyi6OxQjRe9i1cO5mFRuA@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2011-09-02  7:37 UTC (permalink / raw)
  To: Michael Mueller; +Cc: Sankara Muthukrishnan, linux-rt-users

Mike,

On Thu, 1 Sep 2011, Michael Mueller wrote:
> On Thu, Sep 1, 2011 at 3:33 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> >
> > That looks like a problem in the clocksource. i.e. time is going
> > backward or having weird momentary jumps. You could verify that by
> > running one or more tight loops which do
> >
> >        clock_gettime(CLOCK_MONOTONIC, &prev);
> >        while (1) {
> >              clock_gettime(CLOCK_MONOTONIC, &curr);
> >              if (curr < prev) /* Use a proper compare function for timespec! */
> >                       printf(.....);
> >              prev = curr;
> >        }
> >
> > If that triggers on RT or on a vanilla kernel, then the problem is
> > definitely somewhere in the timekeeping/clocksource area.
> 
> 
> have seen this problem in vanilla kernels; developed a work-around that
> calls
> clock_gettime(CLOCK_MONOTONIC, &curr); until curr>prev
> a couple of times, and if that doesn't work, then it gets a new prev value
> and a new curr value; rinse and repeat until curr<prev; not sure what
> precipitates the problem; will post workaround code if wanted

Is that on a Panda board as well?

Thanks,

	tglx


* Re: Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board
       [not found]       ` <CA+zZMHGg=Q3rxW9j5iW0CqNNjKbrbSyi6OxQjRe9i1cO5mFRuA@mail.gmail.com>
@ 2011-09-02 11:29         ` Michael Mueller
  2011-09-06  4:08           ` Sankara Muthukrishnan
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Mueller @ 2011-09-02 11:29 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Sankara Muthukrishnan, linux-rt-users

repost attempt - kernel.org correctly rejects html email - i blame
gmail - hopefully fixed

On Fri, Sep 2, 2011 at 7:19 AM, Michael Mueller <ss7box@gmail.com> wrote:
>
>
> On Fri, Sep 2, 2011 at 3:37 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Mike,
>>
>> On Thu, 1 Sep 2011, Michael Mueller wrote:
>> > On Thu, Sep 1, 2011 at 3:33 PM, Thomas Gleixner <tglx@linutronix.de>
>> > wrote:
>> >
>> > >
>> > > That looks like a problem in the clocksource. i.e. time is going
>> > > backward or having weird momentary jumps. You could verify that by
>> > > running one or more tight loops which do
>> > >
>> > >        clock_gettime(CLOCK_MONOTONIC, &prev);
>> > >        while (1) {
>> > >              clock_gettime(CLOCK_MONOTONIC, &curr);
>> > >              if (curr < prev) /* Use a proper compare function for timespec! */
>> > >                       printf(.....);
>> > >              prev = curr;
>> > >        }
>> > >
>> > > If that triggers on RT or on a vanilla kernel, then the problem is
>> > > definitely somewhere in the timekeeping/clocksource area.
>> >
>> >
>> > have seen this problem in vanilla kernels; developed a work-around that
>> > calls
>> > clock_gettime(CLOCK_MONOTONIC, &curr); until curr>prev
>> > a couple of times, and if that doesn't work, then it gets a new prev
>> > value
>> > and a new curr value; rinse and repeat until curr<prev; not sure what
>> > precipitates the problem; will post workaround code if wanted
>>
>> Is that on a Panda board as well?
>
> older and newer Dell boxes with 2.6.old kernels; a few weeks ago I
> discovered a newer box time leaping/lagging so much that I had to throttle
> back the syslog messages indicating detection/correction
> if needed I can try to isolate and gather more info - see if a newer kernel
> generates the problem
> edit above: rinse and repeat until curr>prev

* Re: Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board
  2011-09-02 11:29         ` Michael Mueller
@ 2011-09-06  4:08           ` Sankara Muthukrishnan
  2011-09-06 23:00             ` Fernando Gomes
  0 siblings, 1 reply; 7+ messages in thread
From: Sankara Muthukrishnan @ 2011-09-06  4:08 UTC (permalink / raw)
  To: Michael Mueller; +Cc: Thomas Gleixner, linux-rt-users

>On Fri, Sep 2, 2011 at 6:29 AM, Michael Mueller <ss7box@gmail.com> wrote:
>
> >> > >
> >> > >        clock_gettime(CLOCK_MONOTONIC, &prev);
> >> > >        while (1) {
> >> > >              clock_gettime(CLOCK_MONOTONIC, &curr);
> >> > >              if (curr < prev) /* Use a proper compare function for timespec! */
> >> > >                       printf(.....);
> >> > >              prev = curr;
> >> > >        }

I have tried this on 2 panda boards. Both of them fail this simple
test with vanilla kernels (both single core and SMP): 2.6.39-rc4
mainline with the high-resolution timer patch, 3.0.1 mainline with
the high-resolution timer patch, and the 3.1-rc2 OMAP git tree without
any patch but with the high-resolution timer enabled. The failure
shows up within a few seconds to an hour. Time goes backwards by
anywhere from 234 nsec to 53 usec. The problem does not reproduce
without the high resolution timer (i.e. with the 32 KHz timer enabled
in the kernel config) over several hours of testing.
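
When comparing the two configurations, it may help to record which
clocksource the kernel actually selected. A minimal sketch, assuming
the standard sysfs clocksource interface is available on the target
(read_current_clocksource is a hypothetical helper name):

```c
#include <stdio.h>
#include <string.h>

/* Standard sysfs location of the active clocksource name. */
#define CURRENT_CLOCKSOURCE \
	"/sys/devices/system/clocksource/clocksource0/current_clocksource"

/*
 * Copy the active clocksource name (without trailing newline) into buf.
 * Returns 0 on success, -1 if the file cannot be read.
 */
static int read_current_clocksource(char *buf, size_t len)
{
	FILE *f = fopen(CURRENT_CLOCKSOURCE, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

Printing this name next to each backward step would tie a failure to
a specific clocksource.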

Everyone, could you please try this test on any other ARM boards (and
x86 machines?) with high resolution timer support?

* RE: Issue (NOT RELATED TO RT-PATCH) with clock_gettime / clock_nanosleep APIs with high resolution timer on panda board
  2011-09-06  4:08           ` Sankara Muthukrishnan
@ 2011-09-06 23:00             ` Fernando Gomes
  0 siblings, 0 replies; 7+ messages in thread
From: Fernando Gomes @ 2011-09-06 23:00 UTC (permalink / raw)
  To: Sankara Muthukrishnan, Michael Mueller; +Cc: Thomas Gleixner, linux-rt-users

We ran the test today on our custom board based on an OMAP L138 (ARM9), using Linux 3.0 with high resolution timers enabled in the kernel (without the RT patch). The test ran for more than 4 hours without any issue, so it seems that this issue does not appear on all ARM platforms.

Best regards

Fernando

