linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
@ 2015-03-23 16:57 Trevor Cordes
  2015-04-30  4:27 ` Trevor Cordes
  0 siblings, 1 reply; 8+ messages in thread
From: Trevor Cordes @ 2015-03-23 16:57 UTC (permalink / raw)
  To: linux-kernel

Hello everyone, this is my first attempt at bisecting a kernel to solve a 
bug.  Please bear with me.

I have successfully bisected and located a commit that is causing my 
problem.  Look at commit 166afb64511.

ktime_to_us returns s64, but the commit changes it so ktime_to_us just 
returns what ktime_divns returns, and ktime_divns returns a u64!  If the 
u64 is big enough, wouldn't it wrap s64 around to a negative number?  Or, 
perhaps if some caller is passing in negative ktime_t to begin with it 
will trigger without having to hit big numbers.  With my limited knowledge 
of C, I am stabbing in the dark here.

That's just my guess as to why this commit causes my problem.  My bug 
symptom is my previously working MythTV lirc blaster no longer reliably 
sends IR signals.  Using irsend to test I can see irsend is just timing 
out (and only sometimes blasts, usually the first attempt).  On good 
kernels it returns immediately after blasting.

This little patch (at bottom of email) that puts the code back in place 
and gets rid of the function call fixes the problem for me.  I applied 
this patch to the very latest FC21 kernel-PAE-3.19.1-201.fc21.i686 src.rpm 
and rpmbuilded and the bug is gone!  I can once again MythTV.  Hooray.

I suspect no one else is seeing this because less people are running 
32-bit now, and perhaps in most code paths the value of the u64 never gets 
above 2^63.  I suspect something in drivers/media (possibly) is passing 
very high or negative values (possibly another bug) to these calls.

Obviously my patch isn't the real solution, the real solution is to make 
the new function calls use a consistent 64-bit type, or figure out what in 
my code path is calling these functions and check it for value sanity.

I've documented the whole process / details of this bug in RHBZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1200353

Thanks!

diff -uNr a/include/linux/ktime.h b/include/linux/ktime.h
--- a/include/linux/ktime.h	2015-02-08 20:54:22.000000000 -0600
+++ b/include/linux/ktime.h	2015-03-23 01:09:43.000000000 -0500
@@ -173,12 +173,16 @@
 
 static inline s64 ktime_to_us(const ktime_t kt)
 {
-	return ktime_divns(kt, NSEC_PER_USEC);
+/*	return ktime_divns(kt, NSEC_PER_USEC); */
+	struct timeval tv = ktime_to_timeval(kt);
+	return (s64) tv.tv_sec * USEC_PER_SEC + tv.tv_usec;
 }
 
 static inline s64 ktime_to_ms(const ktime_t kt)
 {
-	return ktime_divns(kt, NSEC_PER_MSEC);
+/*	return ktime_divns(kt, NSEC_PER_MSEC); */
+	struct timeval tv = ktime_to_timeval(kt);
+	return (s64) tv.tv_sec * MSEC_PER_SEC + tv.tv_usec / USEC_PER_MSEC;
 }
 
 static inline s64 ktime_us_delta(const ktime_t later, const ktime_t earlier)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
  2015-03-23 16:57 regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511 Trevor Cordes
@ 2015-04-30  4:27 ` Trevor Cordes
  2015-04-30 10:53   ` One Thousand Gnomes
  2015-04-30 16:35   ` John Stultz
  0 siblings, 2 replies; 8+ messages in thread
From: Trevor Cordes @ 2015-04-30  4:27 UTC (permalink / raw)
  To: linux-kernel, John Stultz, Thomas Gleixner, Josh Boyer

Sorry for the top-posting; Josh Boyer suggested I re-mail this mail
from last month which didn't get any replies.  I'm still having this
weird kernel bug affecting me and I've bisected it down to like 2-4
lines of code.  (I've thought more about my theory regarding
unsigned/signed below and it's probably wrong, so ignore my
prognosticating.)  Please see my rhbz link near the bottom for the full
details.

Thanks everyone!

On 2015-03-23 Trevor Cordes wrote:
> Hello everyone, this is my first attempt at bisecting a kernel to
> solve a bug.  Please bear with me.
> 
> I have successfully bisected and located a commit that is causing my 
> problem.  Look at commit 166afb64511.
> 
> ktime_to_us returns s64, but the commit changes it so ktime_to_us
> just returns what ktime_divns returns, and ktime_divns returns a
> u64!  If the u64 is big enough, wouldn't it wrap s64 around to a
> negative number?  Or, perhaps if some caller is passing in negative
> ktime_t to begin with it will trigger without having to hit big
> numbers.  With my limited knowledge of C, I am stabbing in the dark
> here.
> 
> That's just my guess as to why this commit causes my problem.  My bug 
> symptom is my previously working MythTV lirc blaster no longer
> reliably sends IR signals.  Using irsend to test I can see irsend is
> just timing out (and only sometimes blasts, usually the first
> attempt).  On good kernels it returns immediately after blasting.
> 
> This little patch (at bottom of email) that puts the code back in
> place and gets rid of the function call fixes the problem for me.  I
> applied this patch to the very latest FC21
> kernel-PAE-3.19.1-201.fc21.i686 src.rpm and rpmbuilded and the bug is
> gone!  I can once again MythTV.  Hooray.
> 
> I suspect no one else is seeing this because less people are running 
> 32-bit now, and perhaps in most code paths the value of the u64 never
> gets above 2^63.  I suspect something in drivers/media (possibly) is
> passing very high or negative values (possibly another bug) to these
> calls.
> 
> Obviously my patch isn't the real solution, the real solution is to
> make the new function calls use a consistent 64-bit type, or figure
> out what in my code path is calling these functions and check it for
> value sanity.
> 
> I've documented the whole process / details of this bug in RHBZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1200353
> 
> Thanks!
> 
> diff -uNr a/include/linux/ktime.h b/include/linux/ktime.h
> --- a/include/linux/ktime.h	2015-02-08 20:54:22.000000000 -0600
> +++ b/include/linux/ktime.h	2015-03-23 01:09:43.000000000 -0500
> @@ -173,12 +173,16 @@
>  
>  static inline s64 ktime_to_us(const ktime_t kt)
>  {
> -	return ktime_divns(kt, NSEC_PER_USEC);
> +/*	return ktime_divns(kt, NSEC_PER_USEC); */
> +	struct timeval tv = ktime_to_timeval(kt);
> +	return (s64) tv.tv_sec * USEC_PER_SEC + tv.tv_usec;
>  }
>  
>  static inline s64 ktime_to_ms(const ktime_t kt)
>  {
> -	return ktime_divns(kt, NSEC_PER_MSEC);
> +/*	return ktime_divns(kt, NSEC_PER_MSEC); */
> +	struct timeval tv = ktime_to_timeval(kt);
> +	return (s64) tv.tv_sec * MSEC_PER_SEC + tv.tv_usec /
> USEC_PER_MSEC; }
>  
>  static inline s64 ktime_us_delta(const ktime_t later, const ktime_t
> earlier)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
  2015-04-30  4:27 ` Trevor Cordes
@ 2015-04-30 10:53   ` One Thousand Gnomes
  2015-04-30 16:35   ` John Stultz
  1 sibling, 0 replies; 8+ messages in thread
From: One Thousand Gnomes @ 2015-04-30 10:53 UTC (permalink / raw)
  To: Trevor Cordes; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Josh Boyer

> > Obviously my patch isn't the real solution, the real solution is to
> > make the new function calls use a consistent 64-bit type, or figure
> > out what in my code path is calling these functions and check it for
> > value sanity.

I would be tempted to do both computations and print then inputs as well
as do a WARN_ON when they differ. That will give you a stack trace of the
offender, and providing you return the "correct" value you can see how it
continues functioning afterwards.

Alan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
  2015-04-30  4:27 ` Trevor Cordes
  2015-04-30 10:53   ` One Thousand Gnomes
@ 2015-04-30 16:35   ` John Stultz
  2015-05-01 10:02     ` Trevor Cordes
  1 sibling, 1 reply; 8+ messages in thread
From: John Stultz @ 2015-04-30 16:35 UTC (permalink / raw)
  To: Trevor Cordes; +Cc: lkml, Thomas Gleixner, Josh Boyer

On Wed, Apr 29, 2015 at 9:27 PM, Trevor Cordes <trevor@tecnopolis.ca> wrote:
> Sorry for the top-posting; Josh Boyer suggested I re-mail this mail
> from last month which didn't get any replies.  I'm still having this
> weird kernel bug affecting me and I've bisected it down to like 2-4
> lines of code.  (I've thought more about my theory regarding
> unsigned/signed below and it's probably wrong, so ignore my
> prognosticating.)  Please see my rhbz link near the bottom for the full
> details.

Thanks so much for the report and all the effort to chase down this regression!

>From your description it does seem like some sort of edge case problem
w/ the 32bit ktime_divns(), but I don't see it right off, and I agree
with Alan to do both calculations and print out warn when that
happens.

There's also not a ton of users of that function, but ktime_us_delta()
is used in drivers/media/rc/ir-lirc-codec.c, which makes use of it in
ir_lirc_transmit_ir().

We should instrument that to see if its calculating negative deltas.

I'll send you a debug patch to do the above.

thanks
-john

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
  2015-04-30 16:35   ` John Stultz
@ 2015-05-01 10:02     ` Trevor Cordes
  2015-05-01 17:31       ` John Stultz
  0 siblings, 1 reply; 8+ messages in thread
From: Trevor Cordes @ 2015-05-01 10:02 UTC (permalink / raw)
  To: John Stultz; +Cc: lkml, Thomas Gleixner, Josh Boyer, One Thousand Gnomes

On 2015-04-30 John Stultz wrote:
> >From your description it does seem like some sort of edge case
> >problem
> w/ the 32bit ktime_divns(), but I don't see it right off, and I agree
> with Alan to do both calculations and print out warn when that
> happens.
> 
> There's also not a ton of users of that function, but ktime_us_delta()
> is used in drivers/media/rc/ir-lirc-codec.c, which makes use of it in
> ir_lirc_transmit_ir().
> 
> We should instrument that to see if its calculating negative deltas.

Thanks for looking at this!  I didn't have the know-how to add the
debug code myself (I was scared this fn might be called a zillion times
in other kernel operations and overload the system with debug logging).

> I'll send you a debug patch to do the above.

Got the patch.  Sorry for the delay (takes hours to compile using
rpmbuild on this old P4 box).

I think I have "success" in terms of useful debug output (relevant
lines only):

May  1 04:41:08 piles lircd: 'lirc' written to protocols file /sys/class/rc/rc0/protocols
May  1 04:41:08 piles lircd-0.9.1a[978]: lircd(default) ready, using /var/run/lirc/lircd
May  1 04:41:11 piles lircd-0.9.1a[978]: accepted new client on /var/run/lirc/lircd
May  1 04:41:11 piles kernel: [   55.265023] JDB: ktime_to_us: -20782699 -> divns 18446744073688768 != old method: -20783
May  1 04:44:00 piles lircd-0.9.1a[978]: removed client
May  1 04:44:00 piles lircd-0.9.1a[978]: caught signal
May  1 04:44:00 piles lircd: 'lirc' written to protocols file /sys/class/rc/rc0/protocols
May  1 04:44:00 piles lircd-0.9.1a[1523]: lircd(default) ready, using /var/run/lirc/lircd
May  1 04:45:03 piles lircd-0.9.1a[1523]: accepted new client on /var/run/lirc/lircd
May  1 04:45:03 piles kernel: [  287.445027] JDB: ktime_to_us: -20599906 -> divns 18446744073688951 != old method: -20600
May  1 04:45:37 piles lircd-0.9.1a[1523]: removed client
May  1 04:45:37 piles lircd-0.9.1a[1523]: caught signal
May  1 04:45:37 piles lircd: 'lirc' written to protocols file /sys/class/rc/rc0/protocols
May  1 04:45:37 piles lircd-0.9.1a[1579]: lircd(default) ready, using /var/run/lirc/lircd
May  1 04:45:40 piles lircd-0.9.1a[1579]: accepted new client on /var/run/lirc/lircd
May  1 04:45:40 piles kernel: [  324.209023] JDB: ktime_to_us: -20443355 -> divns 18446744073689108 != old method: -20444
May  1 04:46:12 piles lircd-0.9.1a[1579]: removed client
May  1 04:46:12 piles lircd-0.9.1a[1579]: caught signal
May  1 04:46:12 piles lircd: 'lirc' written to protocols file /sys/class/rc/rc0/protocols
May  1 04:46:12 piles lircd-0.9.1a[1597]: lircd(default) ready, using /var/run/lirc/lircd
May  1 04:46:12 piles lircd-0.9.1a[1597]: accepted new client on /var/run/lirc/lircd
May  1 04:46:12 piles kernel: [  356.838029] JDB: ktime_to_us: -20157485 -> divns 18446744073689394 != old method: -20158

The last 2-3 or 3 groups of output I could produce on demand by stopping mythbackend and running:
systemctl restart lircd.service ; irsend SEND_ONCE dct700 info

Subsequent irsends don't trigger the bug, since (as I found out a
while ago) by that point lircd is "hung", at least for a long while.
Hey!  Maybe lircd is then hung for 18446744073689394 us or ns :-)
If this result is used as a delay timer, the negative would produce
0 delay, and the + number the "hang".  I calculate that hang is 584
years?  :-)

So it looks like maybe my theory wasn't so wacky: we're dealing
with a caller passing negative numbers (or 32/64 weirdness).  Very
strange as it seems the caller *wants* (or is happy with) negative
numbers!

Let me know if you need any more debugging/patch-tests.  But give
me 4+ hours between rpmbuilds (probably my responses will be 24-hr
later).

Thanks a million!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
  2015-05-01 10:02     ` Trevor Cordes
@ 2015-05-01 17:31       ` John Stultz
  2015-05-01 18:29         ` Nicolas Pitre
  0 siblings, 1 reply; 8+ messages in thread
From: John Stultz @ 2015-05-01 17:31 UTC (permalink / raw)
  To: Trevor Cordes; +Cc: lkml, Thomas Gleixner, Josh Boyer, One Thousand Gnomes

On Fri, May 1, 2015 at 3:02 AM, Trevor Cordes <trevor@tecnopolis.ca> wrote:
> May  1 04:46:12 piles kernel: [  356.838029] JDB: ktime_to_us: -20157485 -> divns 18446744073689394 != old method: -20158
>
> The last 2-3 or 3 groups of output I could produce on demand by stopping mythbackend and running:
> systemctl restart lircd.service ; irsend SEND_ONCE dct700 info
>
> Subsequent irsends don't trigger the bug, since (as I found out a
> while ago) by that point lircd is "hung", at least for a long while.
> Hey!  Maybe lircd is then hung for 18446744073689394 us or ns :-)
> If this result is used as a delay timer, the negative would produce
> 0 delay, and the + number the "hang".  I calculate that hang is 584
> years?  :-)
>
> So it looks like maybe my theory wasn't so wacky: we're dealing
> with a caller passing negative numbers (or 32/64 weirdness).  Very
> strange as it seems the caller *wants* (or is happy with) negative
> numbers!

So yea. I see the issue now. The ktime_divns() function uses do_div
which is an unsigned divide. So negative numbers look like large
numbers, and it doesn't do sign-extension, so when it divides we lose
the sign bit.

ktimes are normally positive values, so this probably isn't something
its run-across often, thus why you've just noticed. It does make me
suspicious that the driver is doing something funny still, but I've
gone ahead and written up a patch to fix this since this is clearly an
issue here.

I'll send it out here shortly. If you could give it a spin at your
leisure, and if it works give me a Tested-by: tag I'd appreciate it!

Great work again on chasing this down, and thanks for helping with
debugging and validating the fix!
-john

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
  2015-05-01 17:31       ` John Stultz
@ 2015-05-01 18:29         ` Nicolas Pitre
  2015-05-01 18:51           ` John Stultz
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2015-05-01 18:29 UTC (permalink / raw)
  To: John Stultz
  Cc: Trevor Cordes, lkml, Thomas Gleixner, Josh Boyer, One Thousand Gnomes

On Fri, 1 May 2015, John Stultz wrote:

> So yea. I see the issue now. The ktime_divns() function uses do_div
> which is an unsigned divide. So negative numbers look like large
> numbers, and it doesn't do sign-extension, so when it divides we lose
> the sign bit.

That doesn't stop there. On 32-bit systems, if the divisor is not 
constant or too large, then __ktime_divns() is used. If you look at the 
code for __ktime_divns(), you'll notice it would get into an infinite 
loop if you pass it a negative divisor.


Nicolas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511
  2015-05-01 18:29         ` Nicolas Pitre
@ 2015-05-01 18:51           ` John Stultz
  0 siblings, 0 replies; 8+ messages in thread
From: John Stultz @ 2015-05-01 18:51 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Trevor Cordes, lkml, Thomas Gleixner, Josh Boyer, One Thousand Gnomes

On Fri, May 1, 2015 at 11:29 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Fri, 1 May 2015, John Stultz wrote:
>
>> So yea. I see the issue now. The ktime_divns() function uses do_div
>> which is an unsigned divide. So negative numbers look like large
>> numbers, and it doesn't do sign-extension, so when it divides we lose
>> the sign bit.
>
> That doesn't stop there. On 32-bit systems, if the divisor is not
> constant or too large, then __ktime_divns() is used. If you look at the
> code for __ktime_divns(), you'll notice it would get into an infinite
> loop if you pass it a negative divisor.

True. Though while negative ktime intervals are mostly rare, I suspect
we probably don't have any negative divisors for now. Even so, it
seems like something that could easily get someones foot shot.

Is there any sort of compiler magic to generate warnings if folks are
passing signed types to the divisor?

I'll be sending my first rfc of the fix here shortly. Additional
feedback there would be appreciated!

thanks
-john

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-05-01 18:51 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-23 16:57 regression in ktime.h circa 3.16.0-rc5+ breaks lirc irsend, bad commit 166afb64511 Trevor Cordes
2015-04-30  4:27 ` Trevor Cordes
2015-04-30 10:53   ` One Thousand Gnomes
2015-04-30 16:35   ` John Stultz
2015-05-01 10:02     ` Trevor Cordes
2015-05-01 17:31       ` John Stultz
2015-05-01 18:29         ` Nicolas Pitre
2015-05-01 18:51           ` John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).