* [sur40] Debugging a race condition?
@ 2015-03-23 11:57 Florian Echtler
  2015-03-23 15:47 ` Florian Echtler
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Echtler @ 2015-03-23 11:57 UTC (permalink / raw)
  To: linux-input, LMML, Hans Verkuil, Benjamin Tissoires, Dmitry Torokhov

Hello everyone,

now that I'm using the newly merged sur40 video driver in a development
environment, I've noticed that a custom V4L2 application we've been
using in our lab will sometimes trigger a hard lockup of the machine
(_nothing_ works anymore, no VT switching, no network, not even Magic
SysRq).

This doesn't happen with plain old cheese or v4l2-compliance, only with
our custom application and only under X11, i.e. as far as I can tell,
when the input device is being polled at the same time. However, I have
a really hard time tracking this down, as even SysRq doesn't work
anymore. A console continuously dumping dmesg or strace of our tool
didn't really help, either.

I assume that somehow the input_polldev thread is put to sleep/waiting
for a lock due to the video functions and that causes the lockup, but I
can't really tell where that might happen. Can somebody with better
knowledge of the internals give some suggestions?

Thanks & best regards, Florian
-- 
SENT FROM MY DEC VT50 TERMINAL



* Re: [sur40] Debugging a race condition?
  2015-03-23 11:57 [sur40] Debugging a race condition? Florian Echtler
@ 2015-03-23 15:47 ` Florian Echtler
  2015-03-25  6:52   ` input_polldev interval (was Re: [sur40] Debugging a race condition)? Florian Echtler
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Echtler @ 2015-03-23 15:47 UTC (permalink / raw)
  To: linux-input, LMML, Hans Verkuil, Benjamin Tissoires,
	Dmitry Torokhov, Laurent Pinchart

Additional note: this happens almost never with the original code using
dma-contig, which is why I didn't catch it during testing. I've now
switched back and forth between the two versions multiple times, and
it's definitely a lot less stable with dma-sg and usb_sg_init/_wait.
Maybe that can help somebody in narrowing down the reason of the problem?

Best, Florian

On 23.03.2015 12:57, Florian Echtler wrote:
> Hello everyone,
> 
> now that I'm using the newly merged sur40 video driver in a development
> environment, I've noticed that a custom V4L2 application we've been
> using in our lab will sometimes trigger a hard lockup of the machine
> (_nothing_ works anymore, no VT switching, no network, not even Magic
> SysRq).
> 
> This doesn't happen with plain old cheese or v4l2-compliance, only with
> our custom application and only under X11, i.e. as far as I can tell,
> when the input device is being polled at the same time. However, I have
> a really hard time tracking this down, as even SysRq doesn't work
> anymore. A console continuously dumping dmesg or strace of our tool
> didn't really help, either.
> 
> I assume that somehow the input_polldev thread is put to sleep/waiting
> for a lock due to the video functions and that causes the lockup, but I
> can't really tell where that might happen. Can somebody with better
> knowledge of the internals give some suggestions?
> 
> Thanks & best regards, Florian
> 


-- 
SENT FROM MY DEC VT50 TERMINAL



* input_polldev interval (was Re: [sur40] Debugging a race condition)?
  2015-03-23 15:47 ` Florian Echtler
@ 2015-03-25  6:52   ` Florian Echtler
  2015-03-25 13:23     ` Dmitry Torokhov
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Echtler @ 2015-03-25  6:52 UTC (permalink / raw)
  To: linux-input, LMML, Hans Verkuil, Benjamin Tissoires,
	Dmitry Torokhov, Laurent Pinchart

Sorry for the continued noise, but this bug/crash is proving quite difficult to nail down.

Currently, I'm setting the interval for input_polldev to 10 ms. However, with video data being retrieved at the same time, it's quite possible that one iteration of poll() will take longer than that. Could this ultimately be the reason? What happens if a new poll() call is scheduled before the previous one completes?

Best, Florian

On March 23, 2015 4:47:19 PM CET, Florian Echtler <floe@butterbrot.org> wrote:
>Additional note: this happens almost never with the original code using
>dma-contig, which is why I didn't catch it during testing. I've now
>switched back and forth between the two versions multiple times, and
>it's definitely a lot less stable with dma-sg and usb_sg_init/_wait.
>Maybe that can help somebody in narrowing down the reason of the
>problem?
>
>Best, Florian
>
>On 23.03.2015 12:57, Florian Echtler wrote:
>> Hello everyone,
>> 
>> now that I'm using the newly merged sur40 video driver in a development
>> environment, I've noticed that a custom V4L2 application we've been
>> using in our lab will sometimes trigger a hard lockup of the machine
>> (_nothing_ works anymore, no VT switching, no network, not even Magic
>> SysRq).
>> 
>> This doesn't happen with plain old cheese or v4l2-compliance, only with
>> our custom application and only under X11, i.e. as far as I can tell,
>> when the input device is being polled at the same time. However, I have
>> a really hard time tracking this down, as even SysRq doesn't work
>> anymore. A console continuously dumping dmesg or strace of our tool
>> didn't really help, either.
>> 
>> I assume that somehow the input_polldev thread is put to sleep/waiting
>> for a lock due to the video functions and that causes the lockup, but I
>> can't really tell where that might happen. Can somebody with better
>> knowledge of the internals give some suggestions?
>> 
>> Thanks & best regards, Florian
>> 

-- 
SENT FROM MY PDP-11


* Re: input_polldev interval (was Re: [sur40] Debugging a race condition)?
  2015-03-25  6:52   ` input_polldev interval (was Re: [sur40] Debugging a race condition)? Florian Echtler
@ 2015-03-25 13:23     ` Dmitry Torokhov
  2015-03-25 14:10       ` Florian Echtler
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Torokhov @ 2015-03-25 13:23 UTC (permalink / raw)
  To: Florian Echtler, linux-input, LMML, Hans Verkuil,
	Benjamin Tissoires, Laurent Pinchart

Hi Florian,

On March 24, 2015 11:52:54 PM PDT, Florian Echtler <floe@butterbrot.org> wrote:
>Sorry for the continued noise, but this bug/crash is proving quite
>difficult to nail down.
>
>Currently, I'm setting the interval for input_polldev to 10 ms.
>However, with video data being retrieved at the same time, it's quite
>possible that one iteration of poll() will take longer than that. Could
>this ultimately be the reason? What happens if a new poll() call is
>scheduled before the previous one completes?

This can't happen as we schedule the next poll only after current one completes.

Thanks.
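
To make the rescheduling point concrete, here is a small user-space model
(not kernel code) of a poller that queues its next run only after the
current poll has returned. The names model_poll_schedule and
model_polls_overlap are made up for this sketch, and the timings are
illustrative. It assumes the next run is armed one full interval after
completion, so a slow poll() delays the following one instead of
overlapping with it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: poll i starts at starts_ms[i] and runs for
 * durations_ms[i]; the next poll is armed interval_ms after the
 * current one has returned. Returns the time the last poll finishes.
 */
static inline unsigned long model_poll_schedule(unsigned int interval_ms,
                                                const unsigned int *durations_ms,
                                                size_t n,
                                                unsigned long *starts_ms)
{
    unsigned long now_ms = 0;
    size_t i;

    assert(durations_ms != NULL && starts_ms != NULL);
    for (i = 0; i < n; i++) {
        starts_ms[i] = now_ms;
        now_ms += durations_ms[i];   /* poll() body runs to completion */
        if (i + 1 < n)
            now_ms += interval_ms;   /* only now is the next poll queued */
    }
    return now_ms;
}

/* Returns 1 if any poll starts before the previous one has finished. */
static inline int model_polls_overlap(const unsigned long *starts_ms,
                                      const unsigned int *durations_ms,
                                      size_t n)
{
    size_t i;

    assert(starts_ms != NULL && durations_ms != NULL);
    for (i = 0; i + 1 < n; i++)
        if (starts_ms[i + 1] < starts_ms[i] + durations_ms[i])
            return 1;
    return 0;
}
```

With a 10 ms interval and polls that take 25, 5 and 40 ms, the model
starts them at 0, 35 and 50 ms: the effective polling rate drops, but no
two polls ever run at the same time.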

-- 
Dmitry


* Re: input_polldev interval (was Re: [sur40] Debugging a race condition)?
  2015-03-25 13:23     ` Dmitry Torokhov
@ 2015-03-25 14:10       ` Florian Echtler
  2015-03-26 21:10           ` Antonio Ospite
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Echtler @ 2015-03-25 14:10 UTC (permalink / raw)
  To: Dmitry Torokhov, linux-input, LMML, Hans Verkuil,
	Benjamin Tissoires, Laurent Pinchart

Hello Dmitry,

On 25.03.2015 14:23, Dmitry Torokhov wrote:
> On March 24, 2015 11:52:54 PM PDT, Florian Echtler <floe@butterbrot.org> wrote:
>> Currently, I'm setting the interval for input_polldev to 10 ms.
>> However, with video data being retrieved at the same time, it's quite
>> possible that one iteration of poll() will take longer than that. Could
>> this ultimately be the reason? What happens if a new poll() call is
>> scheduled before the previous one completes?
> 
> This can't happen as we schedule the next poll only after current one completes.
> 
Thanks - any other suggestions how to debug such a complete freeze? I
have the following options enabled in my kernel config:

CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_EARLY_PRINTK_EFI=y

Unfortunately, even after the system is frozen for several minutes, I
never get to see a panic message. Maybe it's there on the console
somewhere, but the screen never switches away from X (and as mentioned
earlier, I think this bug can only be triggered from within X). Network
also freezes, so I don't think netconsole will help?
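
For context, a rough sketch of the check the hard-lockup detector is
built around, as far as I understand it: a periodic timer interrupt bumps
a per-CPU counter, and an NMI-driven check reports a lockup when that
counter has not moved since the previous check. The code below is a
user-space model with made-up names (cpu_watchdog_model, model_timer_irq,
model_nmi_check), not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CPU's watchdog state. */
struct cpu_watchdog_model {
    unsigned long timer_irqs;        /* bumped by the periodic timer interrupt */
    unsigned long timer_irqs_saved;  /* value seen at the previous NMI check */
};

/* Modelled timer interrupt: only runs while interrupts are enabled. */
static inline void model_timer_irq(struct cpu_watchdog_model *w)
{
    assert(w != NULL);
    w->timer_irqs++;
}

/*
 * Modelled NMI check: returns true when no timer interrupt has been
 * counted since the previous check, i.e. the CPU looks hard-locked.
 */
static inline bool model_nmi_check(struct cpu_watchdog_model *w)
{
    assert(w != NULL);
    if (w->timer_irqs == w->timer_irqs_saved)
        return true;
    w->timer_irqs_saved = w->timer_irqs;
    return false;
}
```

A CPU spinning with interrupts disabled stops the counter, so the check
fires on the next NMI. That only helps if the NMI is still delivered and
the report reaches a console that is actually visible. To the best of my
knowledge the detector also only warns by default, and panics only if
that is requested via CONFIG_BOOTPARAM_HARDLOCKUP_PANIC or
nmi_watchdog=panic.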

Best, Florian
-- 
SENT FROM MY DEC VT50 TERMINAL



* Re: input_polldev interval (was Re: [sur40] Debugging a race condition)?
  2015-03-25 14:10       ` Florian Echtler
@ 2015-03-26 21:10           ` Antonio Ospite
  0 siblings, 0 replies; 8+ messages in thread
From: Antonio Ospite @ 2015-03-26 21:10 UTC (permalink / raw)
  To: Florian Echtler
  Cc: Dmitry Torokhov, linux-input, LMML, Hans Verkuil,
	Benjamin Tissoires, Laurent Pinchart

On Wed, 25 Mar 2015 15:10:44 +0100
Florian Echtler <floe@butterbrot.org> wrote:

> Hello Dmitry,
> 
> On 25.03.2015 14:23, Dmitry Torokhov wrote:
> > On March 24, 2015 11:52:54 PM PDT, Florian Echtler <floe@butterbrot.org> wrote:
> >> Currently, I'm setting the interval for input_polldev to 10 ms.
> >> However, with video data being retrieved at the same time, it's quite
> >> possible that one iteration of poll() will take longer than that. Could
> >> this ultimately be the reason? What happens if a new poll() call is
> >> scheduled before the previous one completes?
> > 
> > This can't happen as we schedule the next poll only after current one completes.
> > 
> Thanks - any other suggestions how to debug such a complete freeze? I
> have the following options enabled in my kernel config:
> 
> CONFIG_LOCKUP_DETECTOR=y
> CONFIG_HARDLOCKUP_DETECTOR=y
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_EARLY_PRINTK=y
> CONFIG_EARLY_PRINTK_DBGP=y
> CONFIG_EARLY_PRINTK_EFI=y
> 
> Unfortunately, even after the system is frozen for several minutes, I
> never get to see a panic message. Maybe it's there on the console
> somewhere, but the screen never switches away from X (and as mentioned
> earlier, I think this bug can only be triggered from within X). Network
> also freezes, so I don't think netconsole will help?
> 

PSTORE + some EFI/ACPI mechanism, maybe?
http://lwn.net/Articles/434821/

However I have never tried that myself and I don't know if all the
needed bits are in linux already.

JFTR, on some embedded system I worked on in the past the RAM content
was preserved across resets and, after a crash, we used to dump the RAM
from a second stage bootloader (i.e. before loading another linux
instance) and then scrape the dump to look for the kernel messages, but
AFAIK this is not going to be reliable —or even possible— on a more
complex system.
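
As a starting point for the pstore route on an EFI machine, a minimal
config fragment might look like the one below. The option names are the
mainline Kconfig symbols as far as I know; whether the firmware on a
given board actually keeps the record across this kind of freeze is an
assumption that needs testing.

```
CONFIG_PSTORE=y
CONFIG_EFI_VARS_PSTORE=y
```

After the next boot, any saved records should show up as files once
pstore is mounted (mount -t pstore pstore /sys/fs/pstore). Note that, as
far as I understand it, records are only written when the kernel oopses
or panics, so a freeze that never reaches a panic would leave nothing
behind.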

Ciao,
   Antonio

-- 
Antonio Ospite
http://ao2.it

A: Because it messes up the order in which people normally read text.
   See http://en.wikipedia.org/wiki/Posting_style
Q: Why is top-posting such a bad thing?


* Re: input_polldev interval (was Re: [sur40] Debugging a race condition)?
  2015-03-26 21:10           ` Antonio Ospite
@ 2015-03-27  9:09           ` Florian Echtler
  -1 siblings, 0 replies; 8+ messages in thread
From: Florian Echtler @ 2015-03-27  9:09 UTC (permalink / raw)
  To: Antonio Ospite
  Cc: Dmitry Torokhov, linux-input, LMML, Hans Verkuil,
	Benjamin Tissoires, Laurent Pinchart

Hello Antonio,

On 26.03.2015 22:10, Antonio Ospite wrote:
> On Wed, 25 Mar 2015 15:10:44 +0100
> Florian Echtler <floe@butterbrot.org> wrote:
>>
>> Thanks - any other suggestions how to debug such a complete freeze? I
>> have the following options enabled in my kernel config:
>>
>> Unfortunately, even after the system is frozen for several minutes, I
>> never get to see a panic message. Maybe it's there on the console
>> somewhere, but the screen never switches away from X (and as mentioned
>> earlier, I think this bug can only be triggered from within X). Network
>> also freezes, so I don't think netconsole will help?
> 
> PSTORE + some EFI/ACPI mechanism, maybe?
> http://lwn.net/Articles/434821/
> 
> However I have never tried that myself and I don't know if all the
> needed bits are in linux already.
> 
> JFTR, on some embedded system I worked on in the past the RAM content
> was preserved across resets and, after a crash, we used to dump the RAM
> from a second stage bootloader (i.e. before loading another linux
> instance) and then scrape the dump to look for the kernel messages, but
> AFAIK this is not going to be reliable —or even possible— on a more
> complex system.

thanks for your suggestions - however, this is a regular x86 system, so
what I will try next is to reproduce the crash in a Virtualbox instance
with the SUR40 device routed to the guest using USB passthrough and the
serial console routed to the host. Hope this will give some clues.

One more general question: what are possible reasons for a complete
freeze? Only a spinlock being held with interrupts disabled, or are
there other possibilities?

Best, Florian
-- 
SENT FROM MY DEC VT50 TERMINAL


