linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found] <571ACEFD467F7749BC50E0A98C17CDD8D5FDBB@pdsmsx403.ccr.corp.intel.com>
@ 2003-10-28 23:56 ` Noah J. Misch
  0 siblings, 0 replies; 5+ messages in thread
From: Noah J. Misch @ 2003-10-28 23:56 UTC (permalink / raw)
  To: Li, Shaohua; +Cc: acpi-devel, linux-kernel

On Tue, 28 Oct 2003, Li, Shaohua wrote:

> The reason is just what you said. It's a race condition.

> In UP, the problem occurs also. But it can't cause anything, because spinlock
> does nothing in UP. In some machines which have the problem in UP, you will
> see some info like this:
> Hwregs-0760[35] hw_low_level_read: Unsupported address space:C8

It's clearly a spinlock that leads up to Alex's oops, but the T40 oops?

> > It seems rather theoretical, but perhaps we could fix it with a patch like
> > the following.  I tested it for kicks and didn't hit any problems, but I'm
> > afraid it risks more problems than it solves.  Thoughts?
>
> Did just discarding the events cause problem?

Hmmm.  I don't imagine it would.  Indeed, registering the address space handler
alone may be good enough for ECDT use.  I'd favor that.

Regardless, we still have the Thinkpad T40 problem.

On Tue, 28 Oct 2003, Li, Shaohua wrote:

> one solution is using bh to handle such events before EC device added.
> After EC is loaded, we still use keventd to handle the events. Any idea?

Aren't BHs obsolete?


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
  2003-10-27 16:47 ` Alex Williamson
@ 2003-10-27 18:02   ` Noah J. Misch
  0 siblings, 0 replies; 5+ messages in thread
From: Noah J. Misch @ 2003-10-27 18:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linux-kernel, acpi-devel, shaohua.li, len.brown, jon

On Mon, 27 Oct 2003, Alex Williamson wrote:

> > This was obvious on my system because it has no ECDT table, and as such
> > acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
> > ECDT it would only do so for a brief time during boot, and the problem would be
> > much more subtle.  That's probably why nobody noticed this in earlier tests.
> >
>
>   I don't have an ECDT either.  Is it possible that the setting of
> ec_device_init = 1 is simply misplaced?

It is misplaced.  If revision 1.26 of ec.c were otherwise sound, I would place
ec_device_init = 1 right before the call to acpi_install_gpe_handler in
acpi_ec_start.  Anywhere outside that if and between where _add removes the
handlers and _start installs them would work.  This would fix your crash, but
it's not the right fix.

> I can see why we wouldn't want to call acpi_os_queue_for_execution() early in
> bootup, but there ought to be a fixed point after which it's ok, regardless of
> whether the system has the ECDT table.

I don't think early calls to schedule_work (via acpi_os_queue_for_execution) are
a problem.  The call to init_workqueues is just before do_initcalls in
do_basic_setup, so it happens earlier than all this stuff.

The more general problem is that acpi_ec_gpe_query cannot run in an interrupt
handler as written.  It used to always run from a queue.  We can either fix it
so it can run from an interrupt handler or change it back to never doing so.  I
favor the latter, especially because I don't see how the recent change fixed the
problem T40 users were experiencing.

> Would it be sufficient to set ec_device_init to 1 at the beginning of
> acpi_ec_add(), with no dependency on the ECDT table?

That particular placement looks racy.  I would do it after removing the
handlers, as explained above.

Thanks,
Noah


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
  2003-10-27  8:22 Noah J. Misch
@ 2003-10-27 16:47 ` Alex Williamson
  2003-10-27 18:02   ` Noah J. Misch
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Williamson @ 2003-10-27 16:47 UTC (permalink / raw)
  To: Noah J. Misch; +Cc: linux-kernel, acpi-devel, shaohua.li, len.brown, jon

On Mon, 2003-10-27 at 01:22, Noah J. Misch wrote:

> This problem stems from the changes in revision 1.26 of drivers/acpi/ec.c.
> They come from a patch Shaohua Li submitted for kernel bug 1171 at
> bugme.osdl.org.  That patch can cause acpi_ec_gpe_query to run in interrupt
> context, whereas before it always ran from a workqueue.  It does non-interrupt
> like things, like sleeping and kmalloc'ing with GFP_KERNEL.
> 
> This was obvious on my system because it has no ECDT table, and as such
> acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
> ECDT it would only do so for a brief time during boot, and the problem would be
> much more subtle.  That's probably why nobody noticed this in earlier tests.
> 

  I don't have an ECDT either.  Is it possible that the setting of
ec_device_init = 1 is simply misplaced?  I can see why we wouldn't want
to call acpi_os_queue_for_execution() early in bootup, but there ought
to be a fixed point after which it's ok, regardless of whether the
system has the ECDT table.  Would it be sufficient to set ec_device_init
to 1 at the beginning of acpi_ec_add(), with no dependency on the ECDT
table?

	Alex

-- 
Alex Williamson                             HP Linux & Open Source Lab


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
@ 2003-10-27  8:22 Noah J. Misch
  2003-10-27 16:47 ` Alex Williamson
  0 siblings, 1 reply; 5+ messages in thread
From: Noah J. Misch @ 2003-10-27  8:22 UTC (permalink / raw)
  To: alex.williamson; +Cc: linux-kernel, acpi-devel, shaohua.li, len.brown, jon

Hi,

>    On an Omnibook 500 running test9, removing AC power causes an
> immediate hang.  This laptop is getting a little old and I have to force
> on ACPI support, but this did not happen with test8.  The bug and panic

I have this problem as well, on a Sony Vaio, model PCG-571L.

> are shown below.  It looks like the AML associated with the AC event is
> trying to do an AML_SLEEP_OP.  Since this is called while in the
> interrupt handler, and the eventual call to acpi_os_sleep() sets the
> current state to interruptible... boom.  One simple, but terribly ugly,
> workaround is to make acpi_os_sleep() call acpi_os_stall() if
> in_atomic() is true (patch below).  Hopefully there's a better way to
> fix this.  Somehow the interpreter really needs to drop interrupt
> context before it starts making calls like this.  Thanks,

This problem stems from the changes in revision 1.26 of drivers/acpi/ec.c.
They come from a patch Shaohua Li submitted for kernel bug 1171 at
bugme.osdl.org.  That patch can cause acpi_ec_gpe_query to run in interrupt
context, whereas before it always ran from a workqueue.  It does non-interrupt
like things, like sleeping and kmalloc'ing with GFP_KERNEL.

This was obvious on my system because it has no ECDT table, and as such
acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
ECDT it would only do so for a brief time during boot, and the problem would be
much more subtle.  That's probably why nobody noticed this in earlier tests.

I reversed cset 1.1337.43.3 as follows, and that fixed the problem:

bk export -tpatch -r1.1337.43.3 | patch -p1 -R

I can't figure out why that patch fixed the oops in bug 1171.  It was a hook
into the ec address space handler, not the gpe handler, that led to the oops,
yet the patch seems to only modify gpe-related code.  Perhaps you could explain,
Shaohua?

I'd guess the T40 oops results from the ACPI_MEM_FREE on line 305 of
drivers/acpi/events/evregion.c freeing already-freed memory.  I'm actually not
sure why that free is even there.  I also can't figure why only SMP-configured
kernels exhibited the problem.  If someone has the problem hardware, I am
willing to debug it, however.

The errant patch does address what seems to be a race condition that could play
out as follows:

1) The early ECDT probe locates an ECDT and registers a handler for the relevant
GPE and address space.

2) An IRQ triggers acpi_ec_gpe_handler, which schedules acpi_ec_gpe_query.

3) ACPI scans for devices and adds the "real" embedded controller device,
freeing the (temporary) context of the old GPE query handler.

4) Queue runs acpi_ec_gpe_query with a context that has already been kfree'd,
causing it to fail.

It seems rather theoretical, but perhaps we could fix it with a patch like the
following.  I tested it for kicks and didn't hit any problems, but I'm afraid it
risks more problems than it solves.  Thoughts?

--- 1.27/drivers/acpi/ec.c	Mon Oct 27 03:50:57 2003
+++ edited/drivers/acpi/ec.c	Mon Oct 27 03:51:57 2003
@@ -28,6 +28,7 @@
 #include <linux/init.h>
 #include <linux/types.h>
 #include <linux/delay.h>
+#include <linux/workqueue.h>
 #include <linux/proc_fs.h>
 #include <asm/io.h>
 #include <acpi/acpi_bus.h>
@@ -593,6 +594,10 @@
 			ACPI_ADR_SPACE_EC, &acpi_ec_space_handler);

 		acpi_remove_gpe_handler(NULL, ec_ecdt->gpe_bit, &acpi_ec_gpe_handler);
+
+		/* Clear any pending GPE queries before freeing the context for
+		   their handlers */
+		flush_scheduled_work();

 		kfree(ec_ecdt);
 	}



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [BUG] test9 ACPI bad: scheduling while atomic!
@ 2003-10-27  2:23 Alex Williamson
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Williamson @ 2003-10-27  2:23 UTC (permalink / raw)
  To: linux-kernel


   On an Omnibook 500 running test9, removing AC power causes an
immediate hang.  This laptop is getting a little old and I have to force
on ACPI support, but this did not happen with test8.  The bug and panic
are shown below.  It looks like the AML associated with the AC event is
trying to do an AML_SLEEP_OP.  Since this is called while in the
interrupt handler, and the eventual call to acpi_os_sleep() sets the
current state to interruptible... boom.  One simple, but terribly ugly,
workaround is to make acpi_os_sleep() call acpi_os_stall() if
in_atomic() is true (patch below).  Hopefully there's a better way to
fix this.  Somehow the interpreter really needs to drop interrupt
context before it starts making calls like this.  Thanks,

	Alex

bad: scheduling while atomic!
Call Trace:
 [<c011b9c5>] schedule+0x5a5/0x5b0
 [<c01cd1f6>] acpi_ut_acquire_from_cache+0x48/0xa2
 [<c0126d53>] __mod_timer+0x123/0x170
 [<c01278a3>] schedule_timeout+0x63/0xc0
 [<c0127830>] process_timeout+0x0/0x10
 [<c01c2b40>] acpi_ex_system_do_suspend+0x1f/0x28
 [<c01c1b84>] acpi_ex_opcode_1A_0T_0R+0x48/0x8d
 [<c01bbd16>] acpi_ds_exec_end_op+0xa2/0x228
 [<c01c8b95>] acpi_ps_parse_loop+0x518/0x823
 [<c01ce042>] acpi_ut_acquire_mutex+0x5c/0x72
 [<c01cd155>] acpi_ut_release_to_cache+0x31/0x8a
 [<c01ce0bf>] acpi_ut_release_mutex+0x67/0x6c
 [<c01ce1f1>] acpi_ut_delete_generic_state+0xb/0xe
 [<c01ceef3>] acpi_ut_update_object_reference+0x1d0/0x20e
 [<c01cef6e>] acpi_ut_remove_reference+0x21/0x27
 [<c01bc1fb>] acpi_ds_call_control_method+0x146/0x181
 [<c01c8eee>] acpi_ps_parse_aml+0x4e/0x17c
 [<c01c979d>] acpi_psx_execute+0x155/0x1a0
 [<c01c6b2d>] acpi_ns_execute_control_method+0x43/0x52
 [<c01c6ac6>] acpi_ns_evaluate_by_handle+0x6f/0x93
 [<c01c69b9>] acpi_ns_evaluate_relative+0x9d/0xb4
 [<c01c531c>] acpi_hw_low_level_read+0x5e/0xa3
 [<c01c531c>] acpi_hw_low_level_read+0x5e/0xa3
 [<c01c62b8>] acpi_evaluate_object+0x10f/0x1be
 [<c01d21ce>] acpi_ec_gpe_query+0xdd/0xf4
 [<c01bf3aa>] acpi_ev_gpe_dispatch+0x33/0x105
 [<c01bf273>] acpi_ev_gpe_detect+0xc7/0x118
 [<c01bdea1>] acpi_ev_sci_xrupt_handler+0x11/0x18
 [<c01b9fef>] acpi_irq+0xc/0x16
 [<c010b6b9>] handle_IRQ_event+0x49/0x80
 [<c01b9fe3>] acpi_irq+0x0/0x16
 [<c010ba5f>] do_IRQ+0x8f/0x130
 [<c0109d98>] common_interrupt+0x18/0x20
 [<c01230ce>] do_softirq+0x3e/0xa0
 [<c010bacb>] do_IRQ+0xfb/0x130
 [<c0109d98>] common_interrupt+0x18/0x20
 [<c01d441b>] acpi_processor_idle+0x126/0x1c5
 [<c0105000>] _stext+0x0/0x60
 [<c01070e4>] cpu_idle+0x34/0x40
 [<c035475d>] start_kernel+0x14d/0x160
 [<c03544b0>] unknown_bootoption+0x0/0x120

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c011b519
*pde = 00000000
Oops: 0002 [#1]
CPU:    0
EIP:    0060:[<c011b519>]    Not tainted
EFLAGS: 00010097
EIP is at schedule+0xf9/0x5b0
eax: 00000001   ebx: c02e4b00   ecx: c02e4b20   edx: c53d5f5a
esi: 00000000   edi: dfe06cc0   ebp: c0353c48   esp: c0353c04
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c0352000 task=c02e4b00)
Stack: c02b75a0 0000003c c01cd1f6 00000009 00000000 0000000a 00000001 00000040 
       c0352000 00000246 c0126d53 17b03c66 dced9bc0 0000000b fffca45e c0353c5c 
       dfe06cc0 00000000 c01278a3 c0353c5c fffca45e 00000000 c039ec38 c039ec38 
Call Trace:
 [<c01cd1f6>] acpi_ut_acquire_from_cache+0x48/0xa2
 [<c0126d53>] __mod_timer+0x123/0x170
 [<c01278a3>] schedule_timeout+0x63/0xc0
 [<c0127830>] process_timeout+0x0/0x10
 [<c01c2b40>] acpi_ex_system_do_suspend+0x1f/0x28
 [<c01c1b84>] acpi_ex_opcode_1A_0T_0R+0x48/0x8d
 [<c01bbd16>] acpi_ds_exec_end_op+0xa2/0x228
 [<c01c8b95>] acpi_ps_parse_loop+0x518/0x823
 [<c01ce042>] acpi_ut_acquire_mutex+0x5c/0x72
 [<c01cd155>] acpi_ut_release_to_cache+0x31/0x8a
 [<c01ce0bf>] acpi_ut_release_mutex+0x67/0x6c
 [<c01ce1f1>] acpi_ut_delete_generic_state+0xb/0xe
 [<c01ceef3>] acpi_ut_update_object_reference+0x1d0/0x20e
 [<c01cef6e>] acpi_ut_remove_reference+0x21/0x27
 [<c01bc1fb>] acpi_ds_call_control_method+0x146/0x181
 [<c01c8eee>] acpi_ps_parse_aml+0x4e/0x17c
 [<c01c979d>] acpi_psx_execute+0x155/0x1a0
 [<c01c6b2d>] acpi_ns_execute_control_method+0x43/0x52
 [<c01c6ac6>] acpi_ns_evaluate_by_handle+0x6f/0x93
 [<c01c69b9>] acpi_ns_evaluate_relative+0x9d/0xb4
 [<c01c531c>] acpi_hw_low_level_read+0x5e/0xa3
 [<c01c531c>] acpi_hw_low_level_read+0x5e/0xa3
 [<c01c62b8>] acpi_evaluate_object+0x10f/0x1be
 [<c01d21ce>] acpi_ec_gpe_query+0xdd/0xf4
 [<c01bf3aa>] acpi_ev_gpe_dispatch+0x33/0x105
 [<c01bf273>] acpi_ev_gpe_detect+0xc7/0x118
 [<c01bdea1>] acpi_ev_sci_xrupt_handler+0x11/0x18
 [<c01b9fef>] acpi_irq+0xc/0x16
 [<c010b6b9>] handle_IRQ_event+0x49/0x80
 [<c01b9fe3>] acpi_irq+0x0/0x16
 [<c010ba5f>] do_IRQ+0x8f/0x130
 [<c0109d98>] common_interrupt+0x18/0x20
 [<c01230ce>] do_softirq+0x3e/0xa0
 [<c010bacb>] do_IRQ+0xfb/0x130
 [<c0109d98>] common_interrupt+0x18/0x20
 [<c01d441b>] acpi_processor_idle+0x126/0x1c5
 [<c0105000>] _stext+0x0/0x60
 [<c01070e4>] cpu_idle+0x34/0x40
 [<c035475d>] start_kernel+0x14d/0x160
 [<c03544b0>] unknown_bootoption+0x0/0x120

Code: ff 0e 8b 51 04 8b 43 20 89 50 04 89 02 c7 41 04 00 02 20 00 
 <0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing

--- linux/drivers/acpi/osl.c~	2003-10-26 20:14:19.000000000 -0700
+++ linux/drivers/acpi/osl.c	2003-10-26 19:53:33.000000000 -0700
@@ -40,6 +40,7 @@
 #include <asm/io.h>
 #include <acpi/acpi_bus.h>
 #include <asm/uaccess.h>
+#include <asm/hardirq.h>
 
 #ifdef CONFIG_ACPI_EFI
 #include <linux/efi.h>
@@ -290,8 +291,12 @@
 void
 acpi_os_sleep(u32 sec, u32 ms)
 {
-	current->state = TASK_INTERRUPTIBLE;
-	schedule_timeout(HZ * sec + (ms * HZ) / 1000);
+	if (!in_atomic()) {
+		current->state = TASK_INTERRUPTIBLE;
+		schedule_timeout(HZ * sec + (ms * HZ) / 1000);
+	} else {
+		acpi_os_stall(sec * 1000000 + ms * 1000);
+	}
 }
 
 void



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2003-10-28 23:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <571ACEFD467F7749BC50E0A98C17CDD8D5FDBB@pdsmsx403.ccr.corp.intel.com>
2003-10-28 23:56 ` [BUG] test9 ACPI bad: scheduling while atomic! Noah J. Misch
2003-10-27  8:22 Noah J. Misch
2003-10-27 16:47 ` Alex Williamson
2003-10-27 18:02   ` Noah J. Misch
  -- strict thread matches above, loose matches on Subject: below --
2003-10-27  2:23 Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).