linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Register atomic_notifiers in atomic context
@ 2006-02-21 15:54 Alan Stern
  2006-02-21 23:28 ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-21 15:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Chandra Seetharaman, Kernel development list

Some atomic notifier chains require registrations to take place in atomic
context.  An example is the die_notifier, which on some architectures may
be accessed very early during the boot-up procedure, before task-switching
is legal.  To accomodate these chains, this patch (as655) replaces the
mutex in the atomic_notifier_head structure with a spinlock.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>

---

Andrew:

This ought to fix the problem you were seeing with the new notifier chains 
on your emt64 system.

Alan Stern



Index: usb-2.6/kernel/sys.c
===================================================================
--- usb-2.6.orig/kernel/sys.c
+++ usb-2.6/kernel/sys.c
@@ -155,7 +155,6 @@ static int __kprobes notifier_call_chain
  *	@n: New entry in notifier chain
  *
  *	Adds a notifier to an atomic notifier chain.
- *	Must be called in process context.
  *
  *	Currently always returns zero.
  */
@@ -163,11 +162,12 @@ static int __kprobes notifier_call_chain
 int atomic_notifier_chain_register(struct atomic_notifier_head *nh,
 		struct notifier_block *n)
 {
+	unsigned long flags;
 	int ret;
 
-	mutex_lock(&nh->mutex);
+	spin_lock_irqsave(&nh->lock, flags);
 	ret = notifier_chain_register(&nh->head, n);
-	mutex_unlock(&nh->mutex);
+	spin_unlock_irqrestore(&nh->lock, flags);
 	return ret;
 }
 
@@ -179,18 +179,18 @@ EXPORT_SYMBOL(atomic_notifier_chain_regi
  *	@n: Entry to remove from notifier chain
  *
  *	Removes a notifier from an atomic notifier chain.
- *	Must be called from process context.
  *
  *	Returns zero on success or %-ENOENT on failure.
  */
 int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh,
 		struct notifier_block *n)
 {
+	unsigned long flags;
 	int ret;
 
-	mutex_lock(&nh->mutex);
+	spin_lock_irqsave(&nh->lock, flags);
 	ret = notifier_chain_unregister(&nh->head, n);
-	mutex_unlock(&nh->mutex);
+	spin_unlock_irqrestore(&nh->lock, flags);
 	synchronize_rcu();
 	return ret;
 }
Index: usb-2.6/include/linux/notifier.h
===================================================================
--- usb-2.6.orig/include/linux/notifier.h
+++ usb-2.6/include/linux/notifier.h
@@ -24,9 +24,9 @@
  *		registration, or unregistration.  All locking and protection
  *		must be provided by the caller.
  *
- * atomic_notifier_chain_register() and blocking_notifier_chain_register()
- * may be called only from process context, and likewise for the
- * corresponding _unregister() routines.
+ * atomic_notifier_chain_register() may be called from an atomic context,
+ * but blocking_notifier_chain_register() must be called from a process
+ * context.  Ditto for the corresponding _unregister() routines.
  *
  * atomic_notifier_chain_unregister() and blocking_notifier_chain_unregister()
  * _must not_ be called from within the call chain.
@@ -39,7 +39,7 @@ struct notifier_block {
 };
 
 struct atomic_notifier_head {
-	struct mutex mutex;
+	spinlock_t lock;
 	struct notifier_block *head;
 };
 
@@ -53,7 +53,7 @@ struct raw_notifier_head {
 };
 
 #define ATOMIC_INIT_NOTIFIER_HEAD(name) do {	\
-		mutex_init(&(name)->mutex);	\
+		spin_lock_init(&(name)->lock);	\
 		(name)->head = NULL;		\
 	} while (0)
 #define BLOCKING_INIT_NOTIFIER_HEAD(name) do {	\
@@ -66,7 +66,7 @@ struct raw_notifier_head {
 
 #define ATOMIC_NOTIFIER_HEAD(name)				\
 	struct atomic_notifier_head name = {			\
-		.mutex = __MUTEX_INITIALIZER((name).mutex),	\
+		.lock = SPIN_LOCK_UNLOCKED,			\
 		.head = NULL }
 #define BLOCKING_NOTIFIER_HEAD(name)				\
 	struct blocking_notifier_head name = {			\


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Register atomic_notifiers in atomic context
  2006-02-21 15:54 [PATCH] Register atomic_notifiers in atomic context Alan Stern
@ 2006-02-21 23:28 ` Andrew Morton
  2006-02-22 16:08   ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2006-02-21 23:28 UTC (permalink / raw)
  To: Alan Stern; +Cc: sekharan, linux-kernel

Alan Stern <stern@rowland.harvard.edu> wrote:
>
> Some atomic notifier chains require registrations to take place in atomic
>  context.  An example is the die_notifier, which on some architectures may
>  be accessed very early during the boot-up procedure, before task-switching
>  is legal.  To accomodate these chains, this patch (as655) replaces the
>  mutex in the atomic_notifier_head structure with a spinlock.

I think that's a good change, however x86_64 still crashes.

At great personal expense (ie: using winxp hyperterminal (I now understand
why some of the traces we get are so crappy)) I have a trace.  It's still
bugging out in the BUG_ON(!irqs_disabled());


Initializing CPU#0                  
Kernel command line: root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0                 
kernel profiling enabled (shift: 1)                                   
PID hash table entries: 4096 (order: 12, 131072 bytes)                                                      
time.c: Using 14.318180 MHz HPET timer.                                       
time.c: Detected 3400.160 MHz processor.                                        
----------- [cut here ] --------- [please bite here ] ---------                                                               
Kernel BUG at kernel/posix-cpu-timers.c:1279                                            
invalid opcode: 0000 [1] PREEMPT SMP
last sysfs file:                
CPU 0     
Modules linked in:                  
Pid: 0, comm: swapper Not tainted 2.6.16-rc4-mm1 #147
RIP: 0010:[<ffffffff801401a2>] <ffffffff801401a2>{run_posix_cpu_timers+42}
RSP: 0018:ffffffff805a3dd8  EFLAGS: 00010202
RAX: ffffffff805a3e18 RBX: ffffffff804d6300 RCX: 0000000000000000
RDX: ffffffff80617000 RSI: 0000000000000000 RDI: ffffffff804d6300
RBP: ffffffff805a3e58 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff804d6300 R14: 0000000000000000 R15: ffffffff8062be88
FS:  0000000000000000(0000) GS:ffffffff80617000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffffffff8062a000, task ffffffff804d6300)
Stack: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
       ffff810007042840 0000000000000000 0000000000000000 0000000000000296
       ffffffff805a3e18 ffffffff805a3e18
Call Trace: <IRQ> <ffffffff80134211>{update_process_times+110}
       <ffffffff80114f3c>{smp_local_timer_interrupt+40} <ffffffff8010d39e>{main_timer_handler+527}
       <ffffffff8010d554>{timer_interrupt+21} <ffffffff8014c44f>{handle_IRQ_event+48}
       <ffffffff8014c528>{__do_IRQ+163} <ffffffff8010c073>{do_IRQ+50}
       <ffffffff80109e36>{ret_from_intr+0} <EOI> <ffffffff803cb47a>{_spin_unlock_irqrestore+17}
       <ffffffff8014c85a>{setup_irq+226} <ffffffff80630b5a>{time_init+698}
       <ffffffff8062d6f9>{start_kernel+218} <ffffffff8062d286>{_sinittext+646}
Code: 0f 0b 68 42 34 3f 80 c2 ff 04 49 8b 95 50 02 00 00 48 85 d2
RIP <ffffffff801401a2>{run_posix_cpu_timer
RIP <ffffffff801401a2>{run_posix_cpu_timer
 BUG: warning at kernel/panic.c:138/panic()
Call Trace: <IRQ> <ffffffff8012ab9d>{panic+557} <ffffffff803cb484>{_spin_unlock_irqrestore+27}
       <ffffffff80231aaa>{__up_read+193} <ffffffff8013827d>{blocking_notifier_call_chain+71}
       <ffffffff8012d565>{do_exit+158} <ffffffff803cb484>{_spin_unlock_irqrestore+27}
       <ffffffff8010b34f>{do_divide_error+0} <ffffffff803cbff1>{do_trap+235}
       <ffffffff8010b596>{do_invalid_op+172} <ffffffff801401a2>{run_posix_cpu_timers+42}
       <ffffffff8010a821>{error_exit+0} <ffffffff801401a2>{run_posix_cpu_timers+42}
       <ffffffff80134211>{update_process_times+110} <ffffffff80114f3c>{smp_local_timer_interrupt+40}
       <ffffffff8010d39e>{main_timer_handler+527} <ffffffff8010d554>{timer_interrupt+21}
       <ffffffff8014c44f>{handle_IRQ_event+48} <ffffffff8014c528>{__do_IRQ+163}
       <ffffffff8010c073>{do_IRQ+50} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff803cb47a>{_spin_unlock_irqrestore+17} <ffffffff8014c85a>{setup_irq+226}
       <ffffffff80630b5a>{time_init+698} <ffffffff8062d6f9>{start_kernel+218}
       <ffffffff8062d286>{_sinittext+646}


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Register atomic_notifiers in atomic context
  2006-02-21 23:28 ` Andrew Morton
@ 2006-02-22 16:08   ` Alan Stern
  2006-02-22 16:12     ` Randy.Dunlap
  2006-02-23  2:26     ` Andrew Morton
  0 siblings, 2 replies; 26+ messages in thread
From: Alan Stern @ 2006-02-22 16:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: sekharan, linux-kernel

On Tue, 21 Feb 2006, Andrew Morton wrote:

> Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > Some atomic notifier chains require registrations to take place in atomic
> >  context.  An example is the die_notifier, which on some architectures may
> >  be accessed very early during the boot-up procedure, before task-switching
> >  is legal.  To accomodate these chains, this patch (as655) replaces the
> >  mutex in the atomic_notifier_head structure with a spinlock.
> 
> I think that's a good change, however x86_64 still crashes.
> 
> At great personal expense (ie: using winxp hyperterminal (I now understand
> why some of the traces we get are so crappy)) I have a trace.  It's still
> bugging out in the BUG_ON(!irqs_disabled());

I hate to keep asking you to test this since you're so busy.  If you know
anyone else with an x86_64 who could investigate instead, don't hesitate
to pass this on.

The only reason this patch set could cause interrupts to get enabled (when
they weren't enabled before) would be if some code was using one of the
blocking notifier chains.  If one of the down_read() or down_write() calls
blocked, the scheduler would enable interrupts.  On the other hand, if
there is only a single task running, how could a down_read() or
down_write() call manage to block?

The diagnostic patch below is a heavy-handed approach, but it ought to
indicate the source of the problem.  Doing anything to a blocking notifier
chain at a time when task switching is not legal should be a no-no.  
Maybe this means that a chain got misclassified as blocking when it
really should be atomic -- or maybe it means there has always been a more
serious problem that has never been detected before.

Alan Stern

P.S. (off-topic): Would it be possible to make the -mm series visible
through the web interface at <http://www.kernel.org/git/>?



Index: usb-2.6/kernel/sys.c
===================================================================
--- usb-2.6.orig/kernel/sys.c
+++ usb-2.6/kernel/sys.c
@@ -249,7 +249,11 @@ int blocking_notifier_chain_register(str
 {
 	int ret;
 
-	down_write(&nh->rwsem);
+	if (!down_write_trylock(&nh->rwsem)) {
+		printk(KERN_WARNING "%s\n", __FUNCTION__);
+		dump_stack();
+		down_write(&nh->rwsem);
+	}
 	ret = notifier_chain_register(&nh->head, n);
 	up_write(&nh->rwsem);
 	return ret;
@@ -272,7 +276,11 @@ int blocking_notifier_chain_unregister(s
 {
 	int ret;
 
-	down_write(&nh->rwsem);
+	if (!down_write_trylock(&nh->rwsem)) {
+		printk(KERN_WARNING "%s\n", __FUNCTION__);
+		dump_stack();
+		down_write(&nh->rwsem);
+	}
 	ret = notifier_chain_unregister(&nh->head, n);
 	up_write(&nh->rwsem);
 	return ret;
@@ -302,7 +310,11 @@ int blocking_notifier_call_chain(struct 
 {
 	int ret;
 
-	down_read(&nh->rwsem);
+	if (!down_read_trylock(&nh->rwsem)) {
+		printk(KERN_WARNING "%s\n", __FUNCTION__);
+		dump_stack();
+		down_read(&nh->rwsem);
+	}
 	ret = notifier_call_chain(&nh->head, val, v);
 	up_read(&nh->rwsem);
 	return ret;


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Register atomic_notifiers in atomic context
  2006-02-22 16:08   ` Alan Stern
@ 2006-02-22 16:12     ` Randy.Dunlap
  2006-02-23  2:26     ` Andrew Morton
  1 sibling, 0 replies; 26+ messages in thread
From: Randy.Dunlap @ 2006-02-22 16:12 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andrew Morton, sekharan, linux-kernel

On Wed, 22 Feb 2006, Alan Stern wrote:

>
> P.S. (off-topic): Would it be possible to make the -mm series visible
> through the web interface at <http://www.kernel.org/git/>?

see
http://www.kernel.org/git/?p=linux/kernel/git/smurf/linux-trees.git;a=summary
and scroll down to tags

-- 
~Randy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Register atomic_notifiers in atomic context
  2006-02-22 16:08   ` Alan Stern
  2006-02-22 16:12     ` Randy.Dunlap
@ 2006-02-23  2:26     ` Andrew Morton
  2006-02-23 17:15       ` Alan Stern
  1 sibling, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2006-02-23  2:26 UTC (permalink / raw)
  To: Alan Stern; +Cc: sekharan, linux-kernel

Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Tue, 21 Feb 2006, Andrew Morton wrote:
> 
> > Alan Stern <stern@rowland.harvard.edu> wrote:
> > >
> > > Some atomic notifier chains require registrations to take place in atomic
> > >  context.  An example is the die_notifier, which on some architectures may
> > >  be accessed very early during the boot-up procedure, before task-switching
> > >  is legal.  To accomodate these chains, this patch (as655) replaces the
> > >  mutex in the atomic_notifier_head structure with a spinlock.
> > 
> > I think that's a good change, however x86_64 still crashes.
> > 
> > At great personal expense (ie: using winxp hyperterminal (I now understand
> > why some of the traces we get are so crappy)) I have a trace.  It's still
> > bugging out in the BUG_ON(!irqs_disabled());
> 
> I hate to keep asking you to test this since you're so busy.  If you know
> anyone else with an x86_64 who could investigate instead, don't hesitate
> to pass this on.

dood, I spend probably half my time weeding out bugs people sent me and
another 25% finding bugs we already merged...

> The diagnostic patch below is a heavy-handed approach, but it ought to
> indicate the source of the problem.  Doing anything to a blocking notifier
> chain at a time when task switching is not legal should be a no-no.  
> Maybe this means that a chain got misclassified as blocking when it
> really should be atomic -- or maybe it means there has always been a more
> serious problem that has never been detected before.

heh, it fixed the bug.

> Index: usb-2.6/kernel/sys.c
> ===================================================================
> --- usb-2.6.orig/kernel/sys.c
> +++ usb-2.6/kernel/sys.c
> @@ -249,7 +249,11 @@ int blocking_notifier_chain_register(str
>  {
>  	int ret;
>  
> -	down_write(&nh->rwsem);
> +	if (!down_write_trylock(&nh->rwsem)) {
> +		printk(KERN_WARNING "%s\n", __FUNCTION__);
> +		dump_stack();
> +		down_write(&nh->rwsem);
> +	}

See, we avoid doing down_write() if the lock is uncontended.  And x86_64's
uncontended down_write() unconditionally enables interrupts.  This evaded
notice because might_sleep() warnings are disabled early in boot due to
various horrid things.

Maybe we should enable the might_sleep() warnings because of this nasty
rwsem trap.  We tried to do that a couple of weeks ago but we needed a pile
of nasty workarounds to avoid false positives.


Applying this:

--- 25/kernel/sys.c~notifier-sleep-debug-update	2006-02-22 18:23:26.000000000 -0800
+++ 25-akpm/kernel/sys.c	2006-02-22 18:25:45.000000000 -0800
@@ -249,6 +249,8 @@ int blocking_notifier_chain_register(str
 {
 	int ret;
 
+	if (irqs_disabled())
+		dump_stack();
 	if (!down_write_trylock(&nh->rwsem)) {
 		printk(KERN_WARNING "%s\n", __FUNCTION__);
 		dump_stack();
@@ -276,6 +278,8 @@ int blocking_notifier_chain_unregister(s
 {
 	int ret;
 
+	if (irqs_disabled())
+		dump_stack();
 	if (!down_write_trylock(&nh->rwsem)) {
 		printk(KERN_WARNING "%s\n", __FUNCTION__);
 		dump_stack();
@@ -310,6 +314,8 @@ int blocking_notifier_call_chain(struct 
 {
 	int ret;
 
+	if (irqs_disabled())
+		dump_stack();
 	if (!down_read_trylock(&nh->rwsem)) {
 		printk(KERN_WARNING "%s\n", __FUNCTION__);
 		dump_stack();
_

Gives:

Bootdata ok (command line is root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0)
Linux version 2.6.16-rc4-mm1 (akpm@x) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #151 SMP PREEMPT Wed Feb 22 18:25:50 PST 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000ebbd0 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007ffd0000 (usable)
 BIOS-e820: 000000007ffd0000 - 000000007ffdf000 (ACPI data)
 BIOS-e820: 000000007ffdf000 - 0000000080000000 (ACPI NVS)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
kernel direct mapping tables up to 180000000 @ 8000-8000
ACPI: PM-Timer IO Port: 0x408
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
Processor #6 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled)
Processor #7 15:3 APIC version 20
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfec81000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 32, address 0xfec81000, GSI 24-47
ACPI: IOAPIC (id[0x0a] address[0xfec81400] gsi_base[48])
IOAPIC[2]: apic_id 10, version 32, address 0xfec81400, GSI 48-71
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Setting APIC routing to flat
ACPI: HPET id: 0x8086a202 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:60000000)
Checking aperture...
Built 1 zonelists
Kernel command line: root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0
kernel profiling enabled (shift: 1)
Initializing CPU#0

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
       <ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639482>{rcu_init+40}
       <ffffffff806296f8>{start_kernel+217} <ffffffff80629286>{_sinittext+646}
PID hash table entries: 4096 (order: 12, 131072 bytes)

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
       <ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639259>{init_timers+40}
       <ffffffff80629707>{start_kernel+232} <ffffffff80629286>{_sinittext+646}

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
       <ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639821>{hrtimers_init+40}
       <ffffffff8062970c>{start_kernel+237} <ffffffff80629286>{_sinittext+646}
time.c: Using 14.318180 MHz HPET timer.
time.c: Detected 3400.166 MHz processor.
disabling early console
Bootdata ok (command line is root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0)
Linux version 2.6.16-rc4-mm1 (akpm@x) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #151 SMP PREEMPT Wed Feb 22 18:25:50 PST 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000ebbd0 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007ffd0000 (usable)
 BIOS-e820: 000000007ffd0000 - 000000007ffdf000 (ACPI data)
 BIOS-e820: 000000007ffdf000 - 0000000080000000 (ACPI NVS)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
ACPI: PM-Timer IO Port: 0x408
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
Processor #6 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled)
Processor #7 15:3 APIC version 20
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfec81000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 32, address 0xfec81000, GSI 24-47
ACPI: IOAPIC (id[0x0a] address[0xfec81400] gsi_base[48])
IOAPIC[2]: apic_id 10, version 32, address 0xfec81400, GSI 48-71
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Setting APIC routing to flat
ACPI: HPET id: 0x8086a202 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:60000000)
Checking aperture...
Built 1 zonelists
Kernel command line: root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0
kernel profiling enabled (shift: 1)
Initializing CPU#0

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
       <ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639482>{rcu_init+40}
       <ffffffff806296f8>{start_kernel+217} <ffffffff80629286>{_sinittext+646}
PID hash table entries: 4096 (order: 12, 131072 bytes)

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
       <ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639259>{init_timers+40}
       <ffffffff80629707>{start_kernel+232} <ffffffff80629286>{_sinittext+646}

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
       <ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639821>{hrtimers_init+40}
       <ffffffff8062970c>{start_kernel+237} <ffffffff80629286>{_sinittext+646}
time.c: Using 14.318180 MHz HPET timer.
time.c: Detected 3400.166 MHz processor.
disabling early console
Console: colour VGA+ 80x25

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8063b983>{vfs_caches_init_early+0} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8063a155>{__alloc_bootmem_core+773} <ffffffff8012b0fa>{__call_console_drivers+75}
       <ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063aeae>{alloc_large_system_hash+220}
       <ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
       <ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
       <ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
       <ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
       <ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
       <ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
       <ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
       <ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
       <ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
       <ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
       <ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
       <ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
       <ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
       <ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
       <ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
       <ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
       <ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
       <ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
       <ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
       <ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
       <ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
       <ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
       <ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
       <ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
       <ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
       <ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
       <ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
       <ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
       <ffffffff80629286>{_sinittext+646}


We're using blocking notifier chains in places where it's really risky, such as
__exit_idle().  Time for a rethink, methinks.

I'd suggest that in further development, you enable might_sleep() in early
boot - that would have caught such things..

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Register atomic_notifiers in atomic context
  2006-02-23  2:26     ` Andrew Morton
@ 2006-02-23 17:15       ` Alan Stern
  2006-02-23 19:03         ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-23 17:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: sekharan, linux-kernel

On Wed, 22 Feb 2006, Andrew Morton wrote:

> See, we avoid doing down_write() if the lock is uncontended.  And x86_64's
> uncontended down_write() unconditionally enables interrupts.

Not just x86_64; that's part of the general rw-semaphore implementation in
lib/rwsem-spinlock.c.  down_write() and down_read() always enable
interrupts, whether contended or not.

>  This evaded
> notice because might_sleep() warnings are disabled early in boot due to
> various horrid things.
> 
> Maybe we should enable the might_sleep() warnings because of this nasty
> rwsem trap.  We tried to do that a couple of weeks ago but we needed a pile
> of nasty workarounds to avoid false positives.

The situation is prone to bugs.  Special-case situations (like not
allowing task switching during system start-up) are always hard to handle.

> Applying this:
> 
> --- 25/kernel/sys.c~notifier-sleep-debug-update	2006-02-22 18:23:26.000000000 -0800
> +++ 25-akpm/kernel/sys.c	2006-02-22 18:25:45.000000000 -0800
> @@ -249,6 +249,8 @@ int blocking_notifier_chain_register(str
>  {
>  	int ret;
>  
> +	if (irqs_disabled())
> +		dump_stack();
>  	if (!down_write_trylock(&nh->rwsem)) {
>  		printk(KERN_WARNING "%s\n", __FUNCTION__);
>  		dump_stack();
> @@ -276,6 +278,8 @@ int blocking_notifier_chain_unregister(s
>  {
>  	int ret;
>  
> +	if (irqs_disabled())
> +		dump_stack();
>  	if (!down_write_trylock(&nh->rwsem)) {
>  		printk(KERN_WARNING "%s\n", __FUNCTION__);
>  		dump_stack();
> @@ -310,6 +314,8 @@ int blocking_notifier_call_chain(struct 
>  {
>  	int ret;
>  
> +	if (irqs_disabled())
> +		dump_stack();
>  	if (!down_read_trylock(&nh->rwsem)) {
>  		printk(KERN_WARNING "%s\n", __FUNCTION__);
>  		dump_stack();
> _
> 
> Gives:

... a bunch of calls to register_cpu_notifier and __exit_idle ...

> We're using blocking notifier chains in places where it's really risky, such as
> __exit_idle().  Time for a rethink, methinks.

Yes, that was clearly a mistake in the notifier-chain patch.  In x86_64,
the idle_notifier chain should be atomic, not blocking.  I missed the fact
that it gets invoked during an interrupt handler.

On the other hand, nothing in the vanilla kernel uses that notifier chain, 
and there doesn't seem to be any equivalent in other architectures.  
Perhaps it should just go away entirely?

The calls to register_cpu_notifier are harder.  That chain really does 
need to be blocking, which means we can't avoid calling down_write.  The 
only solution I can think of is to use down_write_trylock in the 
blocking_notifier_chain_register and unregister routines, even though 
doing that is a crock.

Or else change __down_read and __down_write to use spin_lock_irqsave 
instead of spin_lock_irq.  What do you think would be best?

> I'd suggest that in further development, you enable might_sleep() in early
> boot - that would have caught such things..

Not a bad idea.  I presume that removing the "system_state == 
SYSTEM_RUNNING" test in __might_sleep will have that effect?

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Register atomic_notifiers in atomic context
  2006-02-23 17:15       ` Alan Stern
@ 2006-02-23 19:03         ` Andrew Morton
  2006-02-23 22:28           ` [PATCH] The idle notifier chain should be atomic Alan Stern
  2006-02-23 22:36           ` [PATCH] Avoid calling down_read and down_write during startup Alan Stern
  0 siblings, 2 replies; 26+ messages in thread
From: Andrew Morton @ 2006-02-23 19:03 UTC (permalink / raw)
  To: Alan Stern; +Cc: sekharan, linux-kernel

Alan Stern <stern@rowland.harvard.edu> wrote:
>
>  The calls to register_cpu_notifier are harder.  That chain really does 
>  need to be blocking

Why?

> which means we can't avoid calling down_write.  The 
>  only solution I can think of is to use down_write_trylock in the 
>  blocking_notifier_chain_register and unregister routines, even though 
>  doing that is a crock.
> 
>  Or else change __down_read and __down_write to use spin_lock_irqsave 
>  instead of spin_lock_irq.  What do you think would be best?

Nothing's pretty.  Perhaps look at system_state and not do any locking at all
in early boot?

>  > I'd suggest that in further development, you enable might_sleep() in early
>  > boot - that would have caught such things..
> 
>  Not a bad idea.  I presume that removing the "system_state == 
>  SYSTEM_RUNNING" test in __might_sleep will have that effect?

Yup.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] The idle notifier chain should be atomic
  2006-02-23 19:03         ` Andrew Morton
@ 2006-02-23 22:28           ` Alan Stern
  2006-02-23 23:49             ` Andi Kleen
  2006-02-23 22:36           ` [PATCH] Avoid calling down_read and down_write during startup Alan Stern
  1 sibling, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-23 22:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: sekharan, linux-kernel

This patch (as658) makes the idle_notifier in x86_64 and idle_chain in
s390 into atomic notifier chains rather than blocking chains.  This is
necessary because they are called during IRQ handling as CPUs leave and
enter the idle state.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>

---

This patch should fix one of the problems you saw.  A second patch to fix 
the other problem is right behind.

Please revert those two debugging changes (the one I sent and the one you 
wrote) before applying this.

Alan Stern

Index: usb-2.6/arch/x86_64/kernel/process.c
===================================================================
--- usb-2.6.orig/arch/x86_64/kernel/process.c
+++ usb-2.6/arch/x86_64/kernel/process.c
@@ -66,17 +66,17 @@ EXPORT_SYMBOL(boot_option_idle_override)
 void (*pm_idle)(void);
 static DEFINE_PER_CPU(unsigned int, cpu_idle_state);
 
-static BLOCKING_NOTIFIER_HEAD(idle_notifier);
+static ATOMIC_NOTIFIER_HEAD(idle_notifier);
 
 void idle_notifier_register(struct notifier_block *n)
 {
-	blocking_notifier_chain_register(&idle_notifier, n);
+	atomic_notifier_chain_register(&idle_notifier, n);
 }
 EXPORT_SYMBOL_GPL(idle_notifier_register);
 
 void idle_notifier_unregister(struct notifier_block *n)
 {
-	blocking_notifier_chain_unregister(&idle_notifier, n);
+	atomic_notifier_chain_unregister(&idle_notifier, n);
 }
 EXPORT_SYMBOL(idle_notifier_unregister);
 
@@ -86,13 +86,13 @@ static DEFINE_PER_CPU(enum idle_state, i
 void enter_idle(void)
 {
 	__get_cpu_var(idle_state) = CPU_IDLE;
-	blocking_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
+	atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
 }
 
 static void __exit_idle(void)
 {
 	__get_cpu_var(idle_state) = CPU_NOT_IDLE;
-	blocking_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
+	atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
 }
 
 /* Called from interrupts to signify idle end */
Index: usb-2.6/arch/s390/kernel/process.c
===================================================================
--- usb-2.6.orig/arch/s390/kernel/process.c
+++ usb-2.6/arch/s390/kernel/process.c
@@ -76,17 +76,17 @@ unsigned long thread_saved_pc(struct tas
 /*
  * Need to know about CPUs going idle?
  */
-static BLOCKING_NOTIFIER_HEAD(idle_chain);
+static ATOMIC_NOTIFIER_HEAD(idle_chain);
 
 int register_idle_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_register(&idle_chain, nb);
+	return atomic_notifier_chain_register(&idle_chain, nb);
 }
 EXPORT_SYMBOL(register_idle_notifier);
 
 int unregister_idle_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_unregister(&idle_chain, nb);
+	return atomic_notifier_chain_unregister(&idle_chain, nb);
 }
 EXPORT_SYMBOL(unregister_idle_notifier);
 
@@ -95,7 +95,7 @@ void do_monitor_call(struct pt_regs *reg
 	/* disable monitor call class 0 */
 	__ctl_clear_bit(8, 15);
 
-	blocking_notifier_call_chain(&idle_chain, CPU_NOT_IDLE,
+	atomic_notifier_call_chain(&idle_chain, CPU_NOT_IDLE,
 			    (void *)(long) smp_processor_id());
 }
 
@@ -116,7 +116,7 @@ void default_idle(void)
 		return;
 	}
 
-	rc = blocking_notifier_call_chain(&idle_chain,
+	rc = atomic_notifier_call_chain(&idle_chain,
 			CPU_IDLE, (void *)(long) cpu);
 	if (rc != NOTIFY_OK && rc != NOTIFY_DONE)
 		BUG();


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Avoid calling down_read and down_write during startup
  2006-02-23 19:03         ` Andrew Morton
  2006-02-23 22:28           ` [PATCH] The idle notifier chain should be atomic Alan Stern
@ 2006-02-23 22:36           ` Alan Stern
  2006-02-23 22:37             ` Benjamin LaHaise
  1 sibling, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-23 22:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: sekharan, linux-kernel

This patch (as660) changes the registration and unregistration routines 
for blocking notifier chains.  During system startup, when task switching 
is illegal, the routines will avoid calling down_write().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>

---

On Thu, 23 Feb 2006, Andrew Morton wrote:

> Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> >  The calls to register_cpu_notifier are harder.  That chain really does 
> >  need to be blocking
> 
> Why?

Because its callouts invoke blocking functions.  For example,
drivers/base/topology.c:topology_cpu_callback() calls
sysfs_create_group().  In drivers/cpufreq/cpufreq.c,
cpufreq_cpu_callback() calls cpufreq_add_dev(), which does kzalloc with
GFP_KERNEL.

> > which means we can't avoid calling down_write.  The 
> >  only solution I can think of is to use down_write_trylock in the 
> >  blocking_notifier_chain_register and unregister routines, even though 
> >  doing that is a crock.
> > 
> >  Or else change __down_read and __down_write to use spin_lock_irqsave 
> >  instead of spin_lock_irq.  What do you think would be best?
> 
> Nothing's pretty.  Perhaps look at system_state and not do any locking at all
> in early boot?

Okay.  Here's a patch to do that.  It only affects the registration 
routines; the actual call-out function still acquires the lock.  That's 
because (1) I didn't want to add extra overhead to a frequently-used 
routine, and (2) if this notifier chain gets called while interrupts are 
disabled then there's something badly wrong.

Combined with the previous patch, maybe you'll find that everything works
perfectly now...  :-)

Please revert those two debugging patches (the one I sent you and the one 
you wrote) before applying this.

Alan Stern

Index: usb-2.6/kernel/sys.c
===================================================================
--- usb-2.6.orig/kernel/sys.c
+++ usb-2.6/kernel/sys.c
@@ -249,6 +249,14 @@ int blocking_notifier_chain_register(str
 {
 	int ret;
 
+	/*
+	 * This code gets used during boot-up, when task switching is
+	 * not yet working and interrupts must remain disabled.  At
+	 * such times we must not call down_write().
+	 */
+	if (unlikely(system_state == SYSTEM_BOOTING))
+		return notifier_chain_register(&nh->head, n);
+
 	down_write(&nh->rwsem);
 	ret = notifier_chain_register(&nh->head, n);
 	up_write(&nh->rwsem);
@@ -272,6 +280,14 @@ int blocking_notifier_chain_unregister(s
 {
 	int ret;
 
+	/*
+	 * This code gets used during boot-up, when task switching is
+	 * not yet working and interrupts must remain disabled.  At
+	 * such times we must not call down_write().
+	 */
+	if (unlikely(system_state == SYSTEM_BOOTING))
+		return notifier_chain_unregister(&nh->head, n);
+
 	down_write(&nh->rwsem);
 	ret = notifier_chain_unregister(&nh->head, n);
 	up_write(&nh->rwsem);


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-23 22:36           ` [PATCH] Avoid calling down_read and down_write during startup Alan Stern
@ 2006-02-23 22:37             ` Benjamin LaHaise
  2006-02-24  0:16               ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2006-02-23 22:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andrew Morton, sekharan, linux-kernel

On Thu, Feb 23, 2006 at 05:36:56PM -0500, Alan Stern wrote:
> This patch (as660) changes the registration and unregistration routines 
> for blocking notifier chains.  During system startup, when task switching 
> is illegal, the routines will avoid calling down_write().

Why is that necessary?  The down_write() will immediately succeed as no 
other process can possibly be holding the lock when the system is booting, 
so the special casing doesn't fix anything.

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] The idle notifier chain should be atomic
  2006-02-23 22:28           ` [PATCH] The idle notifier chain should be atomic Alan Stern
@ 2006-02-23 23:49             ` Andi Kleen
  2006-02-24  3:24               ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2006-02-23 23:49 UTC (permalink / raw)
  To: Alan Stern; +Cc: sekharan, linux-kernel

Alan Stern <stern@rowland.harvard.edu> writes:

> This patch (as658) makes the idle_notifier in x86_64 and idle_chain in
> s390 into atomic notifier chains rather than blocking chains.  This is
> necessary because they are called during IRQ handling as CPUs leave and
> enter the idle state.

Actually they aren't. While the code is called from the interrupt
handler logically it belong to the idle thread, not the interrupt handler.
They are only called when the interrupt directly interrupts the idle 
thread, so no atomicity needed.

-Andi

P.S.: Please cc maintainers in the future.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-23 22:37             ` Benjamin LaHaise
@ 2006-02-24  0:16               ` Andrew Morton
  2006-02-24  3:18                 ` Alan Stern
  2006-02-24 14:39                 ` Benjamin LaHaise
  0 siblings, 2 replies; 26+ messages in thread
From: Andrew Morton @ 2006-02-24  0:16 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: stern, sekharan, linux-kernel

Benjamin LaHaise <bcrl@kvack.org> wrote:
>
> On Thu, Feb 23, 2006 at 05:36:56PM -0500, Alan Stern wrote:
>  > This patch (as660) changes the registration and unregistration routines 
>  > for blocking notifier chains.  During system startup, when task switching 
>  > is illegal, the routines will avoid calling down_write().
> 
>  Why is that necessary?  The down_write() will immediately succeed as no 
>  other process can possibly be holding the lock when the system is booting, 
>  so the special casing doesn't fix anything.

down_write() unconditionally (and reasonably) does local_irq_enable() in
the uncontended case.  And enabling local interrupts early in boot can
cause crashes.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24  0:16               ` Andrew Morton
@ 2006-02-24  3:18                 ` Alan Stern
  2006-02-24 14:40                   ` Benjamin LaHaise
  2006-02-24 14:39                 ` Benjamin LaHaise
  1 sibling, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-24  3:18 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, sekharan, Kernel development list

On Thu, 23 Feb 2006, Andrew Morton wrote:

> Benjamin LaHaise <bcrl@kvack.org> wrote:
> >
> > On Thu, Feb 23, 2006 at 05:36:56PM -0500, Alan Stern wrote:
> >  > This patch (as660) changes the registration and unregistration routines 
> >  > for blocking notifier chains.  During system startup, when task switching 
> >  > is illegal, the routines will avoid calling down_write().
> > 
> >  Why is that necessary?  The down_write() will immediately succeed as no 
> >  other process can possibly be holding the lock when the system is booting, 
> >  so the special casing doesn't fix anything.
> 
> down_write() unconditionally (and reasonably) does local_irq_enable() in
> the uncontended case.  And enabling local interrupts early in boot can
> cause crashes.

Ben, earlier you expressed concern about the extra overhead due to 
cache-line contention (on SMP) in the down_read() call added to 
blocking_notifier_call_chain.  I don't remember which notifier chain in 
particular you were worried about; something to do with networking.

Does this still bother you?  I can see a couple of ways around it.

One is to make that notifier chain atomic instead of blocking.  Of course,
this is feasible only if the chain's callout routines never block.  If
they do, then perhaps there's no point worrying about cache misses.

Another possibility is to write a variant implementation of rw-semaphores,
one that wouldn't cause a cache miss in the common case of an uncontested
read lock.  It could be done.  I can write a somewhat inefficient
version easily enough; no doubt someone more experienced in this sort of
thing could do a better job.

What do you think?

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] The idle notifier chain should be atomic
  2006-02-23 23:49             ` Andi Kleen
@ 2006-02-24  3:24               ` Alan Stern
  2006-02-24  3:27                 ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-24  3:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: sekharan, linux-kernel

On 24 Feb 2006, Andi Kleen wrote:

> Alan Stern <stern@rowland.harvard.edu> writes:
> 
> > This patch (as658) makes the idle_notifier in x86_64 and idle_chain in
> > s390 into atomic notifier chains rather than blocking chains.  This is
> > necessary because they are called during IRQ handling as CPUs leave and
> > enter the idle state.
> 
> Actually they aren't. While the code is called from the interrupt
> handler logically it belong to the idle thread, not the interrupt handler.
> They are only called when the interrupt directly interrupts the idle 
> thread, so no atomicity needed.

In do_IRQ() there's a call to exit_idle(), which calls __exit_idle(), 
which runs the idle_notifier call chain.  Surely you're not saying that we 
can do a down_read() in this pathway?

And actually the chain's type doesn't seem to make much difference, since
at the moment there's nothing in the vanilla kernel that registers for the
idle_notifier chain.

> -Andi
> 
> P.S.: Please cc maintainers in the future.

Yes, I should have sent the patch to you too.  I apologize.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] The idle notifier chain should be atomic
  2006-02-24  3:24               ` Alan Stern
@ 2006-02-24  3:27                 ` Andi Kleen
  2006-02-24  4:04                   ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2006-02-24  3:27 UTC (permalink / raw)
  To: Alan Stern; +Cc: sekharan, linux-kernel

On Friday 24 February 2006 04:24, Alan Stern wrote:
 
> In do_IRQ() there's a call to exit_idle(), which calls __exit_idle(), 
> which runs the idle_notifier call chain.  Surely you're not saying that we 
> can do a down_read() in this pathway?

No, but not because it's in an interrupt but because sleeping in the idle
task is illegal.

> And actually the chain's type doesn't seem to make much difference, since
> at the moment there's nothing in the vanilla kernel that registers for the
> idle_notifier chain.

Will come eventually.

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] The idle notifier chain should be atomic
  2006-02-24  3:27                 ` Andi Kleen
@ 2006-02-24  4:04                   ` Alan Stern
  0 siblings, 0 replies; 26+ messages in thread
From: Alan Stern @ 2006-02-24  4:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: sekharan, linux-kernel

On Fri, 24 Feb 2006, Andi Kleen wrote:

> On Friday 24 February 2006 04:24, Alan Stern wrote:
>  
> > In do_IRQ() there's a call to exit_idle(), which calls __exit_idle(), 
> > which runs the idle_notifier call chain.  Surely you're not saying that we 
> > can do a down_read() in this pathway?
> 
> No, but not because it's in an interrupt but because sleeping in the idle
> task is illegal.

Well, either reason is sufficient justification for making idle_notifier 
an atomic chain.

> > And actually the chain's type doesn't seem to make much difference, since
> > at the moment there's nothing in the vanilla kernel that registers for the
> > idle_notifier chain.
> 
> Will come eventually.

Will that be just for x86_64 or for all architectures?

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24  0:16               ` Andrew Morton
  2006-02-24  3:18                 ` Alan Stern
@ 2006-02-24 14:39                 ` Benjamin LaHaise
  2006-02-24 15:03                   ` Alan Stern
  1 sibling, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2006-02-24 14:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: stern, sekharan, linux-kernel

On Thu, Feb 23, 2006 at 04:16:31PM -0800, Andrew Morton wrote:
> down_write() unconditionally (and reasonably) does local_irq_enable() in
> the uncontended case.  And enabling local interrupts early in boot can
> cause crashes.

Why not do a down_write_trylock() instead first?  Then the code doesn't 
have the dependancy on system_state, which seems horribly fragile.

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24  3:18                 ` Alan Stern
@ 2006-02-24 14:40                   ` Benjamin LaHaise
  2006-02-24 15:04                     ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2006-02-24 14:40 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andrew Morton, sekharan, Kernel development list

On Thu, Feb 23, 2006 at 10:18:18PM -0500, Alan Stern wrote:
> Ben, earlier you expressed concern about the extra overhead due to 
> cache-line contention (on SMP) in the down_read() call added to 
> blocking_notifier_call_chain.  I don't remember which notifier chain in 
> particular you were worried about; something to do with networking.
> 
> Does this still bother you?  I can see a couple of ways around it.

Yes it's a problem.  Any read lock is going to act as a memory barrier, 
and we need fewer of those in hot paths, not more to slow things down.  

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 14:39                 ` Benjamin LaHaise
@ 2006-02-24 15:03                   ` Alan Stern
  0 siblings, 0 replies; 26+ messages in thread
From: Alan Stern @ 2006-02-24 15:03 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, sekharan, linux-kernel

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Thu, Feb 23, 2006 at 04:16:31PM -0800, Andrew Morton wrote:
> > down_write() unconditionally (and reasonably) does local_irq_enable() in
> > the uncontended case.  And enabling local interrupts early in boot can
> > cause crashes.
> 
> Why not do a down_write_trylock() instead first?  Then the code doesn't 
> have the dependancy on system_state, which seems horribly fragile.

I suggested this to Andrew.  His reply was as follows:

> > which means we can't avoid calling down_write.  The 
> >  only solution I can think of is to use down_write_trylock in the 
> >  blocking_notifier_chain_register and unregister routines, even though 
> >  doing that is a crock.
> > 
> >  Or else change __down_read and __down_write to use spin_lock_irqsave 
> >  instead of spin_lock_irq.  What do you think would be best?
> 
> Nothing's pretty.  Perhaps look at system_state and not do any locking at all
> in early boot?

I admit that the whole things is fragile.  IMO the safest approach would
be for __down_read and __down_write not to assume that interrupts are
currently enabled.  But that would introduce more overhead as well; at
least this way the overhead is confined to the notifier chain registration
routines.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 14:40                   ` Benjamin LaHaise
@ 2006-02-24 15:04                     ` Alan Stern
  2006-02-24 15:15                       ` Benjamin LaHaise
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-24 15:04 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, sekharan, Kernel development list

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Thu, Feb 23, 2006 at 10:18:18PM -0500, Alan Stern wrote:
> > Ben, earlier you expressed concern about the extra overhead due to 
> > cache-line contention (on SMP) in the down_read() call added to 
> > blocking_notifier_call_chain.  I don't remember which notifier chain in 
> > particular you were worried about; something to do with networking.
> > 
> > Does this still bother you?  I can see a couple of ways around it.
> 
> Yes it's a problem.  Any read lock is going to act as a memory barrier, 
> and we need fewer of those in hot paths, not more to slow things down.  

What do you think of the two suggestions in my previous message?

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 15:04                     ` Alan Stern
@ 2006-02-24 15:15                       ` Benjamin LaHaise
  2006-02-24 16:44                         ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2006-02-24 15:15 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andrew Morton, sekharan, Kernel development list

On Fri, Feb 24, 2006 at 10:04:23AM -0500, Alan Stern wrote:
> What do you think of the two suggestions in my previous message?

Even if the read version of the lock only touches a cacheline local to 
the cpu, you'd still have to use the lock prefix to allow for correctness 
when a writer comes along.  It is not cacheline bouncing that worries me, 
it is serialising instructions and memory barriers as those hurt immensely 
when the data is in the cache.  I've been looking at a lot of profiles on 
P4s of late, and every single locked instruction is painful as it means 
all of the memory ordering rules come into play.  Neither suggestion 
addresses that overhead that has been introduced.

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 16:44                         ` Alan Stern
@ 2006-02-24 16:44                           ` Benjamin LaHaise
  2006-02-24 17:59                             ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2006-02-24 16:44 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andrew Morton, sekharan, Kernel development list

On Fri, Feb 24, 2006 at 11:44:23AM -0500, Alan Stern wrote:
> In that case you should be worried not about acquiring and releasing the 
> rwsem at the beginning and end of blocking_notifier_call_chain; you should 
> be worried about all the RCU serialization in the core 
> notifier_call_chain routine.

RCU doesn't synchronize readers.

> The atomic chains are a different matter.  The ones that don't run in NMI 
> context could use an rw-spinlock for protection, allowing them also to 
> avoid memory barriers while going through the list.  The notifier chains 
> that do run in NMI don't have this luxury.  Fortunately I don't think 
> there are very many of them.

A read lock is a memory barrier.  That's why I'm opposed to using non-rcu 
style locking for them.

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 15:15                       ` Benjamin LaHaise
@ 2006-02-24 16:44                         ` Alan Stern
  2006-02-24 16:44                           ` Benjamin LaHaise
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-24 16:44 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, sekharan, Kernel development list

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Fri, Feb 24, 2006 at 10:04:23AM -0500, Alan Stern wrote:
> > What do you think of the two suggestions in my previous message?
> 
> Even if the read version of the lock only touches a cacheline local to 
> the cpu, you'd still have to use the lock prefix to allow for correctness 
> when a writer comes along.  It is not cacheline bouncing that worries me, 
> it is serialising instructions and memory barriers as those hurt immensely 
> when the data is in the cache.  I've been looking at a lot of profiles on 
> P4s of late, and every single locked instruction is painful as it means 
> all of the memory ordering rules come into play.  Neither suggestion 
> addresses that overhead that has been introduced.

In that case you should be worried not about acquiring and releasing the 
rwsem at the beginning and end of blocking_notifier_call_chain; you should 
be worried about all the RCU serialization in the core 
notifier_call_chain routine.

In fact, that could be changed for blocking notifier chains.  Since we own 
the readlock, we know that the list won't get changed while we're using 
it.  So the blocking version of the call-chain routine could avoid using 
the RCU dereferencing mechanism.

The atomic chains are a different matter.  The ones that don't run in NMI 
context could use an rw-spinlock for protection, allowing them also to 
avoid memory barriers while going through the list.  The notifier chains 
that do run in NMI don't have this luxury.  Fortunately I don't think 
there are very many of them.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 16:44                           ` Benjamin LaHaise
@ 2006-02-24 17:59                             ` Alan Stern
  2006-02-24 18:37                               ` Benjamin LaHaise
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2006-02-24 17:59 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, sekharan, Kernel development list

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Fri, Feb 24, 2006 at 11:44:23AM -0500, Alan Stern wrote:
> > In that case you should be worried not about acquiring and releasing the 
> > rwsem at the beginning and end of blocking_notifier_call_chain; you should 
> > be worried about all the RCU serialization in the core 
> > notifier_call_chain routine.
> 
> RCU doesn't synchronize readers.

It does on architectures where smp_read_barrier_depends() expands into
something nontrivial.  Maybe that doesn't include any of the machines
you're interested in.

> > The atomic chains are a different matter.  The ones that don't run in NMI 
> > context could use an rw-spinlock for protection, allowing them also to 
> > avoid memory barriers while going through the list.  The notifier chains 
> > that do run in NMI don't have this luxury.  Fortunately I don't think 
> > there are very many of them.
> 
> A read lock is a memory barrier.  That's why I'm opposed to using non-rcu 
> style locking for them.

But RCU-style locking can't be used in situations where the reader may 
block.  So it's not possible to use it with blocking notifier chains.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 17:59                             ` Alan Stern
@ 2006-02-24 18:37                               ` Benjamin LaHaise
  2006-02-24 20:21                                 ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2006-02-24 18:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andrew Morton, sekharan, Kernel development list

On Fri, Feb 24, 2006 at 12:59:18PM -0500, Alan Stern wrote:
> It does on architectures where smp_read_barrier_depends() expands into
> something nontrivial.  Maybe that doesn't include any of the machines
> you're interested in.

Which includes all of about 1, I think -- alpha.  In other words, it's 
free.

> > > The atomic chains are a different matter.  The ones that don't run in NMI 
> > > context could use an rw-spinlock for protection, allowing them also to 
> > > avoid memory barriers while going through the list.  The notifier chains 
> > > that do run in NMI don't have this luxury.  Fortunately I don't think 
> > > there are very many of them.
> > 
> > A read lock is a memory barrier.  That's why I'm opposed to using non-rcu 
> > style locking for them.
> 
> But RCU-style locking can't be used in situations where the reader may 
> block.  So it's not possible to use it with blocking notifier chains.

Then we shouldn't have non-atomic notifier chains in performance critical 
codepaths.  The original implementation's hooks into critical paths held 
these characteristics.  If that property has been broken, please fix it 
instead of adding more locking.

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Avoid calling down_read and down_write during startup
  2006-02-24 18:37                               ` Benjamin LaHaise
@ 2006-02-24 20:21                                 ` Alan Stern
  0 siblings, 0 replies; 26+ messages in thread
From: Alan Stern @ 2006-02-24 20:21 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, sekharan, Kernel development list

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> > > A read lock is a memory barrier.  That's why I'm opposed to using non-rcu 
> > > style locking for them.
> > 
> > But RCU-style locking can't be used in situations where the reader may 
> > block.  So it's not possible to use it with blocking notifier chains.
> 
> Then we shouldn't have non-atomic notifier chains in performance critical 
> codepaths.  The original implementation's hooks into critical paths held 
> these characteristics.  If that property has been broken, please fix it 
> instead of adding more locking.

Sorry, no can do.  You'll have to complain to the people who put blocking
code into critical paths in the first place.  I don't know which paths are
critical nor do I know how to change the code to make it non-blocking.

Or you could write some patches yourself instead of asking other people to 
do it for you.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2006-02-24 20:21 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-21 15:54 [PATCH] Register atomic_notifiers in atomic context Alan Stern
2006-02-21 23:28 ` Andrew Morton
2006-02-22 16:08   ` Alan Stern
2006-02-22 16:12     ` Randy.Dunlap
2006-02-23  2:26     ` Andrew Morton
2006-02-23 17:15       ` Alan Stern
2006-02-23 19:03         ` Andrew Morton
2006-02-23 22:28           ` [PATCH] The idle notifier chain should be atomic Alan Stern
2006-02-23 23:49             ` Andi Kleen
2006-02-24  3:24               ` Alan Stern
2006-02-24  3:27                 ` Andi Kleen
2006-02-24  4:04                   ` Alan Stern
2006-02-23 22:36           ` [PATCH] Avoid calling down_read and down_write during startup Alan Stern
2006-02-23 22:37             ` Benjamin LaHaise
2006-02-24  0:16               ` Andrew Morton
2006-02-24  3:18                 ` Alan Stern
2006-02-24 14:40                   ` Benjamin LaHaise
2006-02-24 15:04                     ` Alan Stern
2006-02-24 15:15                       ` Benjamin LaHaise
2006-02-24 16:44                         ` Alan Stern
2006-02-24 16:44                           ` Benjamin LaHaise
2006-02-24 17:59                             ` Alan Stern
2006-02-24 18:37                               ` Benjamin LaHaise
2006-02-24 20:21                                 ` Alan Stern
2006-02-24 14:39                 ` Benjamin LaHaise
2006-02-24 15:03                   ` Alan Stern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).