linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug
@ 2009-12-10 13:07 Xiaotian Feng
  2009-12-10 13:07 ` [PATCH 1/4] clockevents: use list_for_each_entry_safe Xiaotian Feng
  2009-12-10 14:35 ` [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug Thomas Gleixner
  0 siblings, 2 replies; 10+ messages in thread
From: Xiaotian Feng @ 2009-12-10 13:07 UTC (permalink / raw)
  To: tglx, damm, hsweeten, akpm, venkatesh.pallipadi; +Cc: linux-kernel

I ran into a list_del corruption, which was reported in
http://lkml.org/lkml/2009/11/27/45. There was no response, so I tried to
debug it myself.

After I added some printks to show all elements in clockevent_devices, I
found that the kernel hangs when I try to resume from s2ram.

In clockevents_register_device, the clockevents_do_notify ADD call is always
followed by clockevents_notify_released. Although the ADD notification uses
tick_check_new_device to install new devices and move the devices they
replace to the clockevents_released list, clockevents_notify_released then
adds those devices back to the clockevent_devices list.

My system is a quad-core x86_64 box with APIC and HPET enabled. After boot,
the elements in the clockevent_devices list are:
clockevent_device->lapic(3)->hpet5(3)->lapic(2)->hpet4(2)->lapic(1)->hpet3(1)-
  ->lapic(0)->hpet2(0)->hpet(0)
* the number in () is the cpu id

But the active clock_event_devices are hpet2, hpet3, hpet4 and hpet5. Then
during s2ram, cpus 1, 2 and 3 go down, and the CLOCK_EVT_NOTIFY_CPU_DEAD
notification calls tick_shutdown, which deletes hpet2, hpet3, hpet4 and
hpet5 from the clockevent_devices list. So after s2ram, the elements in the
clockevent_devices list are:
clockevent_device->lapic(3)->lapic(2)->lapic(1)->lapic(0)->hpet2(0)->hpet(0)

Then at resume, cpus 1, 2 and 3 come up and register their lapics again,
which does a list_add of each lapic on the clockevent_devices list. E.g. a
list_add of lapic(1) on the above list moves lapic(1) to
clockevent_devices->next, but lapic(2)->next still points to lapic(1), so
the list becomes circular and is corrupted.

This patchset fixes the above behaviour by:
       - on clockevents_register_device, if the ADD notification succeeds,
         moving the new device to the clockevent_devices list, otherwise
         moving it to the clockevents_released list.
       - on clockevents_notify_released, applying the same behaviour as above.
       - on clockevents_notify CPU_DEAD, removing the devices related to the
         dead cpu from the clockevents_released list.

This makes sure that only the active devices of each cpu are on the
clockevent_devices list. With this patchset, the list_del corruption
disappears, and suspend/resume and cpu hotplug work fine on my system.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/4] clockevents: use list_for_each_entry_safe
  2009-12-10 13:07 [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug Xiaotian Feng
@ 2009-12-10 13:07 ` Xiaotian Feng
  2009-12-10 13:07   ` [PATCH 2/4] clockevents: convert clockevents_do_notify to int Xiaotian Feng
  2009-12-10 14:35 ` [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug Thomas Gleixner
  1 sibling, 1 reply; 10+ messages in thread
From: Xiaotian Feng @ 2009-12-10 13:07 UTC (permalink / raw)
  To: tglx, damm, hsweeten, akpm, venkatesh.pallipadi
  Cc: linux-kernel, Xiaotian Feng

The iteration behaviour is the same as before, but later patches will
benefit from this conversion.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Magnus Damm <damm@igel.co.jp>
Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
 kernel/time/clockevents.c |   12 +++++-------
 1 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 20a8920..1896d9d 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -163,11 +163,9 @@ static void clockevents_do_notify(unsigned long reason, void *dev)
  */
 static void clockevents_notify_released(void)
 {
-	struct clock_event_device *dev;
+	struct clock_event_device *dev, *tmp;
 
-	while (!list_empty(&clockevents_released)) {
-		dev = list_entry(clockevents_released.next,
-				 struct clock_event_device, list);
+	list_for_each_entry_safe(dev, tmp, &clockevents_released, list) {
 		list_del(&dev->list);
 		list_add(&dev->list, &clockevent_devices);
 		clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
@@ -238,7 +236,7 @@ void clockevents_exchange_device(struct clock_event_device *old,
  */
 void clockevents_notify(unsigned long reason, void *arg)
 {
-	struct list_head *node, *tmp;
+	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
 
 	spin_lock_irqsave(&clockevents_lock, flags);
@@ -250,8 +248,8 @@ void clockevents_notify(unsigned long reason, void *arg)
 		 * Unregister the clock event devices which were
 		 * released from the users in the notify chain.
 		 */
-		list_for_each_safe(node, tmp, &clockevents_released)
-			list_del(node);
+		list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
+			list_del(&dev->list);
 		break;
 	default:
 		break;
-- 
1.6.5.2



* [PATCH 2/4] clockevents: convert clockevents_do_notify to int
  2009-12-10 13:07 ` [PATCH 1/4] clockevents: use list_for_each_entry_safe Xiaotian Feng
@ 2009-12-10 13:07   ` Xiaotian Feng
  2009-12-10 13:07     ` [PATCH 3/4] clockevents: add device to clockevent_devices list if notify ADD success Xiaotian Feng
  0 siblings, 1 reply; 10+ messages in thread
From: Xiaotian Feng @ 2009-12-10 13:07 UTC (permalink / raw)
  To: tglx, damm, hsweeten, akpm, venkatesh.pallipadi
  Cc: linux-kernel, Xiaotian Feng

We need the result of the CLOCK_EVT_NOTIFY_ADD notification, so convert
clockevents_do_notify to return an int; this is used by later patches.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Magnus Damm <damm@igel.co.jp>
Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
 kernel/time/clockevents.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 1896d9d..980c2c0 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -152,9 +152,9 @@ int clockevents_register_notifier(struct notifier_block *nb)
  * Notify about a clock event change. Called with clockevents_lock
  * held.
  */
-static void clockevents_do_notify(unsigned long reason, void *dev)
+static int clockevents_do_notify(unsigned long reason, void *dev)
 {
-	raw_notifier_call_chain(&clockevents_chain, reason, dev);
+	return raw_notifier_call_chain(&clockevents_chain, reason, dev);
 }
 
 /*
-- 
1.6.5.2



* [PATCH 3/4] clockevents: add device to clockevent_devices list if notify ADD success
  2009-12-10 13:07   ` [PATCH 2/4] clockevents: convert clockevents_do_notify to int Xiaotian Feng
@ 2009-12-10 13:07     ` Xiaotian Feng
  2009-12-10 13:07       ` [PATCH 4/4] clockevents: remove related device from clockevents_released list when cpu is DEAD Xiaotian Feng
  0 siblings, 1 reply; 10+ messages in thread
From: Xiaotian Feng @ 2009-12-10 13:07 UTC (permalink / raw)
  To: tglx, damm, hsweeten, akpm, venkatesh.pallipadi
  Cc: linux-kernel, Xiaotian Feng

When we register a new clock_event_device, CLOCK_EVT_NOTIFY_ADD is notified,
but tick_check_new_device may fail to adopt the new device, so we should only
add the device to the clockevent_devices list when tick_check_new_device
succeeds.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Magnus Damm <damm@igel.co.jp>
Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
 kernel/time/clockevents.c |   20 ++++++++++++++------
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 980c2c0..69d6c58 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -164,11 +164,14 @@ static int clockevents_do_notify(unsigned long reason, void *dev)
 static void clockevents_notify_released(void)
 {
 	struct clock_event_device *dev, *tmp;
+	int ret;
 
 	list_for_each_entry_safe(dev, tmp, &clockevents_released, list) {
-		list_del(&dev->list);
-		list_add(&dev->list, &clockevent_devices);
-		clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
+		ret = clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
+		if (ret == NOTIFY_STOP) {
+			list_del(&dev->list);
+			list_add(&dev->list, &clockevent_devices);
+		}
 	}
 }
 
@@ -179,15 +182,19 @@ static void clockevents_notify_released(void)
 void clockevents_register_device(struct clock_event_device *dev)
 {
 	unsigned long flags;
+	int ret;
 
 	BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
 	BUG_ON(!dev->cpumask);
 
 	spin_lock_irqsave(&clockevents_lock, flags);
 
-	list_add(&dev->list, &clockevent_devices);
-	clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
-	clockevents_notify_released();
+	ret = clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
+	if (ret == NOTIFY_STOP) {
+		list_add(&dev->list, &clockevent_devices);
+		clockevents_notify_released();
+	} else
+		list_add(&dev->list, &clockevents_released);
 
 	spin_unlock_irqrestore(&clockevents_lock, flags);
 }


* [PATCH 4/4] clockevents: remove related device from clockevents_released list when cpu is DEAD
  2009-12-10 13:07     ` [PATCH 3/4] clockevents: add device to clockevent_devices list if notify ADD success Xiaotian Feng
@ 2009-12-10 13:07       ` Xiaotian Feng
  0 siblings, 0 replies; 10+ messages in thread
From: Xiaotian Feng @ 2009-12-10 13:07 UTC (permalink / raw)
  To: tglx, damm, hsweeten, akpm, venkatesh.pallipadi
  Cc: linux-kernel, Xiaotian Feng

We have converted clockevent_devices to store all active devices, and
clockevents_released to store all devices that failed to be added or were
replaced. So when a cpu goes offline, we should also remove the clockevents
related to the offline cpu from the clockevents_released list.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Magnus Damm <damm@igel.co.jp>
Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
 kernel/time/clockevents.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index babcb30..d27fcba 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -246,6 +246,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
+	int *cpup;
 
 	spin_lock_irqsave(&clockevents_lock, flags);
 	clockevents_do_notify(reason, arg);
@@ -256,8 +257,10 @@ void clockevents_notify(unsigned long reason, void *arg)
 		 * Unregister the clock event devices which were
 		 * released from the users in the notify chain.
 		 */
+		cpup = (int *)arg;
 		list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
-			list_del(&dev->list);
+			if (cpumask_test_cpu(*cpup, dev->cpumask))
+				list_del(&dev->list);
 		break;
 	default:
 		break;
-- 
1.6.5.2



* Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug
  2009-12-10 13:07 [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug Xiaotian Feng
  2009-12-10 13:07 ` [PATCH 1/4] clockevents: use list_for_each_entry_safe Xiaotian Feng
@ 2009-12-10 14:35 ` Thomas Gleixner
  2009-12-11  2:29   ` Xiaotian Feng
  2010-01-17  9:28   ` Ozan Çağlayan
  1 sibling, 2 replies; 10+ messages in thread
From: Thomas Gleixner @ 2009-12-10 14:35 UTC (permalink / raw)
  To: Xiaotian Feng; +Cc: damm, hsweeten, akpm, venkatesh.pallipadi, linux-kernel

On Thu, 10 Dec 2009, Xiaotian Feng wrote:
> I ran into a list_del corruption, which was reported in
> http://lkml.org/lkml/2009/11/27/45. There was no response, so I tried to
> debug it myself.
> 
> After I added some printks to show all elements in clockevent_devices, I
> found that the kernel hangs when I try to resume from s2ram.
> 
> In clockevents_register_device, the clockevents_do_notify ADD call is always
> followed by clockevents_notify_released. Although the ADD notification uses
> tick_check_new_device to install new devices and move the devices they
> replace to the clockevents_released list, clockevents_notify_released then
> adds those devices back to the clockevent_devices list.
> 
> My system is a quad-core x86_64 box with APIC and HPET enabled. After boot,
> the elements in the clockevent_devices list are:
> clockevent_device->lapic(3)->hpet5(3)->lapic(2)->hpet4(2)->lapic(1)->hpet3(1)-
>   ->lapic(0)->hpet2(0)->hpet(0)
> * the number in () is the cpu id
> 
> But the active clock_event_devices are hpet2, hpet3, hpet4 and hpet5. Then
> during s2ram, cpus 1, 2 and 3 go down, and the CLOCK_EVT_NOTIFY_CPU_DEAD
> notification calls tick_shutdown, which deletes hpet2, hpet3, hpet4 and
> hpet5 from the clockevent_devices list. So after s2ram, the elements in the
> clockevent_devices list are:
> clockevent_device->lapic(3)->lapic(2)->lapic(1)->lapic(0)->hpet2(0)->hpet(0)
> 
> Then at resume, cpus 1, 2 and 3 come up and register their lapics again,
> which does a list_add of each lapic on the clockevent_devices list. E.g. a
> list_add of lapic(1) on the above list moves lapic(1) to
> clockevent_devices->next, but lapic(2)->next still points to lapic(1), so
> the list becomes circular and is corrupted.

Great detective work !

> This patchset fixes the above behaviour by:
>        - on clockevents_register_device, if the ADD notification succeeds,
>          moving the new device to the clockevent_devices list, otherwise
>          moving it to the clockevents_released list.
>        - on clockevents_notify_released, applying the same behaviour as above.
>        - on clockevents_notify CPU_DEAD, removing the devices related to the
>          dead cpu from the clockevents_released list.
> 
> This makes sure that only the active devices of each cpu are on the
> clockevent_devices list. With this patchset, the list_del corruption
> disappears, and suspend/resume and cpu hotplug work fine on my system.

I'm not happy about that churn. Why don't we simply scan the
clockevent_devices list for leftovers of the dead CPU ?

Untested patch below solves the same problem.

Thanks,

	tglx
----
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 20a8920..5dd857f 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -238,8 +238,9 @@ void clockevents_exchange_device(struct clock_event_device *old,
  */
 void clockevents_notify(unsigned long reason, void *arg)
 {
-	struct list_head *node, *tmp;
+	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
+	int cpu;
 
 	spin_lock_irqsave(&clockevents_lock, flags);
 	clockevents_do_notify(reason, arg);
@@ -250,8 +251,19 @@ void clockevents_notify(unsigned long reason, void *arg)
 		 * Unregister the clock event devices which were
 		 * released from the users in the notify chain.
 		 */
-		list_for_each_safe(node, tmp, &clockevents_released)
-			list_del(node);
+		list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
+			list_del(&dev->list);
+		/*
+		 * Now check whether the CPU has left unused per cpu devices
+		 */
+		cpu = *((int *)arg);
+		list_for_each_entry_safe(dev, tmp, &clockevent_devices, list) {
+			if (cpumask_test_cpu(cpu, dev->cpumask) &&
+			    cpumask_weight(dev->cpumask) == 1) {
+				BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
+				list_del(&dev->list);
+			}
+		}
 		break;
 	default:
 		break;




* Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug
  2009-12-10 14:35 ` [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug Thomas Gleixner
@ 2009-12-11  2:29   ` Xiaotian Feng
  2010-01-17  9:28   ` Ozan Çağlayan
  1 sibling, 0 replies; 10+ messages in thread
From: Xiaotian Feng @ 2009-12-11  2:29 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: damm, hsweeten, akpm, venkatesh.pallipadi, linux-kernel

On 12/10/2009 10:35 PM, Thomas Gleixner wrote:
> On Thu, 10 Dec 2009, Xiaotian Feng wrote:
>> I ran into a list_del corruption, which was reported in
>> http://lkml.org/lkml/2009/11/27/45. There was no response, so I tried to
>> debug it myself.
>>
>> After I added some printks to show all elements in clockevent_devices, I
>> found that the kernel hangs when I try to resume from s2ram.
>>
>> In clockevents_register_device, the clockevents_do_notify ADD call is always
>> followed by clockevents_notify_released. Although the ADD notification uses
>> tick_check_new_device to install new devices and move the devices they
>> replace to the clockevents_released list, clockevents_notify_released then
>> adds those devices back to the clockevent_devices list.
>>
>> My system is a quad-core x86_64 box with APIC and HPET enabled. After boot,
>> the elements in the clockevent_devices list are:
>> clockevent_device->lapic(3)->hpet5(3)->lapic(2)->hpet4(2)->lapic(1)->hpet3(1)-
>>    ->lapic(0)->hpet2(0)->hpet(0)
>> * the number in () is the cpu id
>>
>> But the active clock_event_devices are hpet2, hpet3, hpet4 and hpet5. Then
>> during s2ram, cpus 1, 2 and 3 go down, and the CLOCK_EVT_NOTIFY_CPU_DEAD
>> notification calls tick_shutdown, which deletes hpet2, hpet3, hpet4 and
>> hpet5 from the clockevent_devices list. So after s2ram, the elements in the
>> clockevent_devices list are:
>> clockevent_device->lapic(3)->lapic(2)->lapic(1)->lapic(0)->hpet2(0)->hpet(0)
>>
>> Then at resume, cpus 1, 2 and 3 come up and register their lapics again,
>> which does a list_add of each lapic on the clockevent_devices list. E.g. a
>> list_add of lapic(1) on the above list moves lapic(1) to
>> clockevent_devices->next, but lapic(2)->next still points to lapic(1), so
>> the list becomes circular and is corrupted.
>
> Great detective work !
>
>> This patchset fixes the above behaviour by:
>>         - on clockevents_register_device, if the ADD notification succeeds,
>>           moving the new device to the clockevent_devices list, otherwise
>>           moving it to the clockevents_released list.
>>         - on clockevents_notify_released, applying the same behaviour as above.
>>         - on clockevents_notify CPU_DEAD, removing the devices related to the
>>           dead cpu from the clockevents_released list.
>>
>> This makes sure that only the active devices of each cpu are on the
>> clockevent_devices list. With this patchset, the list_del corruption
>> disappears, and suspend/resume and cpu hotplug work fine on my system.
>
> I'm not happy about that churn. Why don't we simply scan the
> clockevent_devices list for leftovers of the dead CPU ?
My only thought was to make the clockevent_devices list store only the
active devices of each cpu, with all other inactive devices stored on
clockevents_released, but this makes things complex. Your patch is better.
>
> Untested patch below solves the same problem.
Yes, this also resolves my list_del warning. Thanks
>
> Thanks,
>
> 	tglx
> ----
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 20a8920..5dd857f 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -238,8 +238,9 @@ void clockevents_exchange_device(struct clock_event_device *old,
>    */
>   void clockevents_notify(unsigned long reason, void *arg)
>   {
> -	struct list_head *node, *tmp;
> +	struct clock_event_device *dev, *tmp;
>   	unsigned long flags;
> +	int cpu;
>
>   	spin_lock_irqsave(&clockevents_lock, flags);
>   	clockevents_do_notify(reason, arg);
> @@ -250,8 +251,19 @@ void clockevents_notify(unsigned long reason, void *arg)
>   		 * Unregister the clock event devices which were
>   		 * released from the users in the notify chain.
>   		 */
> -		list_for_each_safe(node, tmp,&clockevents_released)
> -			list_del(node);
> +		list_for_each_entry_safe(dev, tmp,&clockevents_released, list)
> +			list_del(&dev->list);
> +		/*
> +		 * Now check whether the CPU has left unused per cpu devices
> +		 */
> +		cpu = *((int *)arg);
> +		list_for_each_entry_safe(dev, tmp,&clockevent_devices, list) {
> +			if (cpumask_test_cpu(cpu, dev->cpumask)&&
> +			    cpumask_weight(dev->cpumask) == 1) {
> +				BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
> +				list_del(&dev->list);
> +			}
> +		}
>   		break;
>   	default:
>   		break;
>
>
>



* Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug
  2009-12-10 14:35 ` [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug Thomas Gleixner
  2009-12-11  2:29   ` Xiaotian Feng
@ 2010-01-17  9:28   ` Ozan Çağlayan
  2010-01-18  2:30     ` Xiaotian Feng
  1 sibling, 1 reply; 10+ messages in thread
From: Ozan Çağlayan @ 2010-01-17  9:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Xiaotian Feng, damm, hsweeten, akpm, venkatesh.pallipadi, linux-kernel

Thomas Gleixner wrote:

> 
> I'm not happy about that churn. Why don't we simply scan the
> clockevent_devices list for leftovers of the dead CPU ?
> 
> Untested patch below solves the same problem.


Hi, I have 3 bug reports about a clockevent failure while halting the system (2.6.31.11 kernel).
They pinpoint exactly line 262, which was changed by this patch when it was merged into 2.6.31.10 (and also 2.6.32.3):

[4406.986777] kernel BUG at kernel/time/clockevents.c:262!
[4406.986777] invalid opcode: 0000 [#1] SMP
[4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate


All 3 systems have Pentium 4 processors with different clock speeds.

Here's a screenshot of the full stacktrace:
http://bugs.pardus.org.tr/attachment.cgi?id=4820

and the relevant bugzilla report:
http://bugzilla.kernel.org/show_bug.cgi?id=15073

Thanks,
Ozan Caglayan
Pardus Linux -- http://www.pardus.org.tr/eng


* Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug
  2010-01-17  9:28   ` Ozan Çağlayan
@ 2010-01-18  2:30     ` Xiaotian Feng
  2010-01-18 13:51       ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Xiaotian Feng @ 2010-01-18  2:30 UTC (permalink / raw)
  To: Ozan Çağlayan
  Cc: Thomas Gleixner, damm, hsweeten, akpm, venkatesh.pallipadi, linux-kernel

On 01/17/2010 05:28 PM, Ozan Çağlayan wrote:
> Thomas Gleixner wrote:
>
>>
>> I'm not happy about that churn. Why don't we simply scan the
>> clockevent_devices list for leftovers of the dead CPU ?
>>
>> Untested patch below solves the same problem.
>
>
> Hi, I have 3 bug reports about a clockevent failure while halting the system (2.6.31.11 kernel).
> They pinpoint exactly line 262, which was changed by this patch when it was merged into 2.6.31.10 (and also 2.6.32.3):
>
> [4406.986777] kernel BUG at kernel/time/clockevents.c:262!
> [4406.986777] invalid opcode: 0000 [#1] SMP
> [4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate

I think this one is a duplicate of
http://bugzilla.kernel.org/show_bug.cgi?id=15037
Thomas is working on the fix. I've sent a workaround patch at
http://patchwork.kernel.org/patch/71537/

>
>
> All 3 systems have Pentium 4 processors with different clock speeds.
>
> Here's a screenshot of the full stacktrace:
> http://bugs.pardus.org.tr/attachment.cgi?id=4820
>
> and the relevant bugzilla report:
> http://bugzilla.kernel.org/show_bug.cgi?id=15073
>
> Thanks,
> Ozan Caglayan
> Pardus Linux -- http://www.pardus.org.tr/eng
>



* Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug
  2010-01-18  2:30     ` Xiaotian Feng
@ 2010-01-18 13:51       ` Thomas Gleixner
  0 siblings, 0 replies; 10+ messages in thread
From: Thomas Gleixner @ 2010-01-18 13:51 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: Ozan Çağlayan, damm, hsweeten, akpm,
	venkatesh.pallipadi, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1127 bytes --]

On Mon, 18 Jan 2010, Xiaotian Feng wrote:

> On 01/17/2010 05:28 PM, Ozan Çağlayan wrote:
> > Thomas Gleixner wrote:
> > 
> > > 
> > > I'm not happy about that churn. Why don't we simply scan the
> > > clockevent_devices list for leftovers of the dead CPU ?
> > > 
> > > Untested patch below solves the same problem.
> > 
> > 
> > Hi, I have 3 bug reports about a clockevent failure while halting the system
> > (2.6.31.11 kernel).
> > They pinpoint exactly line 262, which was changed by this patch when it was
> > merged into 2.6.31.10 (and also 2.6.32.3):
> > 
> > [4406.986777] kernel BUG at kernel/time/clockevents.c:262!
> > [4406.986777] invalid opcode: 0000 [#1] SMP
> > [4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate
> 
> I think this one is a duplicate of
> http://bugzilla.kernel.org/show_bug.cgi?id=15037
> Thomas is working on the fix. I've sent a workaround patch at
> http://patchwork.kernel.org/patch/71537/

I just applied your patch, but kept the cpumask_weight check. This is the
least intrusive solution for now. The logic needs an overhaul, but that's
neither rc4 nor stable material.

Thanks,

	tglx

