All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
@ 2012-07-06  3:16 ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-06  3:16 UTC (permalink / raw)
  To: linux-acpi, lenb; +Cc: linux-kernel

Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
But in this case, it should return error number since some process may run on
the cpu. If the cpu has a running process and the cpu is turned the power off,
the system cannot work well.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 drivers/acpi/processor_driver.c |   18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
+++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
@@ -610,7 +610,7 @@ err_free_pr:
 static int acpi_processor_remove(struct acpi_device *device, int type)
 {
 	struct acpi_processor *pr = NULL;
-
+	int ret;

 	if (!device || !acpi_driver_data(device))
 		return -EINVAL;
@@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
 		goto free;

 	if (type == ACPI_BUS_REMOVAL_EJECT) {
-		if (acpi_processor_handle_eject(pr))
-			return -EINVAL;
+		ret = acpi_processor_handle_eject(pr);
+		if (ret)
+			return ret;
 	}

 	acpi_processor_power_exit(pr, device);
@@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd

 static int acpi_processor_handle_eject(struct acpi_processor *pr)
 {
-	if (cpu_online(pr->id))
-		cpu_down(pr->id);
+	int ret;
+
+	if (cpu_online(pr->id)) {
+		ret = cpu_down(pr->id);
+		if (ret)
+			return ret;
+	}

 	arch_unregister_cpu(pr->id);
 	acpi_unmap_lsapic(pr->id);
-	return (0);
+	return ret;
 }
 #else
 static acpi_status acpi_processor_hotadd_init(struct acpi_processor *pr)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
@ 2012-07-06  3:16 ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-06  3:16 UTC (permalink / raw)
  To: linux-acpi, lenb; +Cc: linux-kernel

Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
But in this case, it should return error number since some process may run on
the cpu. If the cpu has a running process and the cpu is turned the power off,
the system cannot work well.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 drivers/acpi/processor_driver.c |   18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
+++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
@@ -610,7 +610,7 @@ err_free_pr:
 static int acpi_processor_remove(struct acpi_device *device, int type)
 {
 	struct acpi_processor *pr = NULL;
-
+	int ret;

 	if (!device || !acpi_driver_data(device))
 		return -EINVAL;
@@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
 		goto free;

 	if (type == ACPI_BUS_REMOVAL_EJECT) {
-		if (acpi_processor_handle_eject(pr))
-			return -EINVAL;
+		ret = acpi_processor_handle_eject(pr);
+		if (ret)
+			return ret;
 	}

 	acpi_processor_power_exit(pr, device);
@@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd

 static int acpi_processor_handle_eject(struct acpi_processor *pr)
 {
-	if (cpu_online(pr->id))
-		cpu_down(pr->id);
+	int ret;
+
+	if (cpu_online(pr->id)) {
+		ret = cpu_down(pr->id);
+		if (ret)
+			return ret;
+	}

 	arch_unregister_cpu(pr->id);
 	acpi_unmap_lsapic(pr->id);
-	return (0);
+	return ret;
 }
 #else
 static acpi_status acpi_processor_hotadd_init(struct acpi_processor *pr)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/2] acpi_bus_trim() stops removing devices when failing to remove the device
  2012-07-06  3:16 ` Yasuaki Ishimatsu
@ 2012-07-06  3:19   ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-06  3:19 UTC (permalink / raw)
  To: linux-acpi, lenb; +Cc: linux-kernel

acpi_bus_trim() stops removing devices, when acpi_bus_remove() return error
number. But acpi_bus_remove() cannot return error number correctly.
acpi_bus_remove() only return -EINVAL, when dev argument is NULL. Thus even if
device cannot be removed correctly, acpi_bus_trim() ignores and continues to
remove devices. acpi_bus_hot_remove_device() uses acpi_bus_trim() for removing
devices. Therefore acpi_bus_hot_remove_device() can send "_EJ0" to firmware,
even if the device is running on the system. In this case, the system cannot
work well. So acpi_bus_trim() should check whether device was removed or not
correctly. The patch adds error check into some functions to remove the device.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 drivers/acpi/scan.c    |   15 ++++++++++++---
 drivers/base/dd.c      |   22 +++++++++++++++++-----
 include/linux/device.h |    2 +-
 3 files changed, 30 insertions(+), 9 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/scan.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/scan.c	2012-07-05 21:14:39.196187725 +0900
+++ linux-3.5-rc4/drivers/acpi/scan.c	2012-07-05 21:15:06.077851657 +0900
@@ -425,12 +425,17 @@ static int acpi_device_remove(struct dev
 {
 	struct acpi_device *acpi_dev = to_acpi_device(dev);
 	struct acpi_driver *acpi_drv = acpi_dev->driver;
+	int ret;

 	if (acpi_drv) {
 		if (acpi_drv->ops.notify)
 			acpi_device_remove_notify_handler(acpi_dev);
-		if (acpi_drv->ops.remove)
-			acpi_drv->ops.remove(acpi_dev, acpi_dev->removal_type);
+		if (acpi_drv->ops.remove) {
+			ret = acpi_drv->ops.remove(acpi_dev,
+						   acpi_dev->removal_type);
+			if (ret)
+				return ret;
+		}
 	}
 	acpi_dev->driver = NULL;
 	acpi_dev->driver_data = NULL;
@@ -1208,11 +1213,15 @@ static int acpi_device_set_context(struc

 static int acpi_bus_remove(struct acpi_device *dev, int rmdevice)
 {
+	int ret;
+
 	if (!dev)
 		return -EINVAL;

 	dev->removal_type = ACPI_BUS_REMOVAL_EJECT;
-	device_release_driver(&dev->dev);
+	ret = device_release_driver(&dev->dev);
+	if (ret)
+		return ret;

 	if (!rmdevice)
 		return 0;
Index: linux-3.5-rc4/drivers/base/dd.c
===================================================================
--- linux-3.5-rc4.orig/drivers/base/dd.c	2012-07-05 21:14:39.192187776 +0900
+++ linux-3.5-rc4/drivers/base/dd.c	2012-07-05 21:15:06.079851633 +0900
@@ -464,9 +464,10 @@ EXPORT_SYMBOL_GPL(driver_attach);
  * __device_release_driver() must be called with @dev lock held.
  * When called for a USB interface, @dev->parent lock must be held as well.
  */
-static void __device_release_driver(struct device *dev)
+static int __device_release_driver(struct device *dev)
 {
 	struct device_driver *drv;
+	int ret;

 	drv = dev->driver;
 	if (drv) {
@@ -482,9 +483,11 @@ static void __device_release_driver(stru
 		pm_runtime_put_sync(dev);

 		if (dev->bus && dev->bus->remove)
-			dev->bus->remove(dev);
+			ret = dev->bus->remove(dev);
 		else if (drv->remove)
-			drv->remove(dev);
+			ret = drv->remove(dev);
+		if (ret)
+			goto rollback;
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
@@ -494,6 +497,12 @@ static void __device_release_driver(stru
 						     dev);

 	}
+
+	return ret;
+
+rollback:
+	driver_sysfs_add(dev);
+	return ret;
 }

 /**
@@ -503,16 +512,19 @@ static void __device_release_driver(stru
  * Manually detach device from driver.
  * When called for a USB interface, @dev->parent lock must be held.
  */
-void device_release_driver(struct device *dev)
+int device_release_driver(struct device *dev)
 {
+	int ret;
 	/*
 	 * If anyone calls device_release_driver() recursively from
 	 * within their ->remove callback for the same device, they
 	 * will deadlock right here.
 	 */
 	device_lock(dev);
-	__device_release_driver(dev);
+	ret = __device_release_driver(dev);
 	device_unlock(dev);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(device_release_driver);

Index: linux-3.5-rc4/include/linux/device.h
===================================================================
--- linux-3.5-rc4.orig/include/linux/device.h	2012-07-05 21:14:39.202187649 +0900
+++ linux-3.5-rc4/include/linux/device.h	2012-07-05 21:15:06.095851431 +0900
@@ -827,7 +827,7 @@ static inline void *dev_get_platdata(con
  * for information on use.
  */
 extern int __must_check device_bind_driver(struct device *dev);
-extern void device_release_driver(struct device *dev);
+extern int device_release_driver(struct device *dev);
 extern int  __must_check device_attach(struct device *dev);
 extern int __must_check driver_attach(struct device_driver *drv);
 extern int __must_check device_reprobe(struct device *dev);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/2] acpi_bus_trim() stops removing devices when failing to remove the device
@ 2012-07-06  3:19   ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-06  3:19 UTC (permalink / raw)
  To: linux-acpi, lenb; +Cc: linux-kernel

acpi_bus_trim() stops removing devices, when acpi_bus_remove() return error
number. But acpi_bus_remove() cannot return error number correctly.
acpi_bus_remove() only return -EINVAL, when dev argument is NULL. Thus even if
device cannot be removed correctly, acpi_bus_trim() ignores and continues to
remove devices. acpi_bus_hot_remove_device() uses acpi_bus_trim() for removing
devices. Therefore acpi_bus_hot_remove_device() can send "_EJ0" to firmware,
even if the device is running on the system. In this case, the system cannot
work well. So acpi_bus_trim() should check whether device was removed or not
correctly. The patch adds error check into some functions to remove the device.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 drivers/acpi/scan.c    |   15 ++++++++++++---
 drivers/base/dd.c      |   22 +++++++++++++++++-----
 include/linux/device.h |    2 +-
 3 files changed, 30 insertions(+), 9 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/scan.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/scan.c	2012-07-05 21:14:39.196187725 +0900
+++ linux-3.5-rc4/drivers/acpi/scan.c	2012-07-05 21:15:06.077851657 +0900
@@ -425,12 +425,17 @@ static int acpi_device_remove(struct dev
 {
 	struct acpi_device *acpi_dev = to_acpi_device(dev);
 	struct acpi_driver *acpi_drv = acpi_dev->driver;
+	int ret;

 	if (acpi_drv) {
 		if (acpi_drv->ops.notify)
 			acpi_device_remove_notify_handler(acpi_dev);
-		if (acpi_drv->ops.remove)
-			acpi_drv->ops.remove(acpi_dev, acpi_dev->removal_type);
+		if (acpi_drv->ops.remove) {
+			ret = acpi_drv->ops.remove(acpi_dev,
+						   acpi_dev->removal_type);
+			if (ret)
+				return ret;
+		}
 	}
 	acpi_dev->driver = NULL;
 	acpi_dev->driver_data = NULL;
@@ -1208,11 +1213,15 @@ static int acpi_device_set_context(struc

 static int acpi_bus_remove(struct acpi_device *dev, int rmdevice)
 {
+	int ret;
+
 	if (!dev)
 		return -EINVAL;

 	dev->removal_type = ACPI_BUS_REMOVAL_EJECT;
-	device_release_driver(&dev->dev);
+	ret = device_release_driver(&dev->dev);
+	if (ret)
+		return ret;

 	if (!rmdevice)
 		return 0;
Index: linux-3.5-rc4/drivers/base/dd.c
===================================================================
--- linux-3.5-rc4.orig/drivers/base/dd.c	2012-07-05 21:14:39.192187776 +0900
+++ linux-3.5-rc4/drivers/base/dd.c	2012-07-05 21:15:06.079851633 +0900
@@ -464,9 +464,10 @@ EXPORT_SYMBOL_GPL(driver_attach);
  * __device_release_driver() must be called with @dev lock held.
  * When called for a USB interface, @dev->parent lock must be held as well.
  */
-static void __device_release_driver(struct device *dev)
+static int __device_release_driver(struct device *dev)
 {
 	struct device_driver *drv;
+	int ret;

 	drv = dev->driver;
 	if (drv) {
@@ -482,9 +483,11 @@ static void __device_release_driver(stru
 		pm_runtime_put_sync(dev);

 		if (dev->bus && dev->bus->remove)
-			dev->bus->remove(dev);
+			ret = dev->bus->remove(dev);
 		else if (drv->remove)
-			drv->remove(dev);
+			ret = drv->remove(dev);
+		if (ret)
+			goto rollback;
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
@@ -494,6 +497,12 @@ static void __device_release_driver(stru
 						     dev);

 	}
+
+	return ret;
+
+rollback:
+	driver_sysfs_add(dev);
+	return ret;
 }

 /**
@@ -503,16 +512,19 @@ static void __device_release_driver(stru
  * Manually detach device from driver.
  * When called for a USB interface, @dev->parent lock must be held.
  */
-void device_release_driver(struct device *dev)
+int device_release_driver(struct device *dev)
 {
+	int ret;
 	/*
 	 * If anyone calls device_release_driver() recursively from
 	 * within their ->remove callback for the same device, they
 	 * will deadlock right here.
 	 */
 	device_lock(dev);
-	__device_release_driver(dev);
+	ret = __device_release_driver(dev);
 	device_unlock(dev);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(device_release_driver);

Index: linux-3.5-rc4/include/linux/device.h
===================================================================
--- linux-3.5-rc4.orig/include/linux/device.h	2012-07-05 21:14:39.202187649 +0900
+++ linux-3.5-rc4/include/linux/device.h	2012-07-05 21:15:06.095851431 +0900
@@ -827,7 +827,7 @@ static inline void *dev_get_platdata(con
  * for information on use.
  */
 extern int __must_check device_bind_driver(struct device *dev);
-extern void device_release_driver(struct device *dev);
+extern int device_release_driver(struct device *dev);
 extern int  __must_check device_attach(struct device *dev);
 extern int __must_check driver_attach(struct device_driver *drv);
 extern int __must_check device_reprobe(struct device *dev);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-06  3:16 ` Yasuaki Ishimatsu
  (?)
  (?)
@ 2012-07-06  9:51 ` Srivatsa S. Bhat
  2012-07-09  2:31     ` Yasuaki Ishimatsu
  -1 siblings, 1 reply; 21+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-06  9:51 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: linux-acpi, lenb, linux-kernel

On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.

Ouch!

> But in this case, it should return error number since some process may run on
> the cpu. If the cpu has a running process and the cpu is turned the power off,
> the system cannot work well.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> ---
>  drivers/acpi/processor_driver.c |   18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> ===================================================================
> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
> @@ -610,7 +610,7 @@ err_free_pr:
>  static int acpi_processor_remove(struct acpi_device *device, int type)
>  {
>  	struct acpi_processor *pr = NULL;
> -
> +	int ret;
> 
>  	if (!device || !acpi_driver_data(device))
>  		return -EINVAL;
> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>  		goto free;
> 
>  	if (type == ACPI_BUS_REMOVAL_EJECT) {
> -		if (acpi_processor_handle_eject(pr))
> -			return -EINVAL;
> +		ret = acpi_processor_handle_eject(pr);
> +		if (ret)
> +			return ret;
>  	}
> 
>  	acpi_processor_power_exit(pr, device);
> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
> 
>  static int acpi_processor_handle_eject(struct acpi_processor *pr)
>  {
> -	if (cpu_online(pr->id))
> -		cpu_down(pr->id);
> +	int ret;
> +
> +	if (cpu_online(pr->id)) {
> +		ret = cpu_down(pr->id);
> +		if (ret)
> +			return ret;
> +	}
>

Strictly speaking, this is not thorough enough. What prevents someone
from onlining that same cpu again, at this point?
So, IMHO, you need to wrap the contents of this function inside a
get_online_cpus()/put_online_cpus() block, to prevent anyone else
from messing with CPU hotplug at the same time.

Regards,
Srivatsa S. Bhat

>  	arch_unregister_cpu(pr->id);
>  	acpi_unmap_lsapic(pr->id);
> -	return (0);
> +	return ret;
>  }
>  #else
>  static acpi_status acpi_processor_hotadd_init(struct acpi_processor *pr)
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-06  9:51 ` [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails Srivatsa S. Bhat
@ 2012-07-09  2:31     ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-09  2:31 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-acpi, lenb, linux-kernel

Hi Srivatsa,

Thank you for your reviewing.

2012/07/06 18:51, Srivatsa S. Bhat wrote:
> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
> 
> Ouch!
> 
>> But in this case, it should return error number since some process may run on
>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>> the system cannot work well.
>>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>
>> ---
>>   drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>   1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>> @@ -610,7 +610,7 @@ err_free_pr:
>>   static int acpi_processor_remove(struct acpi_device *device, int type)
>>   {
>>   	struct acpi_processor *pr = NULL;
>> -
>> +	int ret;
>>
>>   	if (!device || !acpi_driver_data(device))
>>   		return -EINVAL;
>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>   		goto free;
>>
>>   	if (type == ACPI_BUS_REMOVAL_EJECT) {
>> -		if (acpi_processor_handle_eject(pr))
>> -			return -EINVAL;
>> +		ret = acpi_processor_handle_eject(pr);
>> +		if (ret)
>> +			return ret;
>>   	}
>>
>>   	acpi_processor_power_exit(pr, device);
>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>
>>   static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>   {
>> -	if (cpu_online(pr->id))
>> -		cpu_down(pr->id);
>> +	int ret;
>> +
>> +	if (cpu_online(pr->id)) {
>> +		ret = cpu_down(pr->id);
>> +		if (ret)
>> +			return ret;
>> +	}
>>
> 
> Strictly speaking, this is not thorough enough. What prevents someone
> from onlining that same cpu again, at this point?
> So, IMHO, you need to wrap the contents of this function inside a
> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> from messing with CPU hotplug at the same time.

If I understand your comment by mistake, please let me know.
If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.

+	get_online_cpus()
+	if (cpu_online(pr->id)) {
+		ret = cpu_down(pr->id);
+		if (ret)
+			return ret;
+	}
+	put_online_cpus()

I think following patch can prevent it correctly. How about the patch?

---
 drivers/acpi/processor_driver.c |   12 ++++++++++++
 kernel/cpu.c                    |    8 +++++---
 2 files changed, 17 insertions(+), 3 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
+++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
@@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
 {
 	int ret;

+retry:
 	if (cpu_online(pr->id)) {
 		ret = cpu_down(pr->id);
 		if (ret)
 			return ret;
 	}

+	get_online_cpus();
+	/*
+	 * Someone might online the cpu again at this point. So we check that
+	 * cpu has been onlined or not. If cpu is online, we try to offline
+	 * the cpu again.
+	 */
+	if (cpu_online(pr->id)) {
+		put_online_cpus();
+		goto retry;
+	}
 	arch_unregister_cpu(pr->id);
 	acpi_unmap_lsapic(pr->id);
+	put_online_cpus();
 	return ret;
 }
 #else
Index: linux-3.5-rc4/kernel/cpu.c
===================================================================
--- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
+++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
@@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct task_struct *idle;

-	if (cpu_online(cpu) || !cpu_present(cpu))
-		return -EINVAL;
-
 	cpu_hotplug_begin();

+	if (cpu_online(cpu) || !cpu_present(cpu)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	idle = idle_thread_get(cpu);
 	if (IS_ERR(idle)) {
 		ret = PTR_ERR(idle);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
@ 2012-07-09  2:31     ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-09  2:31 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-acpi, lenb, linux-kernel

Hi Srivatsa,

Thank you for your reviewing.

2012/07/06 18:51, Srivatsa S. Bhat wrote:
> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
> 
> Ouch!
> 
>> But in this case, it should return error number since some process may run on
>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>> the system cannot work well.
>>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>
>> ---
>>   drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>   1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>> @@ -610,7 +610,7 @@ err_free_pr:
>>   static int acpi_processor_remove(struct acpi_device *device, int type)
>>   {
>>   	struct acpi_processor *pr = NULL;
>> -
>> +	int ret;
>>
>>   	if (!device || !acpi_driver_data(device))
>>   		return -EINVAL;
>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>   		goto free;
>>
>>   	if (type == ACPI_BUS_REMOVAL_EJECT) {
>> -		if (acpi_processor_handle_eject(pr))
>> -			return -EINVAL;
>> +		ret = acpi_processor_handle_eject(pr);
>> +		if (ret)
>> +			return ret;
>>   	}
>>
>>   	acpi_processor_power_exit(pr, device);
>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>
>>   static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>   {
>> -	if (cpu_online(pr->id))
>> -		cpu_down(pr->id);
>> +	int ret;
>> +
>> +	if (cpu_online(pr->id)) {
>> +		ret = cpu_down(pr->id);
>> +		if (ret)
>> +			return ret;
>> +	}
>>
> 
> Strictly speaking, this is not thorough enough. What prevents someone
> from onlining that same cpu again, at this point?
> So, IMHO, you need to wrap the contents of this function inside a
> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> from messing with CPU hotplug at the same time.

If I understand your comment by mistake, please let me know.
If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.

+	get_online_cpus()
+	if (cpu_online(pr->id)) {
+		ret = cpu_down(pr->id);
+		if (ret)
+			return ret;
+	}
+	put_online_cpus()

I think following patch can prevent it correctly. How about the patch?

---
 drivers/acpi/processor_driver.c |   12 ++++++++++++
 kernel/cpu.c                    |    8 +++++---
 2 files changed, 17 insertions(+), 3 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
+++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
@@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
 {
 	int ret;

+retry:
 	if (cpu_online(pr->id)) {
 		ret = cpu_down(pr->id);
 		if (ret)
 			return ret;
 	}

+	get_online_cpus();
+	/*
+	 * Someone might online the cpu again at this point. So we check that
+	 * cpu has been onlined or not. If cpu is online, we try to offline
+	 * the cpu again.
+	 */
+	if (cpu_online(pr->id)) {
+		put_online_cpus();
+		goto retry;
+	}
 	arch_unregister_cpu(pr->id);
 	acpi_unmap_lsapic(pr->id);
+	put_online_cpus();
 	return ret;
 }
 #else
Index: linux-3.5-rc4/kernel/cpu.c
===================================================================
--- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
+++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
@@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct task_struct *idle;

-	if (cpu_online(cpu) || !cpu_present(cpu))
-		return -EINVAL;
-
 	cpu_hotplug_begin();

+	if (cpu_online(cpu) || !cpu_present(cpu)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	idle = idle_thread_get(cpu);
 	if (IS_ERR(idle)) {
 		ret = PTR_ERR(idle);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-09  2:31     ` Yasuaki Ishimatsu
  (?)
@ 2012-07-09 11:25     ` Srivatsa S. Bhat
  2012-07-09 21:15       ` Toshi Kani
  2012-07-10  0:13         ` Yasuaki Ishimatsu
  -1 siblings, 2 replies; 21+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-09 11:25 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: linux-acpi, lenb, linux-kernel

On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> Thank you for your reviewing.
> 
> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>
>> Ouch!
>>
>>> But in this case, it should return error number since some process may run on
>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>> the system cannot work well.
>>>
>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>
>>> ---
>>>   drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>   1 file changed, 12 insertions(+), 6 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>   static int acpi_processor_remove(struct acpi_device *device, int type)
>>>   {
>>>   	struct acpi_processor *pr = NULL;
>>> -
>>> +	int ret;
>>>
>>>   	if (!device || !acpi_driver_data(device))
>>>   		return -EINVAL;
>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>   		goto free;
>>>
>>>   	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>> -		if (acpi_processor_handle_eject(pr))
>>> -			return -EINVAL;
>>> +		ret = acpi_processor_handle_eject(pr);
>>> +		if (ret)
>>> +			return ret;
>>>   	}
>>>
>>>   	acpi_processor_power_exit(pr, device);
>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>
>>>   static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>   {
>>> -	if (cpu_online(pr->id))
>>> -		cpu_down(pr->id);
>>> +	int ret;
>>> +
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>>
>>
>> Strictly speaking, this is not thorough enough. What prevents someone
>> from onlining that same cpu again, at this point?
>> So, IMHO, you need to wrap the contents of this function inside a
>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>> from messing with CPU hotplug at the same time.
> 
> If I understand your comment by mistake, please let me know.
> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
> 

You are right. Sorry, I overlooked that.

> +	get_online_cpus()
> +	if (cpu_online(pr->id)) {
> +		ret = cpu_down(pr->id);
> +		if (ret)
> +			return ret;
> +	}
> +	put_online_cpus()
> 
> I think following patch can prevent it correctly. How about the patch?
>
> ---
>  drivers/acpi/processor_driver.c |   12 ++++++++++++
>  kernel/cpu.c                    |    8 +++++---
>  2 files changed, 17 insertions(+), 3 deletions(-)
> 
> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> ===================================================================
> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>  {
>  	int ret;
> 
> +retry:
>  	if (cpu_online(pr->id)) {
>  		ret = cpu_down(pr->id);
>  		if (ret)
>  			return ret;
>  	}
> 
> +	get_online_cpus();
> +	/*
> +	 * Someone might online the cpu again at this point. So we check that
> +	 * cpu has been onlined or not. If cpu is online, we try to offline
> +	 * the cpu again.
> +	 */
> +	if (cpu_online(pr->id)) {

How about this:
	if (unlikely(cpu_online(pr->id)) {
since the probability of this happening is quite small...

> +		put_online_cpus();
> +		goto retry;
> +	}
>  	arch_unregister_cpu(pr->id);
>  	acpi_unmap_lsapic(pr->id);
> +	put_online_cpus();
>  	return ret;
>  }

This retry logic doesn't look elegant, but I don't see any better method :-(

>  #else
> Index: linux-3.5-rc4/kernel/cpu.c
> ===================================================================
> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>  	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>  	struct task_struct *idle;
> 
> -	if (cpu_online(cpu) || !cpu_present(cpu))
> -		return -EINVAL;
> -
>  	cpu_hotplug_begin();
> 
> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +

Firstly, why is this change needed?
Secondly, if the change is indeed an improvement, then why is it
in _this_ patch? IMHO, in that case it should be part of a separate patch.

Coming back to my first point, I don't see why this hunk is needed. We
already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
checking the status of the cpu (online or present). And all hotplug
operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
lock. Isn't that enough? Or am I missing something?

>  	idle = idle_thread_get(cpu);
>  	if (IS_ERR(idle)) {
>  		ret = PTR_ERR(idle);
> 
 
Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-09 11:25     ` Srivatsa S. Bhat
@ 2012-07-09 21:15       ` Toshi Kani
  2012-07-10  4:57           ` Yasuaki Ishimatsu
  2012-07-10  0:13         ` Yasuaki Ishimatsu
  1 sibling, 1 reply; 21+ messages in thread
From: Toshi Kani @ 2012-07-09 21:15 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: Yasuaki Ishimatsu, linux-acpi, lenb, linux-kernel

On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
> > Hi Srivatsa,
> > 
> > Thank you for your reviewing.
> > 
> > 2012/07/06 18:51, Srivatsa S. Bhat wrote:
> >> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
> >>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
> >>
> >> Ouch!
> >>
> >>> But in this case, it should return error number since some process may run on
> >>> the cpu. If the cpu has a running process and the cpu is turned the power off,
> >>> the system cannot work well.
> >>>
> >>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> >>>
> >>> ---
> >>>   drivers/acpi/processor_driver.c |   18 ++++++++++++------
> >>>   1 file changed, 12 insertions(+), 6 deletions(-)
> >>>
> >>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>> ===================================================================
> >>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
> >>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
> >>> @@ -610,7 +610,7 @@ err_free_pr:
> >>>   static int acpi_processor_remove(struct acpi_device *device, int type)
> >>>   {
> >>>   	struct acpi_processor *pr = NULL;
> >>> -
> >>> +	int ret;
> >>>
> >>>   	if (!device || !acpi_driver_data(device))
> >>>   		return -EINVAL;
> >>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
> >>>   		goto free;
> >>>
> >>>   	if (type == ACPI_BUS_REMOVAL_EJECT) {
> >>> -		if (acpi_processor_handle_eject(pr))
> >>> -			return -EINVAL;
> >>> +		ret = acpi_processor_handle_eject(pr);
> >>> +		if (ret)
> >>> +			return ret;
> >>>   	}
> >>>
> >>>   	acpi_processor_power_exit(pr, device);
> >>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
> >>>
> >>>   static int acpi_processor_handle_eject(struct acpi_processor *pr)
> >>>   {
> >>> -	if (cpu_online(pr->id))
> >>> -		cpu_down(pr->id);
> >>> +	int ret;
> >>> +
> >>> +	if (cpu_online(pr->id)) {
> >>> +		ret = cpu_down(pr->id);
> >>> +		if (ret)
> >>> +			return ret;
> >>> +	}
> >>>
> >>
> >> Strictly speaking, this is not thorough enough. What prevents someone
> >> from onlining that same cpu again, at this point?
> >> So, IMHO, you need to wrap the contents of this function inside a
> >> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> >> from messing with CPU hotplug at the same time.
> > 
> > If I understand your comment by mistake, please let me know.
> > If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
> > as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
> > cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
> > 
> 
> You are right. Sorry, I overlooked that.
> 
> > +	get_online_cpus()
> > +	if (cpu_online(pr->id)) {
> > +		ret = cpu_down(pr->id);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +	put_online_cpus()
> > 
> > I think following patch can prevent it correctly. How about the patch?
> >
> > ---
> >  drivers/acpi/processor_driver.c |   12 ++++++++++++
> >  kernel/cpu.c                    |    8 +++++---
> >  2 files changed, 17 insertions(+), 3 deletions(-)
> > 
> > Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> > ===================================================================
> > --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
> > +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
> > @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
> >  {
> >  	int ret;
> > 
> > +retry:
> >  	if (cpu_online(pr->id)) {
> >  		ret = cpu_down(pr->id);
> >  		if (ret)
> >  			return ret;
> >  	}
> > 
> > +	get_online_cpus();
> > +	/*
> > +	 * Someone might online the cpu again at this point. So we check that
> > +	 * cpu has been onlined or not. If cpu is online, we try to offline
> > +	 * the cpu again.
> > +	 */
> > +	if (cpu_online(pr->id)) {
> 
> How about this:
> 	if (unlikely(cpu_online(pr->id)) {
> since the probability of this happening is quite small...
> 
> > +		put_online_cpus();
> > +		goto retry;
> > +	}
> >  	arch_unregister_cpu(pr->id);
> >  	acpi_unmap_lsapic(pr->id);
> > +	put_online_cpus();
> >  	return ret;
> >  }
> 
> This retry logic doesn't look elegant, but I don't see any better method :-(

Another possible option is to fail the request instead of retrying it.

It would be quite challenging to allow on-lining and off-lining
operations to run concurrently.  In fact, even if we close this window,
there is yet another window right after the new put_online_cpus() call.
This CPU may become online before calling _EJ0 in the case of
hot-remove.

This goes beyond the scope of this patch, but IMHO, we should serialize
in the request level.  That is, a new on-lining request should not be
allowed to proceed until the current request is complete.  This scheme
only allows a single operation at a time per OS instance, but I do not
think it is a big issue.

Serializing in the request level is also important when supporting
container hot-remove, which can remove multiple children objects under a
parent container object.  For instance, a Node hot-remove may also
remove multiple processors underneath of it.  This kind of the
operations has to make sure that all children objects are remained
off-line until it ejects the parent object.  

Thanks,
-Toshi



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-09 11:25     ` Srivatsa S. Bhat
@ 2012-07-10  0:13         ` Yasuaki Ishimatsu
  2012-07-10  0:13         ` Yasuaki Ishimatsu
  1 sibling, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  0:13 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-acpi, lenb, linux-kernel

Hi Srivatsa,

2012/07/09 20:25, Srivatsa S. Bhat wrote:
> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>> Hi Srivatsa,
>>
>> Thank you for your reviewing.
>>
>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>
>>> Ouch!
>>>
>>>> But in this case, it should return error number since some process may run on
>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>> the system cannot work well.
>>>>
>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>
>>>> ---
>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>
>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>    {
>>>>    	struct acpi_processor *pr = NULL;
>>>> -
>>>> +	int ret;
>>>>
>>>>    	if (!device || !acpi_driver_data(device))
>>>>    		return -EINVAL;
>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>    		goto free;
>>>>
>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>> -		if (acpi_processor_handle_eject(pr))
>>>> -			return -EINVAL;
>>>> +		ret = acpi_processor_handle_eject(pr);
>>>> +		if (ret)
>>>> +			return ret;
>>>>    	}
>>>>
>>>>    	acpi_processor_power_exit(pr, device);
>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>
>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>    {
>>>> -	if (cpu_online(pr->id))
>>>> -		cpu_down(pr->id);
>>>> +	int ret;
>>>> +
>>>> +	if (cpu_online(pr->id)) {
>>>> +		ret = cpu_down(pr->id);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>>
>>>
>>> Strictly speaking, this is not thorough enough. What prevents someone
>>> from onlining that same cpu again, at this point?
>>> So, IMHO, you need to wrap the contents of this function inside a
>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>> from messing with CPU hotplug at the same time.
>>
>> If I understand your comment by mistake, please let me know.
>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>
> 
> You are right. Sorry, I overlooked that.
> 
>> +	get_online_cpus()
>> +	if (cpu_online(pr->id)) {
>> +		ret = cpu_down(pr->id);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	put_online_cpus()
>>
>> I think following patch can prevent it correctly. How about the patch?
>>
>> ---
>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>   kernel/cpu.c                    |    8 +++++---
>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>
>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>   {
>>   	int ret;
>>
>> +retry:
>>   	if (cpu_online(pr->id)) {
>>   		ret = cpu_down(pr->id);
>>   		if (ret)
>>   			return ret;
>>   	}
>>
>> +	get_online_cpus();
>> +	/*
>> +	 * Someone might online the cpu again at this point. So we check that
>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>> +	 * the cpu again.
>> +	 */
>> +	if (cpu_online(pr->id)) {
> 
> How about this:
> 	if (unlikely(cpu_online(pr->id)) {
> since the probability of this happening is quite small...

Thanks. I'll update it.

>> +		put_online_cpus();
>> +		goto retry;
>> +	}
>>   	arch_unregister_cpu(pr->id);
>>   	acpi_unmap_lsapic(pr->id);
>> +	put_online_cpus();
>>   	return ret;
>>   }
> 
> This retry logic doesn't look elegant, but I don't see any better method :-(
> 
>>   #else
>> Index: linux-3.5-rc4/kernel/cpu.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>   	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>   	struct task_struct *idle;
>>
>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>> -		return -EINVAL;
>> -
>>   	cpu_hotplug_begin();
>>
>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
> 
> Firstly, why is this change needed?

I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
there is the following race.

hot-remove cpu                         |  _cpu_up()
------------------------------------- ------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
     call put_online_cpus()            |
                                       | start and continue _cpu_up()
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

So _cpu_up() can continue to itself. And hot-remove cpu can also continue
itself. If I change it, I think the race disappears as below:

hot-remove cpu                         | _cpu_up()
-----------------------------------------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
          cpu's cpu_present is set     |
	  to false by set_cpu_present()|
     call put_online_cpus()            |
                                       | start _cpu_up()
				       | check cpu_present() and return -EINVAL
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

Thus I think the change is necessary.

Thanks,
Yasuaki Ishimatsu

> Secondly, if the change is indeed an improvement, then why is it
> in _this_ patch? IMHO, in that case it should be part of a separate patch.
> 
> Coming back to my first point, I don't see why this hunk is needed. We
> already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
> checking the status of the cpu (online or present). And all hotplug
> operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
> lock. Isn't that enough? Or am I missing something?
> 
>>   	idle = idle_thread_get(cpu);
>>   	if (IS_ERR(idle)) {
>>   		ret = PTR_ERR(idle);
>>
>   
> Regards,
> Srivatsa S. Bhat
> 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
@ 2012-07-10  0:13         ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  0:13 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-acpi, lenb, linux-kernel

Hi Srivatsa,

2012/07/09 20:25, Srivatsa S. Bhat wrote:
> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>> Hi Srivatsa,
>>
>> Thank you for your reviewing.
>>
>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>
>>> Ouch!
>>>
>>>> But in this case, it should return error number since some process may run on
>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>> the system cannot work well.
>>>>
>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>
>>>> ---
>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>
>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>    {
>>>>    	struct acpi_processor *pr = NULL;
>>>> -
>>>> +	int ret;
>>>>
>>>>    	if (!device || !acpi_driver_data(device))
>>>>    		return -EINVAL;
>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>    		goto free;
>>>>
>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>> -		if (acpi_processor_handle_eject(pr))
>>>> -			return -EINVAL;
>>>> +		ret = acpi_processor_handle_eject(pr);
>>>> +		if (ret)
>>>> +			return ret;
>>>>    	}
>>>>
>>>>    	acpi_processor_power_exit(pr, device);
>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>
>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>    {
>>>> -	if (cpu_online(pr->id))
>>>> -		cpu_down(pr->id);
>>>> +	int ret;
>>>> +
>>>> +	if (cpu_online(pr->id)) {
>>>> +		ret = cpu_down(pr->id);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>>
>>>
>>> Strictly speaking, this is not thorough enough. What prevents someone
>>> from onlining that same cpu again, at this point?
>>> So, IMHO, you need to wrap the contents of this function inside a
>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>> from messing with CPU hotplug at the same time.
>>
>> If I understand your comment by mistake, please let me know.
>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>
> 
> You are right. Sorry, I overlooked that.
> 
>> +	get_online_cpus()
>> +	if (cpu_online(pr->id)) {
>> +		ret = cpu_down(pr->id);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	put_online_cpus()
>>
>> I think following patch can prevent it correctly. How about the patch?
>>
>> ---
>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>   kernel/cpu.c                    |    8 +++++---
>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>
>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>   {
>>   	int ret;
>>
>> +retry:
>>   	if (cpu_online(pr->id)) {
>>   		ret = cpu_down(pr->id);
>>   		if (ret)
>>   			return ret;
>>   	}
>>
>> +	get_online_cpus();
>> +	/*
>> +	 * Someone might online the cpu again at this point. So we check that
>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>> +	 * the cpu again.
>> +	 */
>> +	if (cpu_online(pr->id)) {
> 
> How about this:
> 	if (unlikely(cpu_online(pr->id)) {
> since the probability of this happening is quite small...

Thanks. I'll update it.

>> +		put_online_cpus();
>> +		goto retry;
>> +	}
>>   	arch_unregister_cpu(pr->id);
>>   	acpi_unmap_lsapic(pr->id);
>> +	put_online_cpus();
>>   	return ret;
>>   }
> 
> This retry logic doesn't look elegant, but I don't see any better method :-(
> 
>>   #else
>> Index: linux-3.5-rc4/kernel/cpu.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>   	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>   	struct task_struct *idle;
>>
>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>> -		return -EINVAL;
>> -
>>   	cpu_hotplug_begin();
>>
>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
> 
> Firstly, why is this change needed?

I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
there is the following race.

hot-remove cpu                         |  _cpu_up()
------------------------------------- ------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
     call put_online_cpus()            |
                                       | start and continue _cpu_up()
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

So _cpu_up() can continue to itself. And hot-remove cpu can also continue
itself. If I change it, I think the race disappears as below:

hot-remove cpu                         | _cpu_up()
-----------------------------------------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
          cpu's cpu_present is set     |
	  to false by set_cpu_present()|
     call put_online_cpus()            |
                                       | start _cpu_up()
				       | check cpu_present() and return -EINVAL
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

Thus I think the change is necessary.

Thanks,
Yasuaki Ishimatsu

> Secondly, if the change is indeed an improvement, then why is it
> in _this_ patch? IMHO, in that case it should be part of a separate patch.
> 
> Coming back to my first point, I don't see why this hunk is needed. We
> already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
> checking the status of the cpu (online or present). And all hotplug
> operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
> lock. Isn't that enough? Or am I missing something?
> 
>>   	idle = idle_thread_get(cpu);
>>   	if (IS_ERR(idle)) {
>>   		ret = PTR_ERR(idle);
>>
>   
> Regards,
> Srivatsa S. Bhat
> 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-09 21:15       ` Toshi Kani
@ 2012-07-10  4:57           ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  4:57 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Srivatsa S. Bhat, linux-acpi, lenb, linux-kernel

Hi Toshi,

2012/07/10 6:15, Toshi Kani wrote:
> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>    {
>>>>>    	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>    	if (!device || !acpi_driver_data(device))
>>>>>    		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>    		goto free;
>>>>>
>>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>    	}
>>>>>
>>>>>    	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>    {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>   kernel/cpu.c                    |    8 +++++---
>>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>   {
>>>   	int ret;
>>>
>>> +retry:
>>>   	if (cpu_online(pr->id)) {
>>>   		ret = cpu_down(pr->id);
>>>   		if (ret)
>>>   			return ret;
>>>   	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
>>
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>   	arch_unregister_cpu(pr->id);
>>>   	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>   	return ret;
>>>   }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>
> Another possible option is to fail the request instead of retrying it.

Good idea!! I'll update it.

>
> It would be quite challenging to allow on-lining and off-lining
> operations to run concurrently.  In fact, even if we close this window,
> there is yet another window right after the new put_online_cpus() call.

I think if we close the window, another window does not open.
acpi_unmap_lsapic() sets cpu_present mask to false before new put_online_cpus()
is called. So even if _cpu_up() is called, the function returns -EINAVL by
following added code.

@@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
  	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
  	struct task_struct *idle;

-	if (cpu_online(cpu) || !cpu_present(cpu))
-		return -EINVAL;
-
  	cpu_hotplug_begin();

+	if (cpu_online(cpu) || !cpu_present(cpu)) {
+		ret = -EINVAL;
+		goto out;
+	}
+

Thanks,
Yasuaki Ishimatsu

> This CPU may become online before calling _EJ0 in the case of
> hot-remove.
>
> This goes beyond the scope of this patch, but IMHO, we should serialize
> in the request level.  That is, a new on-lining request should not be
> allowed to proceed until the current request is complete.  This scheme
> only allows a single operation at a time per OS instance, but I do not
> think it is a big issue.
>
> Serializing in the request level is also important when supporting
> container hot-remove, which can remove multiple children objects under a
> parent container object.  For instance, a Node hot-remove may also
> remove multiple processors underneath of it.  This kind of the
> operations has to make sure that all children objects are remained
> off-line until it ejects the parent object.
>
> Thanks,
> -Toshi
>
>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
@ 2012-07-10  4:57           ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  4:57 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Srivatsa S. Bhat, linux-acpi, lenb, linux-kernel

Hi Toshi,

2012/07/10 6:15, Toshi Kani wrote:
> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>    {
>>>>>    	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>    	if (!device || !acpi_driver_data(device))
>>>>>    		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>    		goto free;
>>>>>
>>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>    	}
>>>>>
>>>>>    	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>    {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>   kernel/cpu.c                    |    8 +++++---
>>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>   {
>>>   	int ret;
>>>
>>> +retry:
>>>   	if (cpu_online(pr->id)) {
>>>   		ret = cpu_down(pr->id);
>>>   		if (ret)
>>>   			return ret;
>>>   	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
>>
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>   	arch_unregister_cpu(pr->id);
>>>   	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>   	return ret;
>>>   }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>
> Another possible option is to fail the request instead of retrying it.

Good idea!! I'll update it.

>
> It would be quite challenging to allow on-lining and off-lining
> operations to run concurrently.  In fact, even if we close this window,
> there is yet another window right after the new put_online_cpus() call.

I think if we close the window, another window does not open.
acpi_unmap_lsapic() sets cpu_present mask to false before new put_online_cpus()
is called. So even if _cpu_up() is called, the function returns -EINAVL by
following added code.

@@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
  	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
  	struct task_struct *idle;

-	if (cpu_online(cpu) || !cpu_present(cpu))
-		return -EINVAL;
-
  	cpu_hotplug_begin();

+	if (cpu_online(cpu) || !cpu_present(cpu)) {
+		ret = -EINVAL;
+		goto out;
+	}
+

Thanks,
Yasuaki Ishimatsu

> This CPU may become online before calling _EJ0 in the case of
> hot-remove.
>
> This goes beyond the scope of this patch, but IMHO, we should serialize
> in the request level.  That is, a new on-lining request should not be
> allowed to proceed until the current request is complete.  This scheme
> only allows a single operation at a time per OS instance, but I do not
> think it is a big issue.
>
> Serializing in the request level is also important when supporting
> container hot-remove, which can remove multiple children objects under a
> parent container object.  For instance, a Node hot-remove may also
> remove multiple processors underneath of it.  This kind of the
> operations has to make sure that all children objects are remained
> off-line until it ejects the parent object.
>
> Thanks,
> -Toshi
>
>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-10  0:13         ` Yasuaki Ishimatsu
@ 2012-07-10  5:14           ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  5:14 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-acpi, lenb, linux-kernel

Hi Srivatsa,

2012/07/10 9:13, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> 2012/07/09 20:25, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>     static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>     {
>>>>>     	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>     	if (!device || !acpi_driver_data(device))
>>>>>     		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>     		goto free;
>>>>>
>>>>>     	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>     	}
>>>>>
>>>>>     	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>     {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>    kernel/cpu.c                    |    8 +++++---
>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>    {
>>>    	int ret;
>>>
>>> +retry:
>>>    	if (cpu_online(pr->id)) {
>>>    		ret = cpu_down(pr->id);
>>>    		if (ret)
>>>    			return ret;
>>>    	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
> 
> Thanks. I'll update it.
> 
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>    	arch_unregister_cpu(pr->id);
>>>    	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>    	return ret;
>>>    }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>>
>>>    #else
>>> Index: linux-3.5-rc4/kernel/cpu.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>>    	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>>    	struct task_struct *idle;
>>>
>>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>>> -		return -EINVAL;
>>> -
>>>    	cpu_hotplug_begin();
>>>
>>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>
>> Firstly, why is this change needed?
> 
> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
> there is the following race.
> 
> hot-remove cpu                         |  _cpu_up()
> ------------------------------------- ------------------------------------
> call acpi_processor_handle_eject()     |
>       call cpu_down()                   |
>       call get_online_cpus()            |
>                                         | call cpu_hotplug_begin() and stop here
>       call arch_unregister_cpu()        |
>       call acpi_unmap_lsapic()          |
>       call put_online_cpus()            |
>                                         | start and continue _cpu_up()
>       return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> So _cpu_up() can continue to itself. And hot-remove cpu can also continue
> itself. If I change it, I think the race disappears as below:
> 
> hot-remove cpu                         | _cpu_up()
> -----------------------------------------------------------------------
> call acpi_processor_handle_eject()     |
>       call cpu_down()                   |
>       call get_online_cpus()            |
>                                         | call cpu_hotplug_begin() and stop here
>       call arch_unregister_cpu()        |
>       call acpi_unmap_lsapic()          |
>            cpu's cpu_present is set     |
> 	  to false by set_cpu_present()|
>       call put_online_cpus()            |
>                                         | start _cpu_up()
> 				       | check cpu_present() and return -EINVAL
>       return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> Thus I think the change is necessary.
> 
> Thanks,
> Yasuaki Ishimatsu
> 
>> Secondly, if the change is indeed an improvement, then why is it
>> in _this_ patch? IMHO, in that case it should be part of a separate patch.

I forget to answer the question.
As I answered in the above your first question, the fix is related to
acpi_processor_handle_eject(). So the fix should be in the patch.

Thanks,
Yasuaki Ishimatsu

>>
>> Coming back to my first point, I don't see why this hunk is needed. We
>> already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
>> checking the status of the cpu (online or present). And all hotplug
>> operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
>> lock. Isn't that enough? Or am I missing something?
>>
>>>    	idle = idle_thread_get(cpu);
>>>    	if (IS_ERR(idle)) {
>>>    		ret = PTR_ERR(idle);
>>>
>>    
>> Regards,
>> Srivatsa S. Bhat
>>
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
@ 2012-07-10  5:14           ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  5:14 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-acpi, lenb, linux-kernel

Hi Srivatsa,

2012/07/10 9:13, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> 2012/07/09 20:25, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>     static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>     {
>>>>>     	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>     	if (!device || !acpi_driver_data(device))
>>>>>     		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>     		goto free;
>>>>>
>>>>>     	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>     	}
>>>>>
>>>>>     	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>     {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>    kernel/cpu.c                    |    8 +++++---
>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>    {
>>>    	int ret;
>>>
>>> +retry:
>>>    	if (cpu_online(pr->id)) {
>>>    		ret = cpu_down(pr->id);
>>>    		if (ret)
>>>    			return ret;
>>>    	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
> 
> Thanks. I'll update it.
> 
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>    	arch_unregister_cpu(pr->id);
>>>    	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>    	return ret;
>>>    }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>>
>>>    #else
>>> Index: linux-3.5-rc4/kernel/cpu.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>>    	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>>    	struct task_struct *idle;
>>>
>>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>>> -		return -EINVAL;
>>> -
>>>    	cpu_hotplug_begin();
>>>
>>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>
>> Firstly, why is this change needed?
> 
> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
> there is the following race.
> 
> hot-remove cpu                         |  _cpu_up()
> ------------------------------------- ------------------------------------
> call acpi_processor_handle_eject()     |
>       call cpu_down()                   |
>       call get_online_cpus()            |
>                                         | call cpu_hotplug_begin() and stop here
>       call arch_unregister_cpu()        |
>       call acpi_unmap_lsapic()          |
>       call put_online_cpus()            |
>                                         | start and continue _cpu_up()
>       return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> So _cpu_up() can continue to itself. And hot-remove cpu can also continue
> itself. If I change it, I think the race disappears as below:
> 
> hot-remove cpu                         | _cpu_up()
> -----------------------------------------------------------------------
> call acpi_processor_handle_eject()     |
>       call cpu_down()                   |
>       call get_online_cpus()            |
>                                         | call cpu_hotplug_begin() and stop here
>       call arch_unregister_cpu()        |
>       call acpi_unmap_lsapic()          |
>            cpu's cpu_present is set     |
> 	  to false by set_cpu_present()|
>       call put_online_cpus()            |
>                                         | start _cpu_up()
> 				       | check cpu_present() and return -EINVAL
>       return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> Thus I think the change is necessary.
> 
> Thanks,
> Yasuaki Ishimatsu
> 
>> Secondly, if the change is indeed an improvement, then why is it
>> in _this_ patch? IMHO, in that case it should be part of a separate patch.

I forget to answer the question.
As I answered in the above your first question, the fix is related to
acpi_processor_handle_eject(). So the fix should be in the patch.

Thanks,
Yasuaki Ishimatsu

>>
>> Coming back to my first point, I don't see why this hunk is needed. We
>> already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
>> checking the status of the cpu (online or present). And all hotplug
>> operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
>> lock. Isn't that enough? Or am I missing something?
>>
>>>    	idle = idle_thread_get(cpu);
>>>    	if (IS_ERR(idle)) {
>>>    		ret = PTR_ERR(idle);
>>>
>>    
>> Regards,
>> Srivatsa S. Bhat
>>
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-10  0:13         ` Yasuaki Ishimatsu
  (?)
  (?)
@ 2012-07-10  6:51         ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 21+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-10  6:51 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: linux-acpi, lenb, linux-kernel

On 07/10/2012 05:43 AM, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> 2012/07/09 20:25, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>    {
>>>>>    	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>    	if (!device || !acpi_driver_data(device))
>>>>>    		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>    		goto free;
>>>>>
>>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>    	}
>>>>>
>>>>>    	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>    {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>   kernel/cpu.c                    |    8 +++++---
>>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>   {
>>>   	int ret;
>>>
>>> +retry:
>>>   	if (cpu_online(pr->id)) {
>>>   		ret = cpu_down(pr->id);
>>>   		if (ret)
>>>   			return ret;
>>>   	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
> 
> Thanks. I'll update it.
> 
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>   	arch_unregister_cpu(pr->id);
>>>   	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>   	return ret;
>>>   }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>>
>>>   #else
>>> Index: linux-3.5-rc4/kernel/cpu.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>>   	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>>   	struct task_struct *idle;
>>>
>>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>>> -		return -EINVAL;
>>> -
>>>   	cpu_hotplug_begin();
>>>
>>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>
>> Firstly, why is this change needed?
> 
> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
> there is the following race.
> 
> hot-remove cpu                         |  _cpu_up()
> ------------------------------------- ------------------------------------
> call acpi_processor_handle_eject()     |
>      call cpu_down()                   |
>      call get_online_cpus()            |
>                                        | call cpu_hotplug_begin() and stop here
>      call arch_unregister_cpu()        |
>      call acpi_unmap_lsapic()          |
>      call put_online_cpus()            |
>                                        | start and continue _cpu_up()
>      return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> So _cpu_up() can continue to itself. And hot-remove cpu can also continue
> itself. If I change it, I think the race disappears as below:
> 
> hot-remove cpu                         | _cpu_up()
> -----------------------------------------------------------------------
> call acpi_processor_handle_eject()     |
>      call cpu_down()                   |
>      call get_online_cpus()            |
>                                        | call cpu_hotplug_begin() and stop here
>      call arch_unregister_cpu()        |
>      call acpi_unmap_lsapic()          |
>           cpu's cpu_present is set     |
> 	  to false by set_cpu_present()|
>      call put_online_cpus()            |
>                                        | start _cpu_up()
> 				       | check cpu_present() and return -EINVAL
>      return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> Thus I think the change is necessary.
> 

Thanks for the detailed explanation. I had missed this race condition.
Now I see why all the changes in your patch are needed. 

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-10  5:14           ` Yasuaki Ishimatsu
  (?)
@ 2012-07-10  6:52           ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 21+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-10  6:52 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: linux-acpi, lenb, linux-kernel

On 07/10/2012 10:44 AM, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> 2012/07/10 9:13, Yasuaki Ishimatsu wrote:
>> Hi Srivatsa,
>>
>> 2012/07/09 20:25, Srivatsa S. Bhat wrote:
>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>>> Hi Srivatsa,
>>>>
>>>> Thank you for your reviewing.
>>>>
>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>>
>>>>> Ouch!
>>>>>
>>>>>> But in this case, it should return error number since some process may run on
>>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>>> the system cannot work well.
>>>>>>
>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>
>>>>>> ---
>>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>>     static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>>     {
>>>>>>     	struct acpi_processor *pr = NULL;
>>>>>> -
>>>>>> +	int ret;
>>>>>>
>>>>>>     	if (!device || !acpi_driver_data(device))
>>>>>>     		return -EINVAL;
>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>>     		goto free;
>>>>>>
>>>>>>     	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>>> -			return -EINVAL;
>>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>>> +		if (ret)
>>>>>> +			return ret;
>>>>>>     	}
>>>>>>
>>>>>>     	acpi_processor_power_exit(pr, device);
>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>>
>>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>>     {
>>>>>> -	if (cpu_online(pr->id))
>>>>>> -		cpu_down(pr->id);
>>>>>> +	int ret;
>>>>>> +
>>>>>> +	if (cpu_online(pr->id)) {
>>>>>> +		ret = cpu_down(pr->id);
>>>>>> +		if (ret)
>>>>>> +			return ret;
>>>>>> +	}
>>>>>>
>>>>>
>>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>>> from onlining that same cpu again, at this point?
>>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>>> from messing with CPU hotplug at the same time.
>>>>
>>>> If I understand your comment by mistake, please let me know.
>>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>>
>>>
>>> You are right. Sorry, I overlooked that.
>>>
>>>> +	get_online_cpus()
>>>> +	if (cpu_online(pr->id)) {
>>>> +		ret = cpu_down(pr->id);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>> +	put_online_cpus()
>>>>
>>>> I think following patch can prevent it correctly. How about the patch?
>>>>
>>>> ---
>>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>>    kernel/cpu.c                    |    8 +++++---
>>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>>
>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>>    {
>>>>    	int ret;
>>>>
>>>> +retry:
>>>>    	if (cpu_online(pr->id)) {
>>>>    		ret = cpu_down(pr->id);
>>>>    		if (ret)
>>>>    			return ret;
>>>>    	}
>>>>
>>>> +	get_online_cpus();
>>>> +	/*
>>>> +	 * Someone might online the cpu again at this point. So we check that
>>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>>> +	 * the cpu again.
>>>> +	 */
>>>> +	if (cpu_online(pr->id)) {
>>>
>>> How about this:
>>> 	if (unlikely(cpu_online(pr->id)) {
>>> since the probability of this happening is quite small...
>>
>> Thanks. I'll update it.
>>
>>>> +		put_online_cpus();
>>>> +		goto retry;
>>>> +	}
>>>>    	arch_unregister_cpu(pr->id);
>>>>    	acpi_unmap_lsapic(pr->id);
>>>> +	put_online_cpus();
>>>>    	return ret;
>>>>    }
>>>
>>> This retry logic doesn't look elegant, but I don't see any better method :-(
>>>
>>>>    #else
>>>> Index: linux-3.5-rc4/kernel/cpu.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>>>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>>>    	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>>>    	struct task_struct *idle;
>>>>
>>>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>>>> -		return -EINVAL;
>>>> -
>>>>    	cpu_hotplug_begin();
>>>>
>>>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>>>> +		ret = -EINVAL;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>
>>> Firstly, why is this change needed?
>>
>> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
>> there is the following race.
>>
>> hot-remove cpu                         |  _cpu_up()
>> ------------------------------------- ------------------------------------
>> call acpi_processor_handle_eject()     |
>>       call cpu_down()                   |
>>       call get_online_cpus()            |
>>                                         | call cpu_hotplug_begin() and stop here
>>       call arch_unregister_cpu()        |
>>       call acpi_unmap_lsapic()          |
>>       call put_online_cpus()            |
>>                                         | start and continue _cpu_up()
>>       return acpi_processor_remove()    |
>> continue hot-remove the cpu            |
>>
>> So _cpu_up() can continue to itself. And hot-remove cpu can also continue
>> itself. If I change it, I think the race disappears as below:
>>
>> hot-remove cpu                         | _cpu_up()
>> -----------------------------------------------------------------------
>> call acpi_processor_handle_eject()     |
>>       call cpu_down()                   |
>>       call get_online_cpus()            |
>>                                         | call cpu_hotplug_begin() and stop here
>>       call arch_unregister_cpu()        |
>>       call acpi_unmap_lsapic()          |
>>            cpu's cpu_present is set     |
>> 	  to false by set_cpu_present()|
>>       call put_online_cpus()            |
>>                                         | start _cpu_up()
>> 				       | check cpu_present() and return -EINVAL
>>       return acpi_processor_remove()    |
>> continue hot-remove the cpu            |
>>
>> Thus I think the change is necessary.
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>>> Secondly, if the change is indeed an improvement, then why is it
>>> in _this_ patch? IMHO, in that case it should be part of a separate patch.
> 
> I forget to answer the question.
> As I answered in the above your first question, the fix is related to
> acpi_processor_handle_eject(). So the fix should be in the patch.
>

Yep, got it now. Thanks!

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-10  4:57           ` Yasuaki Ishimatsu
  (?)
@ 2012-07-10  7:57           ` Srivatsa S. Bhat
  2012-07-10  8:23               ` Yasuaki Ishimatsu
  2012-07-10 16:32             ` Toshi Kani
  -1 siblings, 2 replies; 21+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-10  7:57 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: Toshi Kani, linux-acpi, lenb, linux-kernel

On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
> Hi Toshi,
> 
> 2012/07/10 6:15, Toshi Kani wrote:
>> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>>> Hi Srivatsa,
>>>>
>>>> Thank you for your reviewing.
>>>>
>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
>>>>>> remove the cpu.
>>>>>
>>>>> Ouch!
>>>>>
>>>>>> But in this case, it should return error number since some process
>>>>>> may run on
>>>>>> the cpu. If the cpu has a running process and the cpu is turned
>>>>>> the power off,
>>>>>> the system cannot work well.
>>>>>>
>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>
>>>>>> ---
>>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c   
>>>>>> 2012-06-25 04:53:04.000000000 +0900
>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-05
>>>>>> 21:02:58.711285382 +0900
>>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>>    static int acpi_processor_remove(struct acpi_device *device,
>>>>>> int type)
>>>>>>    {
>>>>>>        struct acpi_processor *pr = NULL;
>>>>>> -
>>>>>> +    int ret;
>>>>>>
>>>>>>        if (!device || !acpi_driver_data(device))
>>>>>>            return -EINVAL;
>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>>            goto free;
>>>>>>
>>>>>>        if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>>> -        if (acpi_processor_handle_eject(pr))
>>>>>> -            return -EINVAL;
>>>>>> +        ret = acpi_processor_handle_eject(pr);
>>>>>> +        if (ret)
>>>>>> +            return ret;
>>>>>>        }
>>>>>>
>>>>>>        acpi_processor_power_exit(pr, device);
>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>>
>>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>>    {
>>>>>> -    if (cpu_online(pr->id))
>>>>>> -        cpu_down(pr->id);
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    if (cpu_online(pr->id)) {
>>>>>> +        ret = cpu_down(pr->id);
>>>>>> +        if (ret)
>>>>>> +            return ret;
>>>>>> +    }
>>>>>>
>>>>>
>>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>>> from onlining that same cpu again, at this point?
>>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>>> from messing with CPU hotplug at the same time.
>>>>
>>>> If I understand your comment by mistake, please let me know.
>>>> If the contents is wrapped a inside
>>>> get_online_cpus()/put_online_cpus() block
>>>> as below, cpu_down() will stop since cpu_down() calls
>>>> cpu_hotplug_begin() and
>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>>
>>>
>>> You are right. Sorry, I overlooked that.
>>>
>>>> +    get_online_cpus()
>>>> +    if (cpu_online(pr->id)) {
>>>> +        ret = cpu_down(pr->id);
>>>> +        if (ret)
>>>> +            return ret;
>>>> +    }
>>>> +    put_online_cpus()
>>>>
>>>> I think following patch can prevent it correctly. How about the patch?
>>>>
>>>> ---
>>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>>   kernel/cpu.c                    |    8 +++++---
>>>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>>>
>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c    2012-07-09
>>>> 09:59:01.280211202 +0900
>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-09
>>>> 11:05:34.559859236 +0900
>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>>   {
>>>>       int ret;
>>>>
>>>> +retry:
>>>>       if (cpu_online(pr->id)) {
>>>>           ret = cpu_down(pr->id);
>>>>           if (ret)
>>>>               return ret;
>>>>       }
>>>>
>>>> +    get_online_cpus();
>>>> +    /*
>>>> +     * Someone might online the cpu again at this point. So we
>>>> check that
>>>> +     * cpu has been onlined or not. If cpu is online, we try to
>>>> offline
>>>> +     * the cpu again.
>>>> +     */
>>>> +    if (cpu_online(pr->id)) {
>>>
>>> How about this:
>>>     if (unlikely(cpu_online(pr->id)) {
>>> since the probability of this happening is quite small...
>>>
>>>> +        put_online_cpus();
>>>> +        goto retry;
>>>> +    }
>>>>       arch_unregister_cpu(pr->id);
>>>>       acpi_unmap_lsapic(pr->id);
>>>> +    put_online_cpus();
>>>>       return ret;
>>>>   }
>>>
>>> This retry logic doesn't look elegant, but I don't see any better
>>> method :-(
>>
>> Another possible option is to fail the request instead of retrying it.
> 
> Good idea!! I'll update it.
> 
>>
>> It would be quite challenging to allow on-lining and off-lining
>> operations to run concurrently.  In fact, even if we close this window,
>> there is yet another window right after the new put_online_cpus() call.
> 
> I think if we close the window, another window does not open.
> acpi_unmap_lsapic() sets cpu_present mask to false before new
> put_online_cpus()
> is called. So even if _cpu_up() is called, the function returns -EINAVL by
> following added code.
> 
> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>      unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>      struct task_struct *idle;
> 
> -    if (cpu_online(cpu) || !cpu_present(cpu))
> -        return -EINVAL;
> -
>      cpu_hotplug_begin();
> 
> +    if (cpu_online(cpu) || !cpu_present(cpu)) {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +
> 

Right. Yasuaki's patch will ensure that there are no more race conditions
because it does the cpu_present() check after taking the cpu_hotplug.lock.
So I think it is safe and still doable from the kernel's perspective.

But the question is, "should we do it?" I think Toshi's suggestion of failing
the hot-remove request (if we find that the cpu has been onlined again by some
other task) sounds like a good idea for another reason: cpu hotplug is not
initiated by the OS by itself; its requested by the user; so if something is
onlining the cpu back again, the user better take a second look and decide
what exactly he wants to do with that cpu - whether keep it online or
hot-remove it out.

Trying to online as well as hot-remove the same cpu simultaneously sounds like
a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
(ie., failing that request) would give a warning to the user and a chance to
reflect upon what exactly he wants to do with the cpu.

So, IMHO, we should protect against the race condition (between cpu_up and
hot-remove) but choose to fail the hot-remove request, and add a comment saying
why we chose to fail the request, even though we could have gone ahead and
completed it.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-10  7:57           ` Srivatsa S. Bhat
@ 2012-07-10  8:23               ` Yasuaki Ishimatsu
  2012-07-10 16:32             ` Toshi Kani
  1 sibling, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  8:23 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: Toshi Kani, linux-acpi, lenb, linux-kernel

Hi Srivatsa,

2012/07/10 16:57, Srivatsa S. Bhat wrote:
> On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
>> Hi Toshi,
>>
>> 2012/07/10 6:15, Toshi Kani wrote:
>>> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>>>> Hi Srivatsa,
>>>>>
>>>>> Thank you for your reviewing.
>>>>>
>>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
>>>>>>> remove the cpu.
>>>>>>
>>>>>> Ouch!
>>>>>>
>>>>>>> But in this case, it should return error number since some process
>>>>>>> may run on
>>>>>>> the cpu. If the cpu has a running process and the cpu is turned
>>>>>>> the power off,
>>>>>>> the system cannot work well.
>>>>>>>
>>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>>
>>>>>>> ---
>>>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c
>>>>>>> 2012-06-25 04:53:04.000000000 +0900
>>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-05
>>>>>>> 21:02:58.711285382 +0900
>>>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>>>     static int acpi_processor_remove(struct acpi_device *device,
>>>>>>> int type)
>>>>>>>     {
>>>>>>>         struct acpi_processor *pr = NULL;
>>>>>>> -
>>>>>>> +    int ret;
>>>>>>>
>>>>>>>         if (!device || !acpi_driver_data(device))
>>>>>>>             return -EINVAL;
>>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>>>             goto free;
>>>>>>>
>>>>>>>         if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>>>> -        if (acpi_processor_handle_eject(pr))
>>>>>>> -            return -EINVAL;
>>>>>>> +        ret = acpi_processor_handle_eject(pr);
>>>>>>> +        if (ret)
>>>>>>> +            return ret;
>>>>>>>         }
>>>>>>>
>>>>>>>         acpi_processor_power_exit(pr, device);
>>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>>>
>>>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>>>     {
>>>>>>> -    if (cpu_online(pr->id))
>>>>>>> -        cpu_down(pr->id);
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    if (cpu_online(pr->id)) {
>>>>>>> +        ret = cpu_down(pr->id);
>>>>>>> +        if (ret)
>>>>>>> +            return ret;
>>>>>>> +    }
>>>>>>>
>>>>>>
>>>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>>>> from onlining that same cpu again, at this point?
>>>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>>>> from messing with CPU hotplug at the same time.
>>>>>
>>>>> If I understand your comment by mistake, please let me know.
>>>>> If the contents is wrapped a inside
>>>>> get_online_cpus()/put_online_cpus() block
>>>>> as below, cpu_down() will stop since cpu_down() calls
>>>>> cpu_hotplug_begin() and
>>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>>>
>>>>
>>>> You are right. Sorry, I overlooked that.
>>>>
>>>>> +    get_online_cpus()
>>>>> +    if (cpu_online(pr->id)) {
>>>>> +        ret = cpu_down(pr->id);
>>>>> +        if (ret)
>>>>> +            return ret;
>>>>> +    }
>>>>> +    put_online_cpus()
>>>>>
>>>>> I think following patch can prevent it correctly. How about the patch?
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>>>    kernel/cpu.c                    |    8 +++++---
>>>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c    2012-07-09
>>>>> 09:59:01.280211202 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-09
>>>>> 11:05:34.559859236 +0900
>>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>>>    {
>>>>>        int ret;
>>>>>
>>>>> +retry:
>>>>>        if (cpu_online(pr->id)) {
>>>>>            ret = cpu_down(pr->id);
>>>>>            if (ret)
>>>>>                return ret;
>>>>>        }
>>>>>
>>>>> +    get_online_cpus();
>>>>> +    /*
>>>>> +     * Someone might online the cpu again at this point. So we
>>>>> check that
>>>>> +     * cpu has been onlined or not. If cpu is online, we try to
>>>>> offline
>>>>> +     * the cpu again.
>>>>> +     */
>>>>> +    if (cpu_online(pr->id)) {
>>>>
>>>> How about this:
>>>>      if (unlikely(cpu_online(pr->id)) {
>>>> since the probability of this happening is quite small...
>>>>
>>>>> +        put_online_cpus();
>>>>> +        goto retry;
>>>>> +    }
>>>>>        arch_unregister_cpu(pr->id);
>>>>>        acpi_unmap_lsapic(pr->id);
>>>>> +    put_online_cpus();
>>>>>        return ret;
>>>>>    }
>>>>
>>>> This retry logic doesn't look elegant, but I don't see any better
>>>> method :-(
>>>
>>> Another possible option is to fail the request instead of retrying it.
>>
>> Good idea!! I'll update it.
>>
>>>
>>> It would be quite challenging to allow on-lining and off-lining
>>> operations to run concurrently.  In fact, even if we close this window,
>>> there is yet another window right after the new put_online_cpus() call.
>>
>> I think if we close the window, another window does not open.
>> acpi_unmap_lsapic() sets cpu_present mask to false before new
>> put_online_cpus()
>> is called. So even if _cpu_up() is called, the function returns -EINAVL by
>> following added code.
>>
>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>       unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>       struct task_struct *idle;
>>
>> -    if (cpu_online(cpu) || !cpu_present(cpu))
>> -        return -EINVAL;
>> -
>>       cpu_hotplug_begin();
>>
>> +    if (cpu_online(cpu) || !cpu_present(cpu)) {
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +
>>
>
> Right. Yasuaki's patch will ensure that there are no more race conditions
> because it does the cpu_present() check after taking the cpu_hotplug.lock.
> So I think it is safe and still doable from the kernel's perspective.
>
> But the question is, "should we do it?" I think Toshi's suggestion of failing
> the hot-remove request (if we find that the cpu has been onlined again by some
> other task) sounds like a good idea for another reason: cpu hotplug is not
> initiated by the OS by itself; its requested by the user; so if something is
> onlining the cpu back again, the user better take a second look and decide
> what exactly he wants to do with that cpu - whether keep it online or
> hot-remove it out.

I think so too. Failing the hot-remove request is good idea.

>
> Trying to online as well as hot-remove the same cpu simultaneously sounds like
> a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
> (ie., failing that request) would give a warning to the user and a chance to
> reflect upon what exactly he wants to do with the cpu.
>
> So, IMHO, we should protect against the race condition (between cpu_up and
> hot-remove) but choose to fail the hot-remove request, and add a comment saying
> why we chose to fail the request, even though we could have gone ahead and
> completed it.

I have already sent 2nd version of the patch. But the warning message is
not included in the patch. So I will add the warning message into 3rd
version.

Thanks,
Yasuaki Ishimatsu

> Regards,
> Srivatsa S. Bhat
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
@ 2012-07-10  8:23               ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Yasuaki Ishimatsu @ 2012-07-10  8:23 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: Toshi Kani, linux-acpi, lenb, linux-kernel

Hi Srivatsa,

2012/07/10 16:57, Srivatsa S. Bhat wrote:
> On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
>> Hi Toshi,
>>
>> 2012/07/10 6:15, Toshi Kani wrote:
>>> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>>>> Hi Srivatsa,
>>>>>
>>>>> Thank you for your reviewing.
>>>>>
>>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
>>>>>>> remove the cpu.
>>>>>>
>>>>>> Ouch!
>>>>>>
>>>>>>> But in this case, it should return error number since some process
>>>>>>> may run on
>>>>>>> the cpu. If the cpu has a running process and the cpu is turned
>>>>>>> the power off,
>>>>>>> the system cannot work well.
>>>>>>>
>>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>>
>>>>>>> ---
>>>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c
>>>>>>> 2012-06-25 04:53:04.000000000 +0900
>>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-05
>>>>>>> 21:02:58.711285382 +0900
>>>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>>>     static int acpi_processor_remove(struct acpi_device *device,
>>>>>>> int type)
>>>>>>>     {
>>>>>>>         struct acpi_processor *pr = NULL;
>>>>>>> -
>>>>>>> +    int ret;
>>>>>>>
>>>>>>>         if (!device || !acpi_driver_data(device))
>>>>>>>             return -EINVAL;
>>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>>>             goto free;
>>>>>>>
>>>>>>>         if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>>>> -        if (acpi_processor_handle_eject(pr))
>>>>>>> -            return -EINVAL;
>>>>>>> +        ret = acpi_processor_handle_eject(pr);
>>>>>>> +        if (ret)
>>>>>>> +            return ret;
>>>>>>>         }
>>>>>>>
>>>>>>>         acpi_processor_power_exit(pr, device);
>>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>>>
>>>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>>>     {
>>>>>>> -    if (cpu_online(pr->id))
>>>>>>> -        cpu_down(pr->id);
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    if (cpu_online(pr->id)) {
>>>>>>> +        ret = cpu_down(pr->id);
>>>>>>> +        if (ret)
>>>>>>> +            return ret;
>>>>>>> +    }
>>>>>>>
>>>>>>
>>>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>>>> from onlining that same cpu again, at this point?
>>>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>>>> from messing with CPU hotplug at the same time.
>>>>>
>>>>> If I understand your comment by mistake, please let me know.
>>>>> If the contents is wrapped a inside
>>>>> get_online_cpus()/put_online_cpus() block
>>>>> as below, cpu_down() will stop since cpu_down() calls
>>>>> cpu_hotplug_begin() and
>>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>>>
>>>>
>>>> You are right. Sorry, I overlooked that.
>>>>
>>>>> +    get_online_cpus()
>>>>> +    if (cpu_online(pr->id)) {
>>>>> +        ret = cpu_down(pr->id);
>>>>> +        if (ret)
>>>>> +            return ret;
>>>>> +    }
>>>>> +    put_online_cpus()
>>>>>
>>>>> I think following patch can prevent it correctly. How about the patch?
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>>>    kernel/cpu.c                    |    8 +++++---
>>>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c    2012-07-09
>>>>> 09:59:01.280211202 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-09
>>>>> 11:05:34.559859236 +0900
>>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>>>    {
>>>>>        int ret;
>>>>>
>>>>> +retry:
>>>>>        if (cpu_online(pr->id)) {
>>>>>            ret = cpu_down(pr->id);
>>>>>            if (ret)
>>>>>                return ret;
>>>>>        }
>>>>>
>>>>> +    get_online_cpus();
>>>>> +    /*
>>>>> +     * Someone might online the cpu again at this point. So we
>>>>> check that
>>>>> +     * cpu has been onlined or not. If cpu is online, we try to
>>>>> offline
>>>>> +     * the cpu again.
>>>>> +     */
>>>>> +    if (cpu_online(pr->id)) {
>>>>
>>>> How about this:
>>>>      if (unlikely(cpu_online(pr->id)) {
>>>> since the probability of this happening is quite small...
>>>>
>>>>> +        put_online_cpus();
>>>>> +        goto retry;
>>>>> +    }
>>>>>        arch_unregister_cpu(pr->id);
>>>>>        acpi_unmap_lsapic(pr->id);
>>>>> +    put_online_cpus();
>>>>>        return ret;
>>>>>    }
>>>>
>>>> This retry logic doesn't look elegant, but I don't see any better
>>>> method :-(
>>>
>>> Another possible option is to fail the request instead of retrying it.
>>
>> Good idea!! I'll update it.
>>
>>>
>>> It would be quite challenging to allow on-lining and off-lining
>>> operations to run concurrently.  In fact, even if we close this window,
>>> there is yet another window right after the new put_online_cpus() call.
>>
>> I think if we close the window, another window does not open.
>> acpi_unmap_lsapic() sets cpu_present mask to false before new
>> put_online_cpus()
>> is called. So even if _cpu_up() is called, the function returns -EINAVL by
>> following added code.
>>
>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>       unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>       struct task_struct *idle;
>>
>> -    if (cpu_online(cpu) || !cpu_present(cpu))
>> -        return -EINVAL;
>> -
>>       cpu_hotplug_begin();
>>
>> +    if (cpu_online(cpu) || !cpu_present(cpu)) {
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +
>>
>
> Right. Yasuaki's patch will ensure that there are no more race conditions
> because it does the cpu_present() check after taking the cpu_hotplug.lock.
> So I think it is safe and still doable from the kernel's perspective.
>
> But the question is, "should we do it?" I think Toshi's suggestion of failing
> the hot-remove request (if we find that the cpu has been onlined again by some
> other task) sounds like a good idea for another reason: cpu hotplug is not
> initiated by the OS by itself; its requested by the user; so if something is
> onlining the cpu back again, the user better take a second look and decide
> what exactly he wants to do with that cpu - whether keep it online or
> hot-remove it out.

I think so too. Failing the hot-remove request is good idea.

>
> Trying to online as well as hot-remove the same cpu simultaneously sounds like
> a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
> (ie., failing that request) would give a warning to the user and a chance to
> reflect upon what exactly he wants to do with the cpu.
>
> So, IMHO, we should protect against the race condition (between cpu_up and
> hot-remove) but choose to fail the hot-remove request, and add a comment saying
> why we chose to fail the request, even though we could have gone ahead and
> completed it.

I have already sent 2nd version of the patch. But the warning message is
not included in the patch. So I will add the warning message into 3rd
version.

Thanks,
Yasuaki Ishimatsu

> Regards,
> Srivatsa S. Bhat
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails
  2012-07-10  7:57           ` Srivatsa S. Bhat
  2012-07-10  8:23               ` Yasuaki Ishimatsu
@ 2012-07-10 16:32             ` Toshi Kani
  1 sibling, 0 replies; 21+ messages in thread
From: Toshi Kani @ 2012-07-10 16:32 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: Yasuaki Ishimatsu, linux-acpi, lenb, linux-kernel

On Tue, 2012-07-10 at 13:27 +0530, Srivatsa S. Bhat wrote:
> On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
> > Hi Toshi,
> > 
> > 2012/07/10 6:15, Toshi Kani wrote:
> >> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
> >>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
> >>>> Hi Srivatsa,
> >>>>
> >>>> Thank you for your reviewing.
> >>>>
> >>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
> >>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
> >>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
> >>>>>> remove the cpu.
> >>>>>
> >>>>> Ouch!
> >>>>>
> >>>>>> But in this case, it should return error number since some process
> >>>>>> may run on
> >>>>>> the cpu. If the cpu has a running process and the cpu is turned
> >>>>>> the power off,
> >>>>>> the system cannot work well.
> >>>>>>
> >>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> >>>>>>
> >>>>>> ---
> >>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
> >>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>>>>> ===================================================================
> >>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c   
> >>>>>> 2012-06-25 04:53:04.000000000 +0900
> >>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-05
> >>>>>> 21:02:58.711285382 +0900
> >>>>>> @@ -610,7 +610,7 @@ err_free_pr:
> >>>>>>    static int acpi_processor_remove(struct acpi_device *device,
> >>>>>> int type)
> >>>>>>    {
> >>>>>>        struct acpi_processor *pr = NULL;
> >>>>>> -
> >>>>>> +    int ret;
> >>>>>>
> >>>>>>        if (!device || !acpi_driver_data(device))
> >>>>>>            return -EINVAL;
> >>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
> >>>>>>            goto free;
> >>>>>>
> >>>>>>        if (type == ACPI_BUS_REMOVAL_EJECT) {
> >>>>>> -        if (acpi_processor_handle_eject(pr))
> >>>>>> -            return -EINVAL;
> >>>>>> +        ret = acpi_processor_handle_eject(pr);
> >>>>>> +        if (ret)
> >>>>>> +            return ret;
> >>>>>>        }
> >>>>>>
> >>>>>>        acpi_processor_power_exit(pr, device);
> >>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
> >>>>>>
> >>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
> >>>>>>    {
> >>>>>> -    if (cpu_online(pr->id))
> >>>>>> -        cpu_down(pr->id);
> >>>>>> +    int ret;
> >>>>>> +
> >>>>>> +    if (cpu_online(pr->id)) {
> >>>>>> +        ret = cpu_down(pr->id);
> >>>>>> +        if (ret)
> >>>>>> +            return ret;
> >>>>>> +    }
> >>>>>>
> >>>>>
> >>>>> Strictly speaking, this is not thorough enough. What prevents someone
> >>>>> from onlining that same cpu again, at this point?
> >>>>> So, IMHO, you need to wrap the contents of this function inside a
> >>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> >>>>> from messing with CPU hotplug at the same time.
> >>>>
> >>>> If I understand your comment by mistake, please let me know.
> >>>> If the contents is wrapped a inside
> >>>> get_online_cpus()/put_online_cpus() block
> >>>> as below, cpu_down() will stop since cpu_down() calls
> >>>> cpu_hotplug_begin() and
> >>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
> >>>>
> >>>
> >>> You are right. Sorry, I overlooked that.
> >>>
> >>>> +    get_online_cpus()
> >>>> +    if (cpu_online(pr->id)) {
> >>>> +        ret = cpu_down(pr->id);
> >>>> +        if (ret)
> >>>> +            return ret;
> >>>> +    }
> >>>> +    put_online_cpus()
> >>>>
> >>>> I think following patch can prevent it correctly. How about the patch?
> >>>>
> >>>> ---
> >>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
> >>>>   kernel/cpu.c                    |    8 +++++---
> >>>>   2 files changed, 17 insertions(+), 3 deletions(-)
> >>>>
> >>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>>> ===================================================================
> >>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c    2012-07-09
> >>>> 09:59:01.280211202 +0900
> >>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-09
> >>>> 11:05:34.559859236 +0900
> >>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
> >>>>   {
> >>>>       int ret;
> >>>>
> >>>> +retry:
> >>>>       if (cpu_online(pr->id)) {
> >>>>           ret = cpu_down(pr->id);
> >>>>           if (ret)
> >>>>               return ret;
> >>>>       }
> >>>>
> >>>> +    get_online_cpus();
> >>>> +    /*
> >>>> +     * Someone might online the cpu again at this point. So we
> >>>> check that
> >>>> +     * cpu has been onlined or not. If cpu is online, we try to
> >>>> offline
> >>>> +     * the cpu again.
> >>>> +     */
> >>>> +    if (cpu_online(pr->id)) {
> >>>
> >>> How about this:
> >>>     if (unlikely(cpu_online(pr->id)) {
> >>> since the probability of this happening is quite small...
> >>>
> >>>> +        put_online_cpus();
> >>>> +        goto retry;
> >>>> +    }
> >>>>       arch_unregister_cpu(pr->id);
> >>>>       acpi_unmap_lsapic(pr->id);
> >>>> +    put_online_cpus();
> >>>>       return ret;
> >>>>   }
> >>>
> >>> This retry logic doesn't look elegant, but I don't see any better
> >>> method :-(
> >>
> >> Another possible option is to fail the request instead of retrying it.
> > 
> > Good idea!! I'll update it.

Sounds good.  Thanks Yasuaki for updating it.


> >> It would be quite challenging to allow on-lining and off-lining
> >> operations to run concurrently.  In fact, even if we close this window,
> >> there is yet another window right after the new put_online_cpus() call.
> > 
> > I think if we close the window, another window does not open.
> > acpi_unmap_lsapic() sets cpu_present mask to false before new
> > put_online_cpus()
> > is called. So even if _cpu_up() is called, the function returns -EINAVL by
> > following added code.
> > 
> > @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
> >      unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
> >      struct task_struct *idle;
> > 
> > -    if (cpu_online(cpu) || !cpu_present(cpu))
> > -        return -EINVAL;
> > -
> >      cpu_hotplug_begin();
> > 
> > +    if (cpu_online(cpu) || !cpu_present(cpu)) {
> > +        ret = -EINVAL;
> > +        goto out;
> > +    }
> > +
> > 
> 
> Right. Yasuaki's patch will ensure that there are no more race conditions
> because it does the cpu_present() check after taking the cpu_hotplug.lock.
> So I think it is safe and still doable from the kernel's perspective.
> 
> But the question is, "should we do it?" I think Toshi's suggestion of failing
> the hot-remove request (if we find that the cpu has been onlined again by some
> other task) sounds like a good idea for another reason: cpu hotplug is not
> initiated by the OS by itself; its requested by the user; so if something is
> onlining the cpu back again, the user better take a second look and decide
> what exactly he wants to do with that cpu - whether keep it online or
> hot-remove it out.
> 
> Trying to online as well as hot-remove the same cpu simultaneously sounds like
> a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
> (ie., failing that request) would give a warning to the user and a chance to
> reflect upon what exactly he wants to do with the cpu.
> 
> So, IMHO, we should protect against the race condition (between cpu_up and
> hot-remove) but choose to fail the hot-remove request, and add a comment saying
> why we chose to fail the request, even though we could have gone ahead and
> completed it.

Agreed.  Thanks for the nice summary, Srivatsa!

Thanks,
-Toshi



> Regards,
> Srivatsa S. Bhat
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-07-10 16:36 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-06  3:16 [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails Yasuaki Ishimatsu
2012-07-06  3:16 ` Yasuaki Ishimatsu
2012-07-06  3:19 ` [PATCH 2/2] acpi_bus_trim() stops removing devices when failing to remove the device Yasuaki Ishimatsu
2012-07-06  3:19   ` Yasuaki Ishimatsu
2012-07-06  9:51 ` [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down() fails Srivatsa S. Bhat
2012-07-09  2:31   ` Yasuaki Ishimatsu
2012-07-09  2:31     ` Yasuaki Ishimatsu
2012-07-09 11:25     ` Srivatsa S. Bhat
2012-07-09 21:15       ` Toshi Kani
2012-07-10  4:57         ` Yasuaki Ishimatsu
2012-07-10  4:57           ` Yasuaki Ishimatsu
2012-07-10  7:57           ` Srivatsa S. Bhat
2012-07-10  8:23             ` Yasuaki Ishimatsu
2012-07-10  8:23               ` Yasuaki Ishimatsu
2012-07-10 16:32             ` Toshi Kani
2012-07-10  0:13       ` Yasuaki Ishimatsu
2012-07-10  0:13         ` Yasuaki Ishimatsu
2012-07-10  5:14         ` Yasuaki Ishimatsu
2012-07-10  5:14           ` Yasuaki Ishimatsu
2012-07-10  6:52           ` Srivatsa S. Bhat
2012-07-10  6:51         ` Srivatsa S. Bhat

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.