All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-11 18:52 ` Raul E Rangel
  0 siblings, 0 replies; 12+ messages in thread
From: Raul E Rangel @ 2019-04-11 18:52 UTC (permalink / raw)
  To: linux-usb
  Cc: groeck, oneukum, djkurtz, zwisler, Raul E Rangel,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Alan Stern,
	Dmitry Torokhov, Suwan Kim, linux-kernel, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Greg Kroah-Hartman, Mathias Nyman

This change will send a CHANGE event to udev with the DEAD environment
variable set when the HC dies. I chose this instead of any of the other
udev events because it's representing a state change in the host
controller. The only other event that might have fit was OFFLINE, but
that seems to be used for hot-removal and it implies the device could
come ONLINE again.

By notifying user space the appropriate policies can be applied.
i.e.,
 * Collect error logs.
 * Notify the user that USB is no longer functional.
 * Perform a graceful reboot.

Signed-off-by: Raul E Rangel <rrangel@chromium.org>
---
I wasn't able to find any good examples of other drivers sending a dead
notification.

Use an EVENT= format
https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497

Uses SDEV_MEDIA_CHANGE=
https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318

Uses ERROR=1.
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
I'm not a fan because it doesn't signal what the error was.

We could change the DEAD=1 event to maybe ERROR=1?

Also where would be a good place to document this?

Also thanks for the review. This is my first kernel patch so forgive me
if I get the workflow wrong.

Changes in v2:
- Check that the root hub still exists before sending the uevent.
- Ensure died_work has completed before deallocating.

 drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
 include/linux/usb/hcd.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 975d7c1288e3..7835f1a3647d 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
 	return status;
 }
 
+
+/**
+ * hcd_died_work - Workqueue routine for root-hub has died.
+ * @hcd: primary host controller for this root hub.
+ *
+ * Do not call with the shared_hcd.
+ * */
+static void hcd_died_work(struct work_struct *work)
+{
+	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
+
+	mutex_lock(&usb_bus_idr_lock);
+
+	if (hcd->self.root_hub)
+		/* Notify user space that the host controller has died */
+		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
+				   (char *[]){ "DEAD=1", NULL });
+
+	mutex_unlock(&usb_bus_idr_lock);
+}
+
 /* Workqueue routine for root-hub remote wakeup */
 static void hcd_resume_work(struct work_struct *work)
 {
@@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
 			usb_kick_hub_wq(hcd->self.root_hub);
 		}
 	}
+
+	/* Handle the case where this function gets called with a shared HCD */
+	if (usb_hcd_is_primary_hcd(hcd))
+		schedule_work(&hcd->died_work);
+	else
+		schedule_work(&hcd->primary_hcd->died_work);
+
 	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
 	/* Make sure that the other roothub is also deallocated. */
 }
@@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
 	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
 #endif
 
+	INIT_WORK(&hcd->died_work, hcd_died_work);
+
 	hcd->driver = driver;
 	hcd->speed = driver->flags & HCD_MASK;
 	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
@@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
 #ifdef CONFIG_PM
 	cancel_work_sync(&hcd->wakeup_work);
 #endif
+	cancel_work_sync(&hcd->died_work);
 	mutex_lock(&usb_bus_idr_lock);
 	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
 	mutex_unlock(&usb_bus_idr_lock);
@@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
 #ifdef CONFIG_PM
 	cancel_work_sync(&hcd->wakeup_work);
 #endif
+	cancel_work_sync(&hcd->died_work);
 
 	mutex_lock(&usb_bus_idr_lock);
 	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
index 695931b03684..ae51d5bd1dfc 100644
--- a/include/linux/usb/hcd.h
+++ b/include/linux/usb/hcd.h
@@ -98,6 +98,7 @@ struct usb_hcd {
 #ifdef CONFIG_PM
 	struct work_struct	wakeup_work;	/* for remote wakeup */
 #endif
+	struct work_struct	died_work;	/* for dying */
 
 	/*
 	 * hardware info/state
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-11 18:52 ` Raul E Rangel
  0 siblings, 0 replies; 12+ messages in thread
From: Raul E Rangel @ 2019-04-11 18:52 UTC (permalink / raw)
  To: linux-usb
  Cc: groeck, oneukum, djkurtz, zwisler, Raul E Rangel,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Alan Stern,
	Dmitry Torokhov, Suwan Kim, linux-kernel, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Greg Kroah-Hartman, Mathias Nyman

This change will send a CHANGE event to udev with the DEAD environment
variable set when the HC dies. I chose this instead of any of the other
udev events because it's representing a state change in the host
controller. The only other event that might have fit was OFFLINE, but
that seems to be used for hot-removal and it implies the device could
come ONLINE again.

By notifying user space the appropriate policies can be applied.
i.e.,
 * Collect error logs.
 * Notify the user that USB is no longer functional.
 * Perform a graceful reboot.

Signed-off-by: Raul E Rangel <rrangel@chromium.org>
---
I wasn't able to find any good examples of other drivers sending a dead
notification.

Use an EVENT= format
https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497

Uses SDEV_MEDIA_CHANGE=
https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318

Uses ERROR=1.
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
I'm not a fan because it doesn't signal what the error was.

We could change the DEAD=1 event to maybe ERROR=1?

Also where would be a good place to document this?

Also thanks for the review. This is my first kernel patch so forgive me
if I get the workflow wrong.

Changes in v2:
- Check that the root hub still exists before sending the uevent.
- Ensure died_work has completed before deallocating.

 drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
 include/linux/usb/hcd.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 975d7c1288e3..7835f1a3647d 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
 	return status;
 }
 
+
+/**
+ * hcd_died_work - Workqueue routine for root-hub has died.
+ * @hcd: primary host controller for this root hub.
+ *
+ * Do not call with the shared_hcd.
+ * */
+static void hcd_died_work(struct work_struct *work)
+{
+	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
+
+	mutex_lock(&usb_bus_idr_lock);
+
+	if (hcd->self.root_hub)
+		/* Notify user space that the host controller has died */
+		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
+				   (char *[]){ "DEAD=1", NULL });
+
+	mutex_unlock(&usb_bus_idr_lock);
+}
+
 /* Workqueue routine for root-hub remote wakeup */
 static void hcd_resume_work(struct work_struct *work)
 {
@@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
 			usb_kick_hub_wq(hcd->self.root_hub);
 		}
 	}
+
+	/* Handle the case where this function gets called with a shared HCD */
+	if (usb_hcd_is_primary_hcd(hcd))
+		schedule_work(&hcd->died_work);
+	else
+		schedule_work(&hcd->primary_hcd->died_work);
+
 	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
 	/* Make sure that the other roothub is also deallocated. */
 }
@@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
 	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
 #endif
 
+	INIT_WORK(&hcd->died_work, hcd_died_work);
+
 	hcd->driver = driver;
 	hcd->speed = driver->flags & HCD_MASK;
 	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
@@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
 #ifdef CONFIG_PM
 	cancel_work_sync(&hcd->wakeup_work);
 #endif
+	cancel_work_sync(&hcd->died_work);
 	mutex_lock(&usb_bus_idr_lock);
 	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
 	mutex_unlock(&usb_bus_idr_lock);
@@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
 #ifdef CONFIG_PM
 	cancel_work_sync(&hcd->wakeup_work);
 #endif
+	cancel_work_sync(&hcd->died_work);
 
 	mutex_lock(&usb_bus_idr_lock);
 	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
index 695931b03684..ae51d5bd1dfc 100644
--- a/include/linux/usb/hcd.h
+++ b/include/linux/usb/hcd.h
@@ -98,6 +98,7 @@ struct usb_hcd {
 #ifdef CONFIG_PM
 	struct work_struct	wakeup_work;	/* for remote wakeup */
 #endif
+	struct work_struct	died_work;	/* for dying */
 
 	/*
 	 * hardware info/state

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-15 15:56   ` Alan Stern
  0 siblings, 0 replies; 12+ messages in thread
From: Alan Stern @ 2019-04-15 15:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Raul E Rangel
  Cc: USB list, groeck, Oliver Neukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Dmitry Torokhov,
	Suwan Kim, Kernel development list, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Mathias Nyman

On Thu, 11 Apr 2019, Raul E Rangel wrote:

> This change will send a CHANGE event to udev with the DEAD environment
> variable set when the HC dies. I chose this instead of any of the other
> udev events because it's representing a state change in the host
> controller. The only other event that might have fit was OFFLINE, but
> that seems to be used for hot-removal and it implies the device could
> come ONLINE again.
> 
> By notifying user space the appropriate policies can be applied.
> i.e.,
>  * Collect error logs.
>  * Notify the user that USB is no longer functional.
>  * Perform a graceful reboot.
> 
> Signed-off-by: Raul E Rangel <rrangel@chromium.org>
> ---
> I wasn't able to find any good examples of other drivers sending a dead
> notification.
> 
> Use an EVENT= format
> https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
> https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497
> 
> Uses SDEV_MEDIA_CHANGE=
> https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318
> 
> Uses ERROR=1.
> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
> I'm not a fan because it doesn't signal what the error was.
> 
> We could change the DEAD=1 event to maybe ERROR=1?
> 
> Also where would be a good place to document this?
> 
> Also thanks for the review. This is my first kernel patch so forgive me
> if I get the workflow wrong.
> 
> Changes in v2:
> - Check that the root hub still exists before sending the uevent.
> - Ensure died_work has completed before deallocating.
> 
>  drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
>  include/linux/usb/hcd.h |  1 +
>  2 files changed, 33 insertions(+)

Technically this patch looks okay to me.  However, Greg KH may have
some comments regarding the new uevent this introduces.

Acked-by: Alan Stern <stern@rowland.harvard.edu>

Alan Stern


> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 975d7c1288e3..7835f1a3647d 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
>  	return status;
>  }
>  
> +
> +/**
> + * hcd_died_work - Workqueue routine for root-hub has died.
> + * @hcd: primary host controller for this root hub.
> + *
> + * Do not call with the shared_hcd.
> + * */
> +static void hcd_died_work(struct work_struct *work)
> +{
> +	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
> +
> +	mutex_lock(&usb_bus_idr_lock);
> +
> +	if (hcd->self.root_hub)
> +		/* Notify user space that the host controller has died */
> +		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
> +				   (char *[]){ "DEAD=1", NULL });
> +
> +	mutex_unlock(&usb_bus_idr_lock);
> +}
> +
>  /* Workqueue routine for root-hub remote wakeup */
>  static void hcd_resume_work(struct work_struct *work)
>  {
> @@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
>  			usb_kick_hub_wq(hcd->self.root_hub);
>  		}
>  	}
> +
> +	/* Handle the case where this function gets called with a shared HCD */
> +	if (usb_hcd_is_primary_hcd(hcd))
> +		schedule_work(&hcd->died_work);
> +	else
> +		schedule_work(&hcd->primary_hcd->died_work);
> +
>  	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
>  	/* Make sure that the other roothub is also deallocated. */
>  }
> @@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
>  	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
>  #endif
>  
> +	INIT_WORK(&hcd->died_work, hcd_died_work);
> +
>  	hcd->driver = driver;
>  	hcd->speed = driver->flags & HCD_MASK;
>  	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
> @@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
>  	mutex_unlock(&usb_bus_idr_lock);
> @@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
> index 695931b03684..ae51d5bd1dfc 100644
> --- a/include/linux/usb/hcd.h
> +++ b/include/linux/usb/hcd.h
> @@ -98,6 +98,7 @@ struct usb_hcd {
>  #ifdef CONFIG_PM
>  	struct work_struct	wakeup_work;	/* for remote wakeup */
>  #endif
> +	struct work_struct	died_work;	/* for dying */
>  
>  	/*
>  	 * hardware info/state
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-15 15:56   ` Alan Stern
  0 siblings, 0 replies; 12+ messages in thread
From: Alan Stern @ 2019-04-15 15:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Raul E Rangel
  Cc: USB list, groeck, Oliver Neukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Dmitry Torokhov,
	Suwan Kim, Kernel development list, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Mathias Nyman

On Thu, 11 Apr 2019, Raul E Rangel wrote:

> This change will send a CHANGE event to udev with the DEAD environment
> variable set when the HC dies. I chose this instead of any of the other
> udev events because it's representing a state change in the host
> controller. The only other event that might have fit was OFFLINE, but
> that seems to be used for hot-removal and it implies the device could
> come ONLINE again.
> 
> By notifying user space the appropriate policies can be applied.
> i.e.,
>  * Collect error logs.
>  * Notify the user that USB is no longer functional.
>  * Perform a graceful reboot.
> 
> Signed-off-by: Raul E Rangel <rrangel@chromium.org>
> ---
> I wasn't able to find any good examples of other drivers sending a dead
> notification.
> 
> Use an EVENT= format
> https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
> https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497
> 
> Uses SDEV_MEDIA_CHANGE=
> https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318
> 
> Uses ERROR=1.
> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
> I'm not a fan because it doesn't signal what the error was.
> 
> We could change the DEAD=1 event to maybe ERROR=1?
> 
> Also where would be a good place to document this?
> 
> Also thanks for the review. This is my first kernel patch so forgive me
> if I get the workflow wrong.
> 
> Changes in v2:
> - Check that the root hub still exists before sending the uevent.
> - Ensure died_work has completed before deallocating.
> 
>  drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
>  include/linux/usb/hcd.h |  1 +
>  2 files changed, 33 insertions(+)

Technically this patch looks okay to me.  However, Greg KH may have
some comments regarding the new uevent this introduces.

Acked-by: Alan Stern <stern@rowland.harvard.edu>

Alan Stern


> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 975d7c1288e3..7835f1a3647d 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
>  	return status;
>  }
>  
> +
> +/**
> + * hcd_died_work - Workqueue routine for root-hub has died.
> + * @hcd: primary host controller for this root hub.
> + *
> + * Do not call with the shared_hcd.
> + * */
> +static void hcd_died_work(struct work_struct *work)
> +{
> +	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
> +
> +	mutex_lock(&usb_bus_idr_lock);
> +
> +	if (hcd->self.root_hub)
> +		/* Notify user space that the host controller has died */
> +		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
> +				   (char *[]){ "DEAD=1", NULL });
> +
> +	mutex_unlock(&usb_bus_idr_lock);
> +}
> +
>  /* Workqueue routine for root-hub remote wakeup */
>  static void hcd_resume_work(struct work_struct *work)
>  {
> @@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
>  			usb_kick_hub_wq(hcd->self.root_hub);
>  		}
>  	}
> +
> +	/* Handle the case where this function gets called with a shared HCD */
> +	if (usb_hcd_is_primary_hcd(hcd))
> +		schedule_work(&hcd->died_work);
> +	else
> +		schedule_work(&hcd->primary_hcd->died_work);
> +
>  	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
>  	/* Make sure that the other roothub is also deallocated. */
>  }
> @@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
>  	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
>  #endif
>  
> +	INIT_WORK(&hcd->died_work, hcd_died_work);
> +
>  	hcd->driver = driver;
>  	hcd->speed = driver->flags & HCD_MASK;
>  	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
> @@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
>  	mutex_unlock(&usb_bus_idr_lock);
> @@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
> index 695931b03684..ae51d5bd1dfc 100644
> --- a/include/linux/usb/hcd.h
> +++ b/include/linux/usb/hcd.h
> @@ -98,6 +98,7 @@ struct usb_hcd {
>  #ifdef CONFIG_PM
>  	struct work_struct	wakeup_work;	/* for remote wakeup */
>  #endif
> +	struct work_struct	died_work;	/* for dying */
>  
>  	/*
>  	 * hardware info/state
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-16  9:54   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 12+ messages in thread
From: Greg Kroah-Hartman @ 2019-04-16  9:54 UTC (permalink / raw)
  To: Raul E Rangel
  Cc: linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Alan Stern,
	Dmitry Torokhov, Suwan Kim, linux-kernel, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Mathias Nyman

On Thu, Apr 11, 2019 at 12:52:11PM -0600, Raul E Rangel wrote:
> This change will send a CHANGE event to udev with the DEAD environment
> variable set when the HC dies. I chose this instead of any of the other
> udev events because it's representing a state change in the host
> controller. The only other event that might have fit was OFFLINE, but
> that seems to be used for hot-removal and it implies the device could
> come ONLINE again.

Is "DEAD" used by any other uevents?

> By notifying user space the appropriate policies can be applied.
> i.e.,
>  * Collect error logs.
>  * Notify the user that USB is no longer functional.
>  * Perform a graceful reboot.

What userspace code uses this new uevent to do any of this?

I think "OFFLINE" is a bit better here, it does not always imply that it
can come online again.

> Signed-off-by: Raul E Rangel <rrangel@chromium.org>
> ---
> I wasn't able to find any good examples of other drivers sending a dead
> notification.
> 
> Use an EVENT= format
> https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
> https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497
> 
> Uses SDEV_MEDIA_CHANGE=
> https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318
> 
> Uses ERROR=1.
> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
> I'm not a fan because it doesn't signal what the error was.
> 
> We could change the DEAD=1 event to maybe ERROR=1?

"ERROR=1" is worse than "DEAD=1" :(

> Also where would be a good place to document this?

Documentation/ABI/ is a good start.

> Also thanks for the review. This is my first kernel patch so forgive me
> if I get the workflow wrong.
> 
> Changes in v2:
> - Check that the root hub still exists before sending the uevent.
> - Ensure died_work has completed before deallocating.
> 
>  drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
>  include/linux/usb/hcd.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 975d7c1288e3..7835f1a3647d 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
>  	return status;
>  }
>  
> +
> +/**
> + * hcd_died_work - Workqueue routine for root-hub has died.
> + * @hcd: primary host controller for this root hub.
> + *
> + * Do not call with the shared_hcd.
> + * */

No need for kerneldoc fortting for a static function.

And your documentation isn't even correct, @hcd is not an argument to
this function :(

> +static void hcd_died_work(struct work_struct *work)
> +{
> +	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
> +
> +	mutex_lock(&usb_bus_idr_lock);

Why do you need to lock something that is "dead"?  And why is the idr
lock the correct one here?

> +
> +	if (hcd->self.root_hub)
> +		/* Notify user space that the host controller has died */
> +		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
> +				   (char *[]){ "DEAD=1", NULL });

declaring the envp in the function is cute, but please don't do that,
make it obvious what is happening here with a real variable.

> +
> +	mutex_unlock(&usb_bus_idr_lock);
> +}
> +
>  /* Workqueue routine for root-hub remote wakeup */
>  static void hcd_resume_work(struct work_struct *work)
>  {
> @@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
>  			usb_kick_hub_wq(hcd->self.root_hub);
>  		}
>  	}
> +
> +	/* Handle the case where this function gets called with a shared HCD */
> +	if (usb_hcd_is_primary_hcd(hcd))
> +		schedule_work(&hcd->died_work);
> +	else
> +		schedule_work(&hcd->primary_hcd->died_work);
> +
>  	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
>  	/* Make sure that the other roothub is also deallocated. */
>  }
> @@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
>  	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
>  #endif
>  
> +	INIT_WORK(&hcd->died_work, hcd_died_work);
> +
>  	hcd->driver = driver;
>  	hcd->speed = driver->flags & HCD_MASK;
>  	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
> @@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
>  	mutex_unlock(&usb_bus_idr_lock);
> @@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
> index 695931b03684..ae51d5bd1dfc 100644
> --- a/include/linux/usb/hcd.h
> +++ b/include/linux/usb/hcd.h
> @@ -98,6 +98,7 @@ struct usb_hcd {
>  #ifdef CONFIG_PM
>  	struct work_struct	wakeup_work;	/* for remote wakeup */
>  #endif
> +	struct work_struct	died_work;	/* for dying */

"For when the device dies"?

And have you ever hit this in the real world?  If so, what can you do
about it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-16  9:54   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 12+ messages in thread
From: Greg Kroah-Hartman @ 2019-04-16  9:54 UTC (permalink / raw)
  To: Raul E Rangel
  Cc: linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Alan Stern,
	Dmitry Torokhov, Suwan Kim, linux-kernel, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Mathias Nyman

On Thu, Apr 11, 2019 at 12:52:11PM -0600, Raul E Rangel wrote:
> This change will send a CHANGE event to udev with the DEAD environment
> variable set when the HC dies. I chose this instead of any of the other
> udev events because it's representing a state change in the host
> controller. The only other event that might have fit was OFFLINE, but
> that seems to be used for hot-removal and it implies the device could
> come ONLINE again.

Is "DEAD" used by any other uevents?

> By notifying user space the appropriate policies can be applied.
> i.e.,
>  * Collect error logs.
>  * Notify the user that USB is no longer functional.
>  * Perform a graceful reboot.

What userspace code uses this new uevent to do any of this?

I think "OFFLINE" is a bit better here, it does not always imply that it
can come online again.

> Signed-off-by: Raul E Rangel <rrangel@chromium.org>
> ---
> I wasn't able to find any good examples of other drivers sending a dead
> notification.
> 
> Use an EVENT= format
> https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
> https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497
> 
> Uses SDEV_MEDIA_CHANGE=
> https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318
> 
> Uses ERROR=1.
> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
> I'm not a fan because it doesn't signal what the error was.
> 
> We could change the DEAD=1 event to maybe ERROR=1?

"ERROR=1" is worse than "DEAD=1" :(

> Also where would be a good place to document this?

Documentation/ABI/ is a good start.

> Also thanks for the review. This is my first kernel patch so forgive me
> if I get the workflow wrong.
> 
> Changes in v2:
> - Check that the root hub still exists before sending the uevent.
> - Ensure died_work has completed before deallocating.
> 
>  drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
>  include/linux/usb/hcd.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 975d7c1288e3..7835f1a3647d 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
>  	return status;
>  }
>  
> +
> +/**
> + * hcd_died_work - Workqueue routine for root-hub has died.
> + * @hcd: primary host controller for this root hub.
> + *
> + * Do not call with the shared_hcd.
> + * */

No need for kerneldoc fortting for a static function.

And your documentation isn't even correct, @hcd is not an argument to
this function :(

> +static void hcd_died_work(struct work_struct *work)
> +{
> +	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
> +
> +	mutex_lock(&usb_bus_idr_lock);

Why do you need to lock something that is "dead"?  And why is the idr
lock the correct one here?

> +
> +	if (hcd->self.root_hub)
> +		/* Notify user space that the host controller has died */
> +		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
> +				   (char *[]){ "DEAD=1", NULL });

declaring the envp in the function is cute, but please don't do that,
make it obvious what is happening here with a real variable.

> +
> +	mutex_unlock(&usb_bus_idr_lock);
> +}
> +
>  /* Workqueue routine for root-hub remote wakeup */
>  static void hcd_resume_work(struct work_struct *work)
>  {
> @@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
>  			usb_kick_hub_wq(hcd->self.root_hub);
>  		}
>  	}
> +
> +	/* Handle the case where this function gets called with a shared HCD */
> +	if (usb_hcd_is_primary_hcd(hcd))
> +		schedule_work(&hcd->died_work);
> +	else
> +		schedule_work(&hcd->primary_hcd->died_work);
> +
>  	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
>  	/* Make sure that the other roothub is also deallocated. */
>  }
> @@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
>  	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
>  #endif
>  
> +	INIT_WORK(&hcd->died_work, hcd_died_work);
> +
>  	hcd->driver = driver;
>  	hcd->speed = driver->flags & HCD_MASK;
>  	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
> @@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
>  	mutex_unlock(&usb_bus_idr_lock);
> @@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
>  #ifdef CONFIG_PM
>  	cancel_work_sync(&hcd->wakeup_work);
>  #endif
> +	cancel_work_sync(&hcd->died_work);
>  
>  	mutex_lock(&usb_bus_idr_lock);
>  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
> index 695931b03684..ae51d5bd1dfc 100644
> --- a/include/linux/usb/hcd.h
> +++ b/include/linux/usb/hcd.h
> @@ -98,6 +98,7 @@ struct usb_hcd {
>  #ifdef CONFIG_PM
>  	struct work_struct	wakeup_work;	/* for remote wakeup */
>  #endif
> +	struct work_struct	died_work;	/* for dying */

"For when the device dies"?

And have you ever hit this in the real world?  If so, what can you do
about it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-17 17:53     ` Raul E Rangel
  0 siblings, 0 replies; 12+ messages in thread
From: Raul Rangel @ 2019-04-17 17:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Alan Stern,
	Dmitry Torokhov, Suwan Kim, linux-kernel, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Mathias Nyman, Raul Rangel

On Tue, Apr 16, 2019 at 11:54:29AM +0200, Greg Kroah-Hartman wrote:
> On Thu, Apr 11, 2019 at 12:52:11PM -0600, Raul E Rangel wrote:
> > This change will send a CHANGE event to udev with the DEAD environment
> > variable set when the HC dies. I chose this instead of any of the other
> > udev events because it's representing a state change in the host
> > controller. The only other event that might have fit was OFFLINE, but
> > that seems to be used for hot-removal and it implies the device could
> > come ONLINE again.
> 
> Is "DEAD" used by any other uevents?
No, I wasn't able to find any other events that notify when the host
controller has died.
> 
> > By notifying user space the appropriate policies can be applied.
> > i.e.,
> >  * Collect error logs.
> >  * Notify the user that USB is no longer functional.
> >  * Perform a graceful reboot.
> 
> What userspace code uses this new uevent to do any of this?
On ChromeOS we have an error reporter that listens for hardware errors:
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1560170
Though not implemented yet, I also  want to notify the user that USB is no
longer functional and they should restart.

> 
> I think "OFFLINE" is a bit better here, it does not always imply that it
> can come online again.

How about OFFLINE with ERROR=DEAD? I would like to report the problem,
but I'm OK if you want to just stick with OFFLINE.

> 
> > Signed-off-by: Raul E Rangel <rrangel@chromium.org>
> > ---
> > I wasn't able to find any good examples of other drivers sending a dead
> > notification.
> > 
> > Use an EVENT= format
> > https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
> > https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497
> > 
> > Uses SDEV_MEDIA_CHANGE=
> > https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318
> > 
> > Uses ERROR=1.
> > https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
> > I'm not a fan because it doesn't signal what the error was.
> > 
> > We could change the DEAD=1 event to maybe ERROR=1?
> 
> "ERROR=1" is worse than "DEAD=1" :(
> 
> > Also where would be a good place to document this?
> 
> Documentation/ABI/ is a good start.
I'll add something to Documentation/ABI/testing/xhci-uevent
> 
> > Also thanks for the review. This is my first kernel patch so forgive me
> > if I get the workflow wrong.
> > 
> > Changes in v2:
> > - Check that the root hub still exists before sending the uevent.
> > - Ensure died_work has completed before deallocating.
> > 
> >  drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
> >  include/linux/usb/hcd.h |  1 +
> >  2 files changed, 33 insertions(+)
> > 
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 975d7c1288e3..7835f1a3647d 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
> >  	return status;
> >  }
> >  
> > +
> > +/**
> > + * hcd_died_work - Workqueue routine for root-hub has died.
> > + * @hcd: primary host controller for this root hub.
> > + *
> > + * Do not call with the shared_hcd.
> > + * */
> 
> No need for kerneldoc fortting for a static function.
> 
> And your documentation isn't even correct, @hcd is not an argument to
> this function :(
Oops! Apparently I never updated the documentation when I refactored
this to use a worker.
> 
> > +static void hcd_died_work(struct work_struct *work)
> > +{
> > +	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
> > +
> > +	mutex_lock(&usb_bus_idr_lock);
> 
> Why do you need to lock something that is "dead"?  And why is the idr
> lock the correct one here?
We need to ensure that root_hub is not null. Though I'm not sure the
lock is entirely necessary in this case. usb_remove_hcd stops the work
item before it sets the rhdev to null. The reason I picked
usb_bus_idr_lock was because it's the same lock that usb_remove_hcd uses
when setting rhdev = NULL.

Alan, what do you think? Should I remove the lock?
> 
> > +
> > +	if (hcd->self.root_hub)
> > +		/* Notify user space that the host controller has died */
> > +		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
> > +				   (char *[]){ "DEAD=1", NULL });
> 
> declaring the envp in the function is cute, but please don't do that,
> make it obvious what is happening here with a real variable.
> 
I can do that.
> > +
> > +	mutex_unlock(&usb_bus_idr_lock);
> > +}
> > +
> >  /* Workqueue routine for root-hub remote wakeup */
> >  static void hcd_resume_work(struct work_struct *work)
> >  {
> > @@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
> >  			usb_kick_hub_wq(hcd->self.root_hub);
> >  		}
> >  	}
> > +
> > +	/* Handle the case where this function gets called with a shared HCD */
> > +	if (usb_hcd_is_primary_hcd(hcd))
> > +		schedule_work(&hcd->died_work);
> > +	else
> > +		schedule_work(&hcd->primary_hcd->died_work);
> > +
> >  	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
> >  	/* Make sure that the other roothub is also deallocated. */
> >  }
> > @@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
> >  	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
> >  #endif
> >  
> > +	INIT_WORK(&hcd->died_work, hcd_died_work);
> > +
> >  	hcd->driver = driver;
> >  	hcd->speed = driver->flags & HCD_MASK;
> >  	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
> > @@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
> >  #ifdef CONFIG_PM
> >  	cancel_work_sync(&hcd->wakeup_work);
> >  #endif
> > +	cancel_work_sync(&hcd->died_work);
> >  	mutex_lock(&usb_bus_idr_lock);
> >  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> >  	mutex_unlock(&usb_bus_idr_lock);
> > @@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
> >  #ifdef CONFIG_PM
> >  	cancel_work_sync(&hcd->wakeup_work);
> >  #endif
> > +	cancel_work_sync(&hcd->died_work);
> >  
> >  	mutex_lock(&usb_bus_idr_lock);
> >  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> > diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
> > index 695931b03684..ae51d5bd1dfc 100644
> > --- a/include/linux/usb/hcd.h
> > +++ b/include/linux/usb/hcd.h
> > @@ -98,6 +98,7 @@ struct usb_hcd {
> >  #ifdef CONFIG_PM
> >  	struct work_struct	wakeup_work;	/* for remote wakeup */
> >  #endif
> > +	struct work_struct	died_work;	/* for dying */
> 
> "For when the device dies"?
Will do
> 
> And have you ever hit this in the real world?  If so, what can you do
> about it?

So I've encountered this in two real world situations:
1) There is a firmware bug in the AMD Device 7812 xHC where if a device
is disconnected during a SET_ADDRESS command, the controller will lock
up. The xHCI spec says that when SET_ADDRESS times out, the Command
Abort bit should be asserted to cancel the command.
Since the firmware doesn't respond to the CA bit in this situation,
the command abort times out and the controller is deemed dead. The only
way to get out of this situation is to reboot :( I have tried HCRST, but
that is ignored as well.

2) My workstation using a Lewisburg USB 3.0 xHCI Controller.
We have some USB devices we use for debugging and testing our DUTs.
Ever since upgrading our workstations from 4.18 to 4.19 we get the
following error:
xhci_hcd 0000:00:14.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
xhci_hcd 0000:00:14.0: ep deq seg = 000000000552b658, deq ptr = 000000004895b2f8
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:14.0] fault addr 0 [fault reason 06] PTE Read access is not set
xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
xhci_hcd 0000:00:14.0: HC died; cleaning up

In this situation we also need to reboot. I haven't spent anytime trying
to track down the problem.

> 
> thanks,

Thanks for the through review and comments. I'll get a new patch
uploaded soon.
> 
> greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-17 17:53     ` Raul E Rangel
  0 siblings, 0 replies; 12+ messages in thread
From: Raul E Rangel @ 2019-04-17 17:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Alan Stern,
	Dmitry Torokhov, Suwan Kim, linux-kernel, Gustavo A. R. Silva,
	Miquel Raynal, Johan Hovold, Mathias Nyman, Raul Rangel

On Tue, Apr 16, 2019 at 11:54:29AM +0200, Greg Kroah-Hartman wrote:
> On Thu, Apr 11, 2019 at 12:52:11PM -0600, Raul E Rangel wrote:
> > This change will send a CHANGE event to udev with the DEAD environment
> > variable set when the HC dies. I chose this instead of any of the other
> > udev events because it's representing a state change in the host
> > controller. The only other event that might have fit was OFFLINE, but
> > that seems to be used for hot-removal and it implies the device could
> > come ONLINE again.
> 
> Is "DEAD" used by any other uevents?
No, I wasn't able to find any other events that notify when the host
controller has died.
> 
> > By notifying user space the appropriate policies can be applied.
> > i.e.,
> >  * Collect error logs.
> >  * Notify the user that USB is no longer functional.
> >  * Perform a graceful reboot.
> 
> What userspace code uses this new uevent to do any of this?
On ChromeOS we have an error reporter that listens for hardware errors:
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1560170
Though not implemented yet, I also  want to notify the user that USB is no
longer functional and they should restart.

> 
> I think "OFFLINE" is a bit better here, it does not always imply that it
> can come online again.

How about OFFLINE with ERROR=DEAD? I would like to report the problem,
but I'm OK if you want to just stick with OFFLINE.

> 
> > Signed-off-by: Raul E Rangel <rrangel@chromium.org>
> > ---
> > I wasn't able to find any good examples of other drivers sending a dead
> > notification.
> > 
> > Use an EVENT= format
> > https://github.com/torvalds/linux/blob/master/drivers/acpi/dock.c#L302
> > https://github.com/torvalds/linux/blob/master/drivers/net/wireless/ath/wil6210/interrupt.c#L497
> > 
> > Uses SDEV_MEDIA_CHANGE=
> > https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_lib.c#L2318
> > 
> > Uses ERROR=1.
> > https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7f6d8aec5803aac44192f03dce5637b66cda7abf/drivers/input/touchscreen/atmel_mxt_ts.c#1581
> > I'm not a fan because it doesn't signal what the error was.
> > 
> > We could change the DEAD=1 event to maybe ERROR=1?
> 
> "ERROR=1" is worse than "DEAD=1" :(
> 
> > Also where would be a good place to document this?
> 
> Documentation/ABI/ is a good start.
I'll add something to Documentation/ABI/testing/xhci-uevent
> 
> > Also thanks for the review. This is my first kernel patch so forgive me
> > if I get the workflow wrong.
> > 
> > Changes in v2:
> > - Check that the root hub still exists before sending the uevent.
> > - Ensure died_work has completed before deallocating.
> > 
> >  drivers/usb/core/hcd.c  | 32 ++++++++++++++++++++++++++++++++
> >  include/linux/usb/hcd.h |  1 +
> >  2 files changed, 33 insertions(+)
> > 
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 975d7c1288e3..7835f1a3647d 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -2343,6 +2343,27 @@ int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg)
> >  	return status;
> >  }
> >  
> > +
> > +/**
> > + * hcd_died_work - Workqueue routine for root-hub has died.
> > + * @hcd: primary host controller for this root hub.
> > + *
> > + * Do not call with the shared_hcd.
> > + * */
> 
> No need for kerneldoc fortting for a static function.
> 
> And your documentation isn't even correct, @hcd is not an argument to
> this function :(
Oops! Apparently I never updated the documentation when I refactored
this to use a worker.
> 
> > +static void hcd_died_work(struct work_struct *work)
> > +{
> > +	struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work);
> > +
> > +	mutex_lock(&usb_bus_idr_lock);
> 
> Why do you need to lock something that is "dead"?  And why is the idr
> lock the correct one here?
We need to ensure that root_hub is not null. Though I'm not sure the
lock is entirely necessary in this case. usb_remove_hcd stops the work
item before it sets the rhdev to null. The reason I picked
usb_bus_idr_lock was because it's the same lock that usb_remove_hcd uses
when setting rhdev = NULL.

Alan, what do you think? Should I remove the lock?
> 
> > +
> > +	if (hcd->self.root_hub)
> > +		/* Notify user space that the host controller has died */
> > +		kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_CHANGE,
> > +				   (char *[]){ "DEAD=1", NULL });
> 
> declaring the envp in the function is cute, but please don't do that,
> make it obvious what is happening here with a real variable.
> 
I can do that.
> > +
> > +	mutex_unlock(&usb_bus_idr_lock);
> > +}
> > +
> >  /* Workqueue routine for root-hub remote wakeup */
> >  static void hcd_resume_work(struct work_struct *work)
> >  {
> > @@ -2488,6 +2509,13 @@ void usb_hc_died (struct usb_hcd *hcd)
> >  			usb_kick_hub_wq(hcd->self.root_hub);
> >  		}
> >  	}
> > +
> > +	/* Handle the case where this function gets called with a shared HCD */
> > +	if (usb_hcd_is_primary_hcd(hcd))
> > +		schedule_work(&hcd->died_work);
> > +	else
> > +		schedule_work(&hcd->primary_hcd->died_work);
> > +
> >  	spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
> >  	/* Make sure that the other roothub is also deallocated. */
> >  }
> > @@ -2555,6 +2583,8 @@ struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
> >  	INIT_WORK(&hcd->wakeup_work, hcd_resume_work);
> >  #endif
> >  
> > +	INIT_WORK(&hcd->died_work, hcd_died_work);
> > +
> >  	hcd->driver = driver;
> >  	hcd->speed = driver->flags & HCD_MASK;
> >  	hcd->product_desc = (driver->product_desc) ? driver->product_desc :
> > @@ -2908,6 +2938,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
> >  #ifdef CONFIG_PM
> >  	cancel_work_sync(&hcd->wakeup_work);
> >  #endif
> > +	cancel_work_sync(&hcd->died_work);
> >  	mutex_lock(&usb_bus_idr_lock);
> >  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> >  	mutex_unlock(&usb_bus_idr_lock);
> > @@ -2968,6 +2999,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
> >  #ifdef CONFIG_PM
> >  	cancel_work_sync(&hcd->wakeup_work);
> >  #endif
> > +	cancel_work_sync(&hcd->died_work);
> >  
> >  	mutex_lock(&usb_bus_idr_lock);
> >  	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
> > diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
> > index 695931b03684..ae51d5bd1dfc 100644
> > --- a/include/linux/usb/hcd.h
> > +++ b/include/linux/usb/hcd.h
> > @@ -98,6 +98,7 @@ struct usb_hcd {
> >  #ifdef CONFIG_PM
> >  	struct work_struct	wakeup_work;	/* for remote wakeup */
> >  #endif
> > +	struct work_struct	died_work;	/* for dying */
> 
> "For when the device dies"?
Will do
> 
> And have you ever hit this in the real world?  If so, what can you do
> about it?

So I've encountered this in two real world situations:
1) There is a firmware bug in the AMD Device 7812 xHC where if a device
is disconnected during a SET_ADDRESS command, the controller will lock
up. The xHCI spec says that when SET_ADDRESS times out, the Command
Abort bit should be asserted to cancel the command.
Since the firmware doesn't respond to the CA bit in this situation,
the command abort times out and the controller is deemed dead. The only
way to get out of this situation is to reboot :( I have tried HCRST, but
that is ignored as well.

2) My workstation using a Lewisburg USB 3.0 xHCI Controller.
We have some USB devices we use for debugging and testing our DUTs.
Ever since upgrading our workstations from 4.18 to 4.19 we get the
following error:
xhci_hcd 0000:00:14.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
xhci_hcd 0000:00:14.0: ep deq seg = 000000000552b658, deq ptr = 000000004895b2f8
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:14.0] fault addr 0 [fault reason 06] PTE Read access is not set
xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
xhci_hcd 0000:00:14.0: HC died; cleaning up

In this situation we also need to reboot. I haven't spent anytime trying
to track down the problem.

> 
> thanks,

Thanks for the through review and comments. I'll get a new patch
uploaded soon.
> 
> greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-17 18:24       ` Alan Stern
  0 siblings, 0 replies; 12+ messages in thread
From: Alan Stern @ 2019-04-17 18:24 UTC (permalink / raw)
  To: Raul Rangel
  Cc: Greg Kroah-Hartman, linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Dmitry Torokhov,
	Suwan Kim, linux-kernel, Gustavo A. R. Silva, Miquel Raynal,
	Johan Hovold, Mathias Nyman

On Wed, 17 Apr 2019, Raul Rangel wrote:

> > > Also where would be a good place to document this?
> > 
> > Documentation/ABI/ is a good start.
> I'll add something to Documentation/ABI/testing/xhci-uevent

Your patch will apply to all host controllers, not just xhci.  The
documentation filename should reflect this.  Perhaps "usb-uevent"?

> > Why do you need to lock something that is "dead"?  And why is the idr
> > lock the correct one here?
> We need to ensure that root_hub is not null. Though I'm not sure the
> lock is entirely necessary in this case. usb_remove_hcd stops the work
> item before it sets the rhdev to null. The reason I picked
> usb_bus_idr_lock was because it's the same lock that usb_remove_hcd uses
> when setting rhdev = NULL.
> 
> Alan, what do you think? Should I remove the lock?

You're both right; the lock isn't needed because the work is stopped
before the root hub gets removed.  Acquiring the lock doesn't do any
harm, but it isn't needed so you probably should remove it.  In fact, 
you don't even need to test for whether hcd->self.root_hub is 
non-NULL.

Alan Stern



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-17 18:24       ` Alan Stern
  0 siblings, 0 replies; 12+ messages in thread
From: Alan Stern @ 2019-04-17 18:24 UTC (permalink / raw)
  To: Raul Rangel
  Cc: Greg Kroah-Hartman, linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Dmitry Torokhov,
	Suwan Kim, linux-kernel, Gustavo A. R. Silva, Miquel Raynal,
	Johan Hovold, Mathias Nyman

On Wed, 17 Apr 2019, Raul Rangel wrote:

> > > Also where would be a good place to document this?
> > 
> > Documentation/ABI/ is a good start.
> I'll add something to Documentation/ABI/testing/xhci-uevent

Your patch will apply to all host controllers, not just xhci.  The
documentation filename should reflect this.  Perhaps "usb-uevent"?

> > Why do you need to lock something that is "dead"?  And why is the idr
> > lock the correct one here?
> We need to ensure that root_hub is not null. Though I'm not sure the
> lock is entirely necessary in this case. usb_remove_hcd stops the work
> item before it sets the rhdev to null. The reason I picked
> usb_bus_idr_lock was because it's the same lock that usb_remove_hcd uses
> when setting rhdev = NULL.
> 
> Alan, what do you think? Should I remove the lock?

You're both right; the lock isn't needed because the work is stopped
before the root hub gets removed.  Acquiring the lock doesn't do any
harm, but it isn't needed so you probably should remove it.  In fact, 
you don't even need to test for whether hcd->self.root_hub is 
non-NULL.

Alan Stern

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-17 18:31         ` Raul E Rangel
  0 siblings, 0 replies; 12+ messages in thread
From: Raul Rangel @ 2019-04-17 18:31 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg Kroah-Hartman, linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Dmitry Torokhov,
	Suwan Kim, linux-kernel, Gustavo A. R. Silva, Miquel Raynal,
	Johan Hovold, Mathias Nyman, Raul Rangel

On Wed, Apr 17, 2019 at 02:24:06PM -0400, Alan Stern wrote:
> On Wed, 17 Apr 2019, Raul Rangel wrote:
> 
> > > > Also where would be a good place to document this?
> > > 
> > > Documentation/ABI/ is a good start.
> > I'll add something to Documentation/ABI/testing/xhci-uevent
> 
> Your patch will apply to all host controllers, not just xhci.  The
> documentation filename should reflect this.  Perhaps "usb-uevent"?
I realized that after I had sent the email :) I'll do usb-uevent.
> 
> > > Why do you need to lock something that is "dead"?  And why is the idr
> > > lock the correct one here?
> > We need to ensure that root_hub is not null. Though I'm not sure the
> > lock is entirely necessary in this case. usb_remove_hcd stops the work
> > item before it sets the rhdev to null. The reason I picked
> > usb_bus_idr_lock was because it's the same lock that usb_remove_hcd uses
> > when setting rhdev = NULL.
> > 
> > Alan, what do you think? Should I remove the lock?
> 
> You're both right; the lock isn't needed because the work is stopped
> before the root hub gets removed.  Acquiring the lock doesn't do any
> harm, but it isn't needed so you probably should remove it.  In fact, 
> you don't even need to test for whether hcd->self.root_hub is 
> non-NULL.
Sounds good, I'll clean it up.
> 
> Alan Stern
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [v2] usb/hcd: Send a uevent signaling that the host controller has died
@ 2019-04-17 18:31         ` Raul E Rangel
  0 siblings, 0 replies; 12+ messages in thread
From: Raul E Rangel @ 2019-04-17 18:31 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg Kroah-Hartman, linux-usb, groeck, oneukum, djkurtz, zwisler,
	Sebastian Andrzej Siewior, Martin Blumenstingl, Dmitry Torokhov,
	Suwan Kim, linux-kernel, Gustavo A. R. Silva, Miquel Raynal,
	Johan Hovold, Mathias Nyman, Raul Rangel

On Wed, Apr 17, 2019 at 02:24:06PM -0400, Alan Stern wrote:
> On Wed, 17 Apr 2019, Raul Rangel wrote:
> 
> > > > Also where would be a good place to document this?
> > > 
> > > Documentation/ABI/ is a good start.
> > I'll add something to Documentation/ABI/testing/xhci-uevent
> 
> Your patch will apply to all host controllers, not just xhci.  The
> documentation filename should reflect this.  Perhaps "usb-uevent"?
I realized that after I had sent the email :) I'll do usb-uevent.
> 
> > > Why do you need to lock something that is "dead"?  And why is the idr
> > > lock the correct one here?
> > We need to ensure that root_hub is not null. Though I'm not sure the
> > lock is entirely necessary in this case. usb_remove_hcd stops the work
> > item before it sets the rhdev to null. The reason I picked
> > usb_bus_idr_lock was because it's the same lock that usb_remove_hcd uses
> > when setting rhdev = NULL.
> > 
> > Alan, what do you think? Should I remove the lock?
> 
> You're both right; the lock isn't needed because the work is stopped
> before the root hub gets removed.  Acquiring the lock doesn't do any
> harm, but it isn't needed so you probably should remove it.  In fact, 
> you don't even need to test for whether hcd->self.root_hub is 
> non-NULL.
Sounds good, I'll clean it up.
> 
> Alan Stern
> 
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-04-17 18:32 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-11 18:52 [PATCH v2] usb/hcd: Send a uevent signaling that the host controller has died Raul E Rangel
2019-04-11 18:52 ` [v2] " Raul E Rangel
2019-04-15 15:56 ` [PATCH v2] " Alan Stern
2019-04-15 15:56   ` [v2] " Alan Stern
2019-04-16  9:54 ` [PATCH v2] " Greg Kroah-Hartman
2019-04-16  9:54   ` [v2] " Greg Kroah-Hartman
2019-04-17 17:53   ` [PATCH v2] " Raul Rangel
2019-04-17 17:53     ` [v2] " Raul E Rangel
2019-04-17 18:24     ` [PATCH v2] " Alan Stern
2019-04-17 18:24       ` [v2] " Alan Stern
2019-04-17 18:31       ` [PATCH v2] " Raul Rangel
2019-04-17 18:31         ` [v2] " Raul E Rangel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.