linux-serial.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready
@ 2024-05-02  4:02 Weifeng Liu
  2024-05-02  4:02 ` [RFC PATCH 1/2] platform/surface: aggregator: Defer probing when serdev " Weifeng Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Weifeng Liu @ 2024-05-02  4:02 UTC (permalink / raw)
  To: platform-driver-x86, linux-serial
  Cc: Weifeng Liu, Maximilian Luz, Hans de Goede

Greetings,

This series is intended to remedy a race condition where surface
aggregator module (SAM) which is a serdev driver could fail to probe if
the underlying UART port is not ready to open.  In such circumstance,
invoking serdev_device_open() gets errno -ENXIO, leading to failure in
probing of SAM.  However, if the probe is retried in a short delay,
serdev_device_open() would work as expected and everything just goes
fine.  As a workaround, adding the serial driver (8250_dw) into
initramfs or building it into the kernel image significantly mitigates
the likelihood of encountering this race condition, as in this way the
serial device would be initialized much earlier than probing of SAM.

However, IMO we should reliably avoid this sort of race condition.  A
good way is returning -EPROBE_DEFER when serdev_device_open returns
-ENXIO so that the kernel will be able to retry the probing later.  This
is what the first patch tries to do.

Though this solution might be a good enough solution for this specific
issue, I am wondering why this kind of race condition could ever happen,
i.e., why a serdes device could be not ready yet at the moment the
serdev driver gets called and tries to bind it.  And even if this is an
expected behavior how serdev driver works, I do feel it a little bit
weird that we need to identify serdev_device_open() returning -ENXIO as
non-fatal error and thus return -EPROBE_DEFER manually in such case, as
I don't see this sort of behavior in other serdev driver.  Maximilian
and Hans suggested discussing the root cause of the race condition here.
I will be grateful if you could provide some reasoning and insights on
this.

Following is the code path when the issue occurs:

	ssam_serial_hub_probe()
	serdev_device_open()
	ctrl->ops->open()	/* this callback being ttyport_open() */
	tty->ops->open()	/* this callback being uart_open() */
	tty_port_open()
	port->ops->activate()	/* this callback being uart_port_activate() */
	Find bit UPF_DEAD is set in uport->flags and fail with errno -ENXIO.

I notice that flag UPF_DEAD would be set in serial_core_register_port()
during calling serial_core_add_one_port() but don't have much idea
what's going on under the hood.

Additionally, add logs to the probe procedure of SAM in the second
patch, hoping it could help provide context to user when something
miserable happens.

Context of this series is available in [1].

Best regards,
Weifeng

[1] https://github.com/linux-surface/kernel/pull/152

Weifeng Liu (2):
  platform/surface: aggregator: Defer probing when serdev is not ready
  platform/surface: aggregator: Log critical errors during SAM probing

 drivers/platform/surface/aggregator/core.c | 39 ++++++++++++++++++----
 1 file changed, 32 insertions(+), 7 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 1/2] platform/surface: aggregator: Defer probing when serdev is not ready
  2024-05-02  4:02 [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Weifeng Liu
@ 2024-05-02  4:02 ` Weifeng Liu
  2024-05-02 15:19   ` Andy Shevchenko
  2024-05-02  4:02 ` [RFC PATCH 2/2] platform/surface: aggregator: Log critical errors during SAM probing Weifeng Liu
  2024-05-02  8:38 ` [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Hans de Goede
  2 siblings, 1 reply; 7+ messages in thread
From: Weifeng Liu @ 2024-05-02  4:02 UTC (permalink / raw)
  To: platform-driver-x86, linux-serial
  Cc: Weifeng Liu, Maximilian Luz, Hans de Goede

This is an attempt to alleviate race conditions in the SAM driver where
essential resources like serial device and GPIO pins are not ready at
the time ssam_serial_hub_probe() is called.  Instead of giving up
probing, a better way would be to defer the probing by returning
-EPROBE_DEFER, allowing the kernel try again later.

However, there is no way of identifying all such cases from other real
errors in a few days.  So let's take a gradual approach identify and
address these cases as they arise.  This commit marks the initial step
in this process.

Suggested-by: Maximilian Luz <luzmaximilian@gmail.com>
Signed-off-by: Weifeng Liu <weifeng.liu.z@gmail.com>
---
 drivers/platform/surface/aggregator/core.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/platform/surface/aggregator/core.c b/drivers/platform/surface/aggregator/core.c
index 9591a28bc38a..72a521dd729c 100644
--- a/drivers/platform/surface/aggregator/core.c
+++ b/drivers/platform/surface/aggregator/core.c
@@ -645,9 +645,20 @@ static int ssam_serial_hub_probe(struct serdev_device *serdev)
 	/* Set up serdev device. */
 	serdev_device_set_drvdata(serdev, ctrl);
 	serdev_device_set_client_ops(serdev, &ssam_serdev_ops);
+
+	/* The following step can fail when it's called too early before the
+	 * underlying uart device is ready (in this case -ENXIO is returned).
+	 * Instead of simply giving up and losing everything, we can defer
+	 * the probing by returning -EPROBE_DEFER so that the kernel would be
+	 * able to retry later. */
 	status = serdev_device_open(serdev);
-	if (status)
+	if (status == -ENXIO)
+		status = -EPROBE_DEFER;
+	if (status) {
+		dev_err_probe(&serdev->dev, status,
+			      "failed to open serdev device\n");
 		goto err_devopen;
+	}
 
 	astatus = ssam_serdev_setup_via_acpi(ssh->handle, serdev);
 	if (ACPI_FAILURE(astatus)) {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH 2/2] platform/surface: aggregator: Log critical errors during SAM probing
  2024-05-02  4:02 [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Weifeng Liu
  2024-05-02  4:02 ` [RFC PATCH 1/2] platform/surface: aggregator: Defer probing when serdev " Weifeng Liu
@ 2024-05-02  4:02 ` Weifeng Liu
  2024-05-02 15:15   ` Andy Shevchenko
  2024-05-02  8:38 ` [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Hans de Goede
  2 siblings, 1 reply; 7+ messages in thread
From: Weifeng Liu @ 2024-05-02  4:02 UTC (permalink / raw)
  To: platform-driver-x86, linux-serial
  Cc: Weifeng Liu, Maximilian Luz, Hans de Goede

Emits messages upon errors during probing of SAM.  Hopefully this could
provide useful context to user for the purpose of diagnosis when
something miserable happen.

Signed-off-by: Weifeng Liu <weifeng.liu.z@gmail.com>
---
 drivers/platform/surface/aggregator/core.c | 26 +++++++++++++++++-----
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/drivers/platform/surface/aggregator/core.c b/drivers/platform/surface/aggregator/core.c
index 72a521dd729c..4e2e70f4dcf5 100644
--- a/drivers/platform/surface/aggregator/core.c
+++ b/drivers/platform/surface/aggregator/core.c
@@ -623,8 +623,10 @@ static int ssam_serial_hub_probe(struct serdev_device *serdev)
 	acpi_status astatus;
 	int status;
 
-	if (gpiod_count(&serdev->dev, NULL) < 0)
+	if (gpiod_count(&serdev->dev, NULL) < 0) {
+		dev_err(&serdev->dev, "no GPIO found\n");
 		return -ENODEV;
+	}
 
 	status = devm_acpi_dev_add_driver_gpios(&serdev->dev, ssam_acpi_gpios);
 	if (status)
@@ -637,8 +639,11 @@ static int ssam_serial_hub_probe(struct serdev_device *serdev)
 
 	/* Initialize controller. */
 	status = ssam_controller_init(ctrl, serdev);
-	if (status)
+	if (status) {
+		dev_err_probe(&serdev->dev, status,
+			      "failed to initialize ssam controller\n");
 		goto err_ctrl_init;
+	}
 
 	ssam_controller_lock(ctrl);
 
@@ -663,6 +668,7 @@ static int ssam_serial_hub_probe(struct serdev_device *serdev)
 	astatus = ssam_serdev_setup_via_acpi(ssh->handle, serdev);
 	if (ACPI_FAILURE(astatus)) {
 		status = -ENXIO;
+		dev_err(&serdev->dev, "failed to setup serdev\n");
 		goto err_devinit;
 	}
 
@@ -678,16 +684,22 @@ static int ssam_serial_hub_probe(struct serdev_device *serdev)
 	 * states.
 	 */
 	status = ssam_log_firmware_version(ctrl);
-	if (status)
+	if (status) {
+		dev_err(&serdev->dev, "failed to get firmware version\n");
 		goto err_initrq;
+	}
 
 	status = ssam_ctrl_notif_d0_entry(ctrl);
-	if (status)
+	if (status) {
+		dev_err(&serdev->dev, "failed to notify EC of entry of D0\n");
 		goto err_initrq;
+	}
 
 	status = ssam_ctrl_notif_display_on(ctrl);
-	if (status)
+	if (status) {
+		dev_err(&serdev->dev, "failed to notify EC of display on\n");
 		goto err_initrq;
+	}
 
 	status = sysfs_create_group(&serdev->dev.kobj, &ssam_sam_group);
 	if (status)
@@ -695,8 +707,10 @@ static int ssam_serial_hub_probe(struct serdev_device *serdev)
 
 	/* Set up IRQ. */
 	status = ssam_irq_setup(ctrl);
-	if (status)
+	if (status) {
+		dev_err_probe(&serdev->dev, status, "failed to setup IRQ\n");
 		goto err_irq;
+	}
 
 	/* Finally, set main controller reference. */
 	status = ssam_try_set_controller(ctrl);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready
  2024-05-02  4:02 [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Weifeng Liu
  2024-05-02  4:02 ` [RFC PATCH 1/2] platform/surface: aggregator: Defer probing when serdev " Weifeng Liu
  2024-05-02  4:02 ` [RFC PATCH 2/2] platform/surface: aggregator: Log critical errors during SAM probing Weifeng Liu
@ 2024-05-02  8:38 ` Hans de Goede
  2024-05-02 20:03   ` Maximilian Luz
  2 siblings, 1 reply; 7+ messages in thread
From: Hans de Goede @ 2024-05-02  8:38 UTC (permalink / raw)
  To: Weifeng Liu, platform-driver-x86, linux-serial; +Cc: Maximilian Luz

Hi Weifeng,

On 5/2/24 6:02 AM, Weifeng Liu wrote:
> Greetings,
> 
> This series is intended to remedy a race condition where surface
> aggregator module (SAM) which is a serdev driver could fail to probe if
> the underlying UART port is not ready to open.  In such circumstance,
> invoking serdev_device_open() gets errno -ENXIO, leading to failure in
> probing of SAM.  However, if the probe is retried in a short delay,
> serdev_device_open() would work as expected and everything just goes
> fine.  As a workaround, adding the serial driver (8250_dw) into
> initramfs or building it into the kernel image significantly mitigates
> the likelihood of encountering this race condition, as in this way the
> serial device would be initialized much earlier than probing of SAM.
> 
> However, IMO we should reliably avoid this sort of race condition.  A
> good way is returning -EPROBE_DEFER when serdev_device_open returns
> -ENXIO so that the kernel will be able to retry the probing later.  This
> is what the first patch tries to do.
> 
> Though this solution might be a good enough solution for this specific
> issue, I am wondering why this kind of race condition could ever happen,
> i.e., why a serdes device could be not ready yet at the moment the
> serdev driver gets called and tries to bind it.  And even if this is an
> expected behavior how serdev driver works, I do feel it a little bit
> weird that we need to identify serdev_device_open() returning -ENXIO as
> non-fatal error and thus return -EPROBE_DEFER manually in such case, as
> I don't see this sort of behavior in other serdev driver.  Maximilian
> and Hans suggested discussing the root cause of the race condition here.
> I will be grateful if you could provide some reasoning and insights on
> this.

Ack, I have no objection against the changes and if Maximilian is ok with
it I can merge these right away as an interim fix, but I would really
like to know why the serdev core / tty code is registering a serdev
device for a serial port before it is ready to have serdev_device_open()
called on it. To me it seems that the root cause is in somewhere in
the 8250_dw code or the serdev core code.

Resources sometimes not being ready is sometimes which drivers generally
speaking need to handle, but in this case the resource which is not
ready is the device the driver is binding to, so it seems that
the device is registered too soon.

If someone familiar with the serial / serdev code can provide some
insight here that would be great.

Regards,

Hans




> 
> Following is the code path when the issue occurs:
> 
> 	ssam_serial_hub_probe()
> 	serdev_device_open()
> 	ctrl->ops->open()	/* this callback being ttyport_open() */
> 	tty->ops->open()	/* this callback being uart_open() */
> 	tty_port_open()
> 	port->ops->activate()	/* this callback being uart_port_activate() */
> 	Find bit UPF_DEAD is set in uport->flags and fail with errno -ENXIO.
> 
> I notice that flag UPF_DEAD would be set in serial_core_register_port()
> during calling serial_core_add_one_port() but don't have much idea
> what's going on under the hood.
> 
> Additionally, add logs to the probe procedure of SAM in the second
> patch, hoping it could help provide context to user when something
> miserable happens.
> 
> Context of this series is available in [1].
> 
> Best regards,
> Weifeng
> 
> [1] https://github.com/linux-surface/kernel/pull/152
> 
> Weifeng Liu (2):
>   platform/surface: aggregator: Defer probing when serdev is not ready
>   platform/surface: aggregator: Log critical errors during SAM probing
> 
>  drivers/platform/surface/aggregator/core.c | 39 ++++++++++++++++++----
>  1 file changed, 32 insertions(+), 7 deletions(-)
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 2/2] platform/surface: aggregator: Log critical errors during SAM probing
  2024-05-02  4:02 ` [RFC PATCH 2/2] platform/surface: aggregator: Log critical errors during SAM probing Weifeng Liu
@ 2024-05-02 15:15   ` Andy Shevchenko
  0 siblings, 0 replies; 7+ messages in thread
From: Andy Shevchenko @ 2024-05-02 15:15 UTC (permalink / raw)
  To: Weifeng Liu
  Cc: platform-driver-x86, linux-serial, Maximilian Luz, Hans de Goede

On Thu, May 02, 2024 at 12:02:47PM +0800, Weifeng Liu wrote:
> Emits messages upon errors during probing of SAM.  Hopefully this could
> provide useful context to user for the purpose of diagnosis when
> something miserable happen.

...

> -	if (gpiod_count(&serdev->dev, NULL) < 0)
> +	if (gpiod_count(&serdev->dev, NULL) < 0) {
> +		dev_err(&serdev->dev, "no GPIO found\n");
>  		return -ENODEV;

Why not

	return dev_err_probe(...);

?

Also while at it, unshadow the error code

	status = gpiod_count(&serdev->dev, NULL);
	if (status < 0)
		return dev_err_probe(..., status, ...);

> +	}

...

>  	astatus = ssam_serdev_setup_via_acpi(ssh->handle, serdev);
>  	if (ACPI_FAILURE(astatus)) {
>  		status = -ENXIO;
> +		dev_err(&serdev->dev, "failed to setup serdev\n");

In the similar way

		status = dev_err(&serdev->dev, -ENXIO, "failed to setup serdev\n");

>  		goto err_devinit;
>  	}

...

>  	status = ssam_log_firmware_version(ctrl);
> -	if (status)
> +	if (status) {
> +		dev_err(&serdev->dev, "failed to get firmware version\n");
>  		goto err_initrq;

Use dev_err_probe() everywhere for the sake of uniform output format.

> +	}

...

>  	status = ssam_ctrl_notif_d0_entry(ctrl);
> -	if (status)
> +	if (status) {
> +		dev_err(&serdev->dev, "failed to notify EC of entry of D0\n");
>  		goto err_initrq;

Ditto.

> +	}
>  
>  	status = ssam_ctrl_notif_display_on(ctrl);
> -	if (status)
> +	if (status) {
> +		dev_err(&serdev->dev, "failed to notify EC of display on\n");
>  		goto err_initrq;

Ditto.

> +	}

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 1/2] platform/surface: aggregator: Defer probing when serdev is not ready
  2024-05-02  4:02 ` [RFC PATCH 1/2] platform/surface: aggregator: Defer probing when serdev " Weifeng Liu
@ 2024-05-02 15:19   ` Andy Shevchenko
  0 siblings, 0 replies; 7+ messages in thread
From: Andy Shevchenko @ 2024-05-02 15:19 UTC (permalink / raw)
  To: Weifeng Liu
  Cc: platform-driver-x86, linux-serial, Maximilian Luz, Hans de Goede

On Thu, May 02, 2024 at 12:02:46PM +0800, Weifeng Liu wrote:
> This is an attempt to alleviate race conditions in the SAM driver where
> essential resources like serial device and GPIO pins are not ready at
> the time ssam_serial_hub_probe() is called.  Instead of giving up
> probing, a better way would be to defer the probing by returning
> -EPROBE_DEFER, allowing the kernel try again later.
> 
> However, there is no way of identifying all such cases from other real
> errors in a few days.  So let's take a gradual approach identify and
> address these cases as they arise.  This commit marks the initial step
> in this process.

...

> +	/* The following step can fail when it's called too early before the
> +	 * underlying uart device is ready (in this case -ENXIO is returned).

UART

> +	 * Instead of simply giving up and losing everything, we can defer
> +	 * the probing by returning -EPROBE_DEFER so that the kernel would be
> +	 * able to retry later. */

/*
 * Use correct style for the comments.
 * Here is the example.
 */

>  	status = serdev_device_open(serdev);
> -	if (status)
> +	if (status == -ENXIO)
> +		status = -EPROBE_DEFER;
> +	if (status) {
> +		dev_err_probe(&serdev->dev, status,
> +			      "failed to open serdev device\n");
>  		goto err_devopen;
> +	}

...

Hans, not sure if it helps, but we added similar into I²C and SPI code.

2dea645ffc21 ("i2c: acpi: Return error pointers from i2c_acpi_new_device()")

9c22ec4ac27b ("spi: Return deferred probe error when controller isn't yet available")

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready
  2024-05-02  8:38 ` [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Hans de Goede
@ 2024-05-02 20:03   ` Maximilian Luz
  0 siblings, 0 replies; 7+ messages in thread
From: Maximilian Luz @ 2024-05-02 20:03 UTC (permalink / raw)
  To: Hans de Goede, Weifeng Liu, platform-driver-x86, linux-serial

Hi,

On 5/2/24 10:38 AM, Hans de Goede wrote:
> Hi Weifeng,
> 
> On 5/2/24 6:02 AM, Weifeng Liu wrote:
>> Greetings,
>>
>> This series is intended to remedy a race condition where surface
>> aggregator module (SAM) which is a serdev driver could fail to probe if
>> the underlying UART port is not ready to open.  In such circumstance,
>> invoking serdev_device_open() gets errno -ENXIO, leading to failure in
>> probing of SAM.  However, if the probe is retried in a short delay,
>> serdev_device_open() would work as expected and everything just goes
>> fine.  As a workaround, adding the serial driver (8250_dw) into
>> initramfs or building it into the kernel image significantly mitigates
>> the likelihood of encountering this race condition, as in this way the
>> serial device would be initialized much earlier than probing of SAM.
>>
>> However, IMO we should reliably avoid this sort of race condition.  A
>> good way is returning -EPROBE_DEFER when serdev_device_open returns
>> -ENXIO so that the kernel will be able to retry the probing later.  This
>> is what the first patch tries to do.
>>
>> Though this solution might be a good enough solution for this specific
>> issue, I am wondering why this kind of race condition could ever happen,
>> i.e., why a serdes device could be not ready yet at the moment the
>> serdev driver gets called and tries to bind it.  And even if this is an
>> expected behavior how serdev driver works, I do feel it a little bit
>> weird that we need to identify serdev_device_open() returning -ENXIO as
>> non-fatal error and thus return -EPROBE_DEFER manually in such case, as
>> I don't see this sort of behavior in other serdev driver.  Maximilian
>> and Hans suggested discussing the root cause of the race condition here.
>> I will be grateful if you could provide some reasoning and insights on
>> this.
> 
> Ack, I have no objection against the changes and if Maximilian is ok with
> it I can merge these right away as an interim fix, but I would really
> like to know why the serdev core / tty code is registering a serdev
> device for a serial port before it is ready to have serdev_device_open()
> called on it. To me it seems that the root cause is in somewhere in
> the 8250_dw code or the serdev core code.

I'm fine with this being merged as an interim fix once Andy's comments
have been addressed. Weifeng, in that case you can then also add

Reviewed-by: Maximilian Luz <luzmaximilian@gmail.com>

However, I too would like to see this fixed properly on the serdev side.

> Resources sometimes not being ready is sometimes which drivers generally
> speaking need to handle, but in this case the resource which is not
> ready is the device the driver is binding to, so it seems that
> the device is registered too soon.
> 
> If someone familiar with the serial / serdev code can provide some
> insight here that would be great.

I will be away for the remainder of the month starting end of next week,
but I could try to look into this after that, provided no one else did by
then.

Best regards,
Max

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-05-02 20:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-02  4:02 [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Weifeng Liu
2024-05-02  4:02 ` [RFC PATCH 1/2] platform/surface: aggregator: Defer probing when serdev " Weifeng Liu
2024-05-02 15:19   ` Andy Shevchenko
2024-05-02  4:02 ` [RFC PATCH 2/2] platform/surface: aggregator: Log critical errors during SAM probing Weifeng Liu
2024-05-02 15:15   ` Andy Shevchenko
2024-05-02  8:38 ` [RFC PATCH 0/2] Defer probing of SAM if serdev device is not ready Hans de Goede
2024-05-02 20:03   ` Maximilian Luz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).