All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V1 0/1] hci_ldisc: Use rwlocking to avoid closing proto races
@ 2017-05-05 15:27 Dean Jenkins
  2017-05-05 15:27 ` [PATCH V1 1/1] Bluetooth: " Dean Jenkins
  2017-05-05 15:55 ` [PATCH V1 0/1] " Marcel Holtmann
  0 siblings, 2 replies; 3+ messages in thread
From: Dean Jenkins @ 2017-05-05 15:27 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Dean Jenkins, Gustavo F . Padovan, Johan Hedberg, linux-bluetooth

Hi Marcel,

This is my 3rd patchset in my series of resolving a design flaw in
hci_uart_tty_close() related to the proper closure of the Data Link protocol
layer which can result in a kernel crash.

This patchset contains a single patch that adds rwlocking around the "proto"
function pointers. This provides a full fix to prevent a potential kernel crash
triggered by hu->proto->close() running before or during the execution of a
non-close "proto" function pointer.


Previous Discussions
====================

Please refer to the discussion on my 2nd patchset V1 that can be found here:
https://www.spinics.net/lists/linux-bluetooth/msg70370.html
You accepted the 3 patches of my patchset of commits namely:
  Bluetooth: hci_ldisc: Add protocol check to hci_uart_send_frame()
  Bluetooth: hci_ldisc: Add protocol check to hci_uart_dequeue()
  Bluetooth: hci_ldisc: Add protocol check to hci_uart_tx_wakeup()

 drivers/bluetooth/hci_ldisc.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Please refer to the discussion on my 1st patchset V3 that can be found here:
https://www.spinics.net/lists/linux-bluetooth/msg70315.html
You accepted the 3 patches of my patchset of commits namely:
  Bluetooth: hci_ldisc: Add missing return in hci_uart_init_work()
  Bluetooth: hci_ldisc: Ensure hu->hdev set to NULL before freeing hdev
  Bluetooth: hci_ldisc: Add missing clear HCI_UART_PROTO_READY

 drivers/bluetooth/hci_ldisc.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Please refer to the discussion on my 1st patchset V2 that can be found here:
https://www.spinics.net/lists/linux-bluetooth/msg70183.html

Please refer to the discussion on my 1st RFC patchset V1 that can be found here:
https://www.spinics.net/lists/linux-bluetooth/msg70077.html


Analysis
========

There are multiple threads that use "proto" function pointers that potentially
can be running concurrently with the thread of hci_uart_tty_close().

The threads include (not a complete list):

a) hci_uart_write_work() [hci_ldisc.c]
   which is executed by the work queue hu->write_work which creates the
   callstack:
   hci_uart_write_work()
   hci_uart_dequeue()
   hu->proto->dequeue()

b) hci_tx_work() [hci_core.c]
   which is executed by the work queue hdev->tx_work which creates the
   callstack:
   hci_tx_work()
   hci_send_frame()
   hdev->send()
   hci_uart_send_frame()
   hu->proto->enqueue()
   then calls
   hci_uart_tx_wakeup()
   schedule_work(&hu->write_work)
   triggering a)

c) hci_cmd_work() [hci_core.c]
   which is executed by the work queue hdev->cmd_work which creates the
   callstack:
   hci_cmd_work()
   hci_send_frame()
   hdev->send()
   hci_uart_send_frame()
   hu->proto->enqueue()
   then calls
   hci_uart_tx_wakeup()
   schedule_work(&hu->write_work)
   triggering a)

d) hci_dev_close() [hci_core.c]
   creates the callstack:
   hci_dev_close()
   hci_dev_do_close()
   hdev->flush()
   hci_uart_flush()
   hu->proto->flush()

e) When BCSP is used and a retransmission is needed:
   bcsp_timed_event() [hci_bcsp.c]
   creates the callstack:
   bcsp_timed_event()
   hci_uart_tx_wakeup()
   schedule_work(&hu->write_work)
   triggering a)


hci_ldisc.c: hci_uart_tty_close() currently has the construction:

	cancel_work_sync(&hu->write_work);

	if (test_and_clear_bit(HCI_UART_PROTO_READY, &hu->flags)) {
		if (hdev) {
			if (test_bit(HCI_UART_REGISTERED, &hu->flags))
				hci_unregister_dev(hdev);
			hci_free_dev(hdev);
		}
		hu->proto->close(hu);
	}

At first glance, the cancel_work_sync(&hu->write_work) would seem to inhibit
hci_uart_write_work() from running. Actually, the call to cancel_work_sync()
can be ineffective because:

a) hci_unregister_dev() calls hci_dev_do_close() and if HCI_UP flag is in the
   set state then it is possible for hci_dev_do_close() to send a HCI command.
   Such as sending a HCI RESET command which causes hci_send_frame() to schedule
   hci_uart_write_work() after the cancel_work_sync() has executed.

b) In the case of BCSP retransmissions, the callstack in e) runs periodically
   every 250ms. This can occur before and after the cancel_work_sync() executed.


The call to hu->proto->close(hu) is partially protected by the flag
HCI_UART_PROTO_READY but there is a small race condition. A race condition
exists because hci_uart_tty_close() can clear HCI_UART_PROTO_READY just after
the concurrent thread(s) detected the flag as set. The race potentially allows
the concurrent thread(s) to run a "proto" function pointer after or during
hci_uart_tty_close() is calling hu->proto->close(hu) which could trigger a
kernel crash.

Bear in mind that on systems with low resources, such as embedded systems, that
threads can get delayed by interrupts and pre-emption. This means that a
concurrent thread might sleep after it checks the HCI_UART_PROTO_READY flag.
Therefore, just using the flag HCI_UART_PROTO_READY is not 100% safe from
introducing race conditions.

Note that the elapsed run-time for hci_unregister_dev() is dependent on a
number of scenarios. This means hci_unregister_dev() can quickly return or take
up to 10 seconds (guess) to return. This means that the elapsed time between
clearing HCI_UART_PROTO_READY and executing hu->proto->close(hu) is variable
and therefore there is a variable risk in triggering the kernel crash.


Solution implemented by this patchset
=====================================

Modify hci_ldisc.c: hci_uart_tty_close() to use a rwlock solution:

	cancel_work_sync(&hu->write_work);

	if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
		write_lock_irqsave(&hu->proto_lock, flags);
		clear_bit(HCI_UART_PROTO_READY, &hu->flags);
		write_unlock_irqrestore(&hu->proto_lock, flags);

		if (hdev) {
			if (test_bit(HCI_UART_REGISTERED, &hu->flags))
				hci_unregister_dev(hdev);
			hci_free_dev(hdev);
		}
		hu->proto->close(hu);
	}
 
The call to write_lock_irqsave(&hu->proto_lock, flags) will block when a
concurrent thread has called read_lock(&hu->proto_lock). This ensures that
the flag HCI_UART_PROTO_READY is only cleared when no "proto" function pointers
are running in the concurrent threads. Therefore, it is not possible for
any "proto" function pointers to be running immediately before, during or after
hu->proto->close(hu) is called. This means the race condition is avoided.

Note that this solution will inhibit the Data Link layer protocol from
independently retransmitting because hu->proto->enqueue() cannot be called after
HCI_UART_PROTO_READY has been put into the clear state.


Test results from running the patchset
======================================

A x86 64-bit test kernel was created based on kernel v4.11.0 plus the 7 patches:

3rd Patchset
  Bluetooth: hci_ldisc: Use rwlocking to avoid closing proto races
2nd Patchset 
  Bluetooth: hci_ldisc: Add protocol check to hci_uart_send_frame()
  Bluetooth: hci_ldisc: Add protocol check to hci_uart_dequeue()
  Bluetooth: hci_ldisc: Add protocol check to hci_uart_tx_wakeup()
1st Patchset 
  Bluetooth: hci_ldisc: Add missing return in hci_uart_init_work()
  Bluetooth: hci_ldisc: Ensure hu->hdev set to NULL before freeing hdev
  Bluetooth: hci_ldisc: Add missing clear HCI_UART_PROTO_READY

The test used btattach with BCSP which fails to establish a link because BCSP
is not fully implemented in the kernel. However, this is ideal in forcing
BCSP retransmissions. btattach was killed after 5 seconds of running to force
hci_uart_tty_close() to execute.

Here is a snippet of a dmesg log with dynamic debug enabled:
[ 1039.415914] hci_uart_tty_close: tty ffff9861c691d800
[ 1039.415917] hci_uart_close: hdev ffff98625b422000
[ 1039.415920] hci_uart_flush: hdev ffff98625b422000 tty ffff9861c691d800
[ 1039.415922] bcsp_flush: hu ffff9861c27bd480
[ 1039.415928] hci_unregister_dev: ffff98625b422000 name hci1 bus 3
[ 1039.418767] bcsp_timed_event: hu ffff9861c27bd480 retransmitting 2 pkts
[ 1039.674810] Bluetooth: hci1 command 0x1001 tx timeout
[ 1039.674827] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 1
[ 1039.674832] hci_send_frame: hci1 type 1 len 3
[ 1039.674836] hci_uart_send_frame: hci1: type 1 len 3
[ 1039.674838] Bluetooth: hci1 sending frame failed (-49)
[ 1041.722773] Bluetooth: hci1 command 0x1009 tx timeout
[ 1041.722811] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 0
[ 1045.818847] hci_uart_close: hdev ffff98625b422000
[ 1045.818850] hci_uart_flush: hdev ffff98625b422000 tty ffff9861c691d800
[ 1045.818868] hci_dev_do_close: hci1 ffff98625b422000
[ 1045.819031] hci_conn_params_clear_all: All LE connection parameters were removed
[ 1045.819036] bcsp_close: hu ffff9861c27bd480

The following can be observed from the results:

1. hci_unregister_dev() ran which means HCI_UART_PROTO_READY has been cleared.

2. bcsp_timed_event() runs after HCI_UART_PROTO_READY has been cleared and
   so hci_uart_tx_wakeup() no longer appears in the log. This inhibits further
   BCSP retransmission attempts.

3. Bluetooth: hci1 command 0x1001 tx timeout
   This occurs because BCSP was unable to successfully transfer the HCI
   command. BCSP fails because the data plane to and from the UART driver is no
   longer working.

4. Bluetooth: hci1 sending frame failed (-49)
   This occurs because HCI_UART_PROTO_READY has been cleared which inhibits
   a new HCI command being sent from hci_cmd_work().

5. Bluetooth: hci1 command 0x1009 tx timeout
   BCSP never took the HCI command but the HCI core still performs the HCI
   command timeout. This means the thread gets blocked for 2 seconds for doing
   nothing.

6. hci_uart_close() runs twice which seems to be unnecessary. This causes
   hci_uart_flush() to run twice but notice the 2nd execution of
   hci_uart_flush() does not run bcsp_flush() because hdev->flush gets set to
   NULL in hci_uart_close(). This suggests a workaround was done in the past
   to prevent a kernel crash.


Outline of potential further work
=================================

With this patchset, the design flaw in hci_uart_tty_close() is further exposed.

It can be seen that hci_uart_tty_close() is systematically preventing
communications with the Bluetooth Radio Module. But hci_uart_tty_close() calls
hci_unregister_dev() which calls hci_dev_do_close(). If the flag HCI_UP is in
the set state then hci_dev_do_close() assumes that good communications with the
Bluetooth Radio Module are still possible. Therefore, there is a conflict in the
design because the design should not be closing the link and trying to use the
link to send HCI commands at the same time. This can trigger the protocol error
handling procedures to occur which are unsuccessful.

In addition, it is suspected that before hci_uart_tty_close() runs, the TTY
layer has cut the data plane between the Data Link protocol and the UART driver.
This means that if hci_uart_tty_close() is rearranged to allow
HCI_UART_PROTO_READY to be cleared after hci_unregister_dev() runs then any HCI
command will still fail to reach the Bluetooth Radio Module.

There are other issues in hci_ldisc.c: hci_uart_tty_close()

	/* Detach from the tty */
	tty->disc_data = NULL;

	if (!hu)
		return;

	hdev = hu->hdev;
	if (hdev)
		hci_uart_close(hdev);

It can be seen that tty->disc_data is set to NULL. This causes some of the HCI
UART LDISC functions to not perform any operation such as the functions:

hci_uart_tty_wakeup()
hci_uart_tty_receive()
hci_uart_tty_ioctl()

Doing "tty->disc_data = NULL" sabotages the communications with the Bluetooth
Radio Module. Therefore, this statement needs to be moved to after the
communications are going to be cleanly closed such as after hci_unregister_dev()
has completed.

Calling hci_uart_close(hdev) is likely to interfere with the communications with
the Bluetooth Radio Module. Therefore, this statement needs to be moved to after
the communications are going to be cleanly closed such as after
hci_unregister_dev() has completed.


Potential solution to the design flaw
=====================================

Suspect that a solution would be to get hci_uart_tty_close() to run
hci_unregister_dev() before tearing down the resources needed by the
Data Link protocol layer.


The patch was rebased onto the bluetooth-next master branch HEAD commit:
2da711a Bluetooth: Skip vendor diagnostic configuration for HCI User Channel

Dean Jenkins (1):
  Bluetooth: hci_ldisc: Use rwlocking to avoid closing proto races

 drivers/bluetooth/hci_ldisc.c | 40 +++++++++++++++++++++++++++++++++++-----
 drivers/bluetooth/hci_uart.h  |  1 +
 2 files changed, 36 insertions(+), 5 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH V1 1/1] Bluetooth: hci_ldisc: Use rwlocking to avoid closing proto races
  2017-05-05 15:27 [PATCH V1 0/1] hci_ldisc: Use rwlocking to avoid closing proto races Dean Jenkins
@ 2017-05-05 15:27 ` Dean Jenkins
  2017-05-05 15:55 ` [PATCH V1 0/1] " Marcel Holtmann
  1 sibling, 0 replies; 3+ messages in thread
From: Dean Jenkins @ 2017-05-05 15:27 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Dean Jenkins, Gustavo F . Padovan, Johan Hedberg, linux-bluetooth

When HCI_UART_PROTO_READY is in the set state, the Data Link protocol
layer (proto) is bound to the HCI UART driver. This state allows the
registered proto function pointers to be used by the HCI UART driver.

When unbinding (closing) the Data Link protocol layer, the proto
function pointers much be prevented from being used immediately before
running the proto close function pointer. Otherwise, there is a risk
that a proto non-close function pointer is used during or after the
proto close function pointer is used. The consequences are likely to
be a kernel crash because the proto close function pointer will free
resources used in the Data Link protocol layer.

Therefore, add a reader writer lock (rwlock) solution to prevent the
close proto function pointer from running by using write_lock_irqsave()
whilst the other proto function pointers are protected using
read_lock(). This means HCI_UART_PROTO_READY can safely be cleared
in the knowledge that no proto function pointers are running.

When flag HCI_UART_PROTO_READY is put into the clear state,
proto close function pointer can safely be run. Note
flag HCI_UART_PROTO_SET being in the set state prevents the proto
open function pointer from being run so there is no race condition
between proto open and close function pointers.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
---
 drivers/bluetooth/hci_ldisc.c | 40 +++++++++++++++++++++++++++++++++++-----
 drivers/bluetooth/hci_uart.h  |  1 +
 2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/drivers/bluetooth/hci_ldisc.c b/drivers/bluetooth/hci_ldisc.c
index 2edd305..8397b71 100644
--- a/drivers/bluetooth/hci_ldisc.c
+++ b/drivers/bluetooth/hci_ldisc.c
@@ -114,8 +114,12 @@ static inline struct sk_buff *hci_uart_dequeue(struct hci_uart *hu)
 	struct sk_buff *skb = hu->tx_skb;
 
 	if (!skb) {
+		read_lock(&hu->proto_lock);
+
 		if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
 			skb = hu->proto->dequeue(hu);
+
+		read_unlock(&hu->proto_lock);
 	} else {
 		hu->tx_skb = NULL;
 	}
@@ -125,18 +129,23 @@ static inline struct sk_buff *hci_uart_dequeue(struct hci_uart *hu)
 
 int hci_uart_tx_wakeup(struct hci_uart *hu)
 {
+	read_lock(&hu->proto_lock);
+
 	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags))
-		return 0;
+		goto no_schedule;
 
 	if (test_and_set_bit(HCI_UART_SENDING, &hu->tx_state)) {
 		set_bit(HCI_UART_TX_WAKEUP, &hu->tx_state);
-		return 0;
+		goto no_schedule;
 	}
 
 	BT_DBG("");
 
 	schedule_work(&hu->write_work);
 
+no_schedule:
+	read_unlock(&hu->proto_lock);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(hci_uart_tx_wakeup);
@@ -237,9 +246,13 @@ static int hci_uart_flush(struct hci_dev *hdev)
 	tty_ldisc_flush(tty);
 	tty_driver_flush_buffer(tty);
 
+	read_lock(&hu->proto_lock);
+
 	if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
 		hu->proto->flush(hu);
 
+	read_unlock(&hu->proto_lock);
+
 	return 0;
 }
 
@@ -261,10 +274,15 @@ static int hci_uart_send_frame(struct hci_dev *hdev, struct sk_buff *skb)
 	BT_DBG("%s: type %d len %d", hdev->name, hci_skb_pkt_type(skb),
 	       skb->len);
 
-	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags))
+	read_lock(&hu->proto_lock);
+
+	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
+		read_unlock(&hu->proto_lock);
 		return -EUNATCH;
+	}
 
 	hu->proto->enqueue(hu, skb);
+	read_unlock(&hu->proto_lock);
 
 	hci_uart_tx_wakeup(hu);
 
@@ -460,6 +478,8 @@ static int hci_uart_tty_open(struct tty_struct *tty)
 	INIT_WORK(&hu->init_ready, hci_uart_init_work);
 	INIT_WORK(&hu->write_work, hci_uart_write_work);
 
+	rwlock_init(&hu->proto_lock);
+
 	/* Flush any pending characters in the driver */
 	tty_driver_flush_buffer(tty);
 
@@ -475,6 +495,7 @@ static void hci_uart_tty_close(struct tty_struct *tty)
 {
 	struct hci_uart *hu = tty->disc_data;
 	struct hci_dev *hdev;
+	unsigned long flags;
 
 	BT_DBG("tty %p", tty);
 
@@ -490,7 +511,11 @@ static void hci_uart_tty_close(struct tty_struct *tty)
 
 	cancel_work_sync(&hu->write_work);
 
-	if (test_and_clear_bit(HCI_UART_PROTO_READY, &hu->flags)) {
+	if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
+		write_lock_irqsave(&hu->proto_lock, flags);
+		clear_bit(HCI_UART_PROTO_READY, &hu->flags);
+		write_unlock_irqrestore(&hu->proto_lock, flags);
+
 		if (hdev) {
 			if (test_bit(HCI_UART_REGISTERED, &hu->flags))
 				hci_unregister_dev(hdev);
@@ -549,13 +574,18 @@ static void hci_uart_tty_receive(struct tty_struct *tty, const u8 *data,
 	if (!hu || tty != hu->tty)
 		return;
 
-	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags))
+	read_lock(&hu->proto_lock);
+
+	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
+		read_unlock(&hu->proto_lock);
 		return;
+	}
 
 	/* It does not need a lock here as it is already protected by a mutex in
 	 * tty caller
 	 */
 	hu->proto->recv(hu, data, count);
+	read_unlock(&hu->proto_lock);
 
 	if (hu->hdev)
 		hu->hdev->stat.byte_rx += count;
diff --git a/drivers/bluetooth/hci_uart.h b/drivers/bluetooth/hci_uart.h
index 2b05e55..c6e9e1c 100644
--- a/drivers/bluetooth/hci_uart.h
+++ b/drivers/bluetooth/hci_uart.h
@@ -87,6 +87,7 @@ struct hci_uart {
 	struct work_struct	write_work;
 
 	const struct hci_uart_proto *proto;
+	rwlock_t		proto_lock;	/* Stop work for proto close */
 	void			*priv;
 
 	struct sk_buff		*tx_skb;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH V1 0/1] hci_ldisc: Use rwlocking to avoid closing proto races
  2017-05-05 15:27 [PATCH V1 0/1] hci_ldisc: Use rwlocking to avoid closing proto races Dean Jenkins
  2017-05-05 15:27 ` [PATCH V1 1/1] Bluetooth: " Dean Jenkins
@ 2017-05-05 15:55 ` Marcel Holtmann
  1 sibling, 0 replies; 3+ messages in thread
From: Marcel Holtmann @ 2017-05-05 15:55 UTC (permalink / raw)
  To: Dean Jenkins; +Cc: Gustavo F. Padovan, Johan Hedberg, linux-bluetooth

Hi Dean,

> This is my 3rd patchset in my series of resolving a design flaw in
> hci_uart_tty_close() related to the proper closure of the Data Link protocol
> layer which can result in a kernel crash.
> 
> This patchset contains a single patch that adds rwlocking around the "proto"
> function pointers. This provides a full fix to prevent a potential kernel crash
> triggered by hu->proto->close() running before or during the execution of a
> non-close "proto" function pointer.
> 
> 
> Previous Discussions
> ====================
> 
> Please refer to the discussion on my 2nd patchset V1 that can be found here:
> https://www.spinics.net/lists/linux-bluetooth/msg70370.html
> You accepted the 3 patches of my patchset of commits namely:
>  Bluetooth: hci_ldisc: Add protocol check to hci_uart_send_frame()
>  Bluetooth: hci_ldisc: Add protocol check to hci_uart_dequeue()
>  Bluetooth: hci_ldisc: Add protocol check to hci_uart_tx_wakeup()
> 
> drivers/bluetooth/hci_ldisc.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
> 
> Please refer to the discussion on my 1st patchset V3 that can be found here:
> https://www.spinics.net/lists/linux-bluetooth/msg70315.html
> You accepted the 3 patches of my patchset of commits namely:
>  Bluetooth: hci_ldisc: Add missing return in hci_uart_init_work()
>  Bluetooth: hci_ldisc: Ensure hu->hdev set to NULL before freeing hdev
>  Bluetooth: hci_ldisc: Add missing clear HCI_UART_PROTO_READY
> 
> drivers/bluetooth/hci_ldisc.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
> 
> Please refer to the discussion on my 1st patchset V2 that can be found here:
> https://www.spinics.net/lists/linux-bluetooth/msg70183.html
> 
> Please refer to the discussion on my 1st RFC patchset V1 that can be found here:
> https://www.spinics.net/lists/linux-bluetooth/msg70077.html
> 
> 
> Analysis
> ========
> 
> There are multiple threads that use "proto" function pointers that potentially
> can be running concurrently with the thread of hci_uart_tty_close().
> 
> The threads include (not a complete list):
> 
> a) hci_uart_write_work() [hci_ldisc.c]
>   which is executed by the work queue hu->write_work which creates the
>   callstack:
>   hci_uart_write_work()
>   hci_uart_dequeue()
>   hu->proto->dequeue()
> 
> b) hci_tx_work() [hci_core.c]
>   which is executed by the work queue hdev->tx_work which creates the
>   callstack:
>   hci_tx_work()
>   hci_send_frame()
>   hdev->send()
>   hci_uart_send_frame()
>   hu->proto->enqueue()
>   then calls
>   hci_uart_tx_wakeup()
>   schedule_work(&hu->write_work)
>   triggering a)
> 
> c) hci_cmd_work() [hci_core.c]
>   which is executed by the work queue hdev->cmd_work which creates the
>   callstack:
>   hci_cmd_work()
>   hci_send_frame()
>   hdev->send()
>   hci_uart_send_frame()
>   hu->proto->enqueue()
>   then calls
>   hci_uart_tx_wakeup()
>   schedule_work(&hu->write_work)
>   triggering a)
> 
> d) hci_dev_close() [hci_core.c]
>   creates the callstack:
>   hci_dev_close()
>   hci_dev_do_close()
>   hdev->flush()
>   hci_uart_flush()
>   hu->proto->flush()
> 
> e) When BCSP is used and a retransmission is needed:
>   bcsp_timed_event() [hci_bcsp.c]
>   creates the callstack:
>   bcsp_timed_event()
>   hci_uart_tx_wakeup()
>   schedule_work(&hu->write_work)
>   triggering a)
> 
> 
> hci_ldisc.c: hci_uart_tty_close() currently has the construction:
> 
> 	cancel_work_sync(&hu->write_work);
> 
> 	if (test_and_clear_bit(HCI_UART_PROTO_READY, &hu->flags)) {
> 		if (hdev) {
> 			if (test_bit(HCI_UART_REGISTERED, &hu->flags))
> 				hci_unregister_dev(hdev);
> 			hci_free_dev(hdev);
> 		}
> 		hu->proto->close(hu);
> 	}
> 
> At first glance, the cancel_work_sync(&hu->write_work) would seem to inhibit
> hci_uart_write_work() from running. Actually, the call to cancel_work_sync()
> can be ineffective because:
> 
> a) hci_unregister_dev() calls hci_dev_do_close() and if HCI_UP flag is in the
>   set state then it is possible for hci_dev_do_close() to send a HCI command.
>   Such as sending a HCI RESET command which causes hci_send_frame() to schedule
>   hci_uart_write_work() after the cancel_work_sync() has executed.
> 
> b) In the case of BCSP retransmissions, the callstack in e) runs periodically
>   every 250ms. This can occur before and after the cancel_work_sync() executed.
> 
> 
> The call to hu->proto->close(hu) is partially protected by the flag
> HCI_UART_PROTO_READY but there is a small race condition. A race condition
> exists because hci_uart_tty_close() can clear HCI_UART_PROTO_READY just after
> the concurrent thread(s) detected the flag as set. The race potentially allows
> the concurrent thread(s) to run a "proto" function pointer after or during
> hci_uart_tty_close() is calling hu->proto->close(hu) which could trigger a
> kernel crash.
> 
> Bear in mind that on systems with low resources, such as embedded systems, that
> threads can get delayed by interrupts and pre-emption. This means that a
> concurrent thread might sleep after it checks the HCI_UART_PROTO_READY flag.
> Therefore, just using the flag HCI_UART_PROTO_READY is not 100% safe from
> introducing race conditions.
> 
> Note that the elapsed run-time for hci_unregister_dev() is dependent on a
> number of scenarios. This means hci_unregister_dev() can quickly return or take
> up to 10 seconds (guess) to return. This means that the elapsed time between
> clearing HCI_UART_PROTO_READY and executing hu->proto->close(hu) is variable
> and therefore there is a variable risk in triggering the kernel crash.
> 
> 
> Solution implemented by this patchset
> =====================================
> 
> Modify hci_ldisc.c: hci_uart_tty_close() to use a rwlock solution:
> 
> 	cancel_work_sync(&hu->write_work);
> 
> 	if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
> 		write_lock_irqsave(&hu->proto_lock, flags);
> 		clear_bit(HCI_UART_PROTO_READY, &hu->flags);
> 		write_unlock_irqrestore(&hu->proto_lock, flags);
> 
> 		if (hdev) {
> 			if (test_bit(HCI_UART_REGISTERED, &hu->flags))
> 				hci_unregister_dev(hdev);
> 			hci_free_dev(hdev);
> 		}
> 		hu->proto->close(hu);
> 	}
> 
> The call to write_lock_irqsave(&hu->proto_lock, flags) will block when a
> concurrent thread has called read_lock(&hu->proto_lock). This ensures that
> the flag HCI_UART_PROTO_READY is only cleared when no "proto" function pointers
> are running in the concurrent threads. Therefore, it is not possible for
> any "proto" function pointers to be running immediately before, during or after
> hu->proto->close(hu) is called. This means the race condition is avoided.
> 
> Note that this solution will inhibit the Data Link layer protocol from
> independently retransmitting because hu->proto->enqueue() cannot be called after
> HCI_UART_PROTO_READY has been put into the clear state.
> 
> 
> Test results from running the patchset
> ======================================
> 
> A x86 64-bit test kernel was created based on kernel v4.11.0 plus the 7 patches:
> 
> 3rd Patchset
>  Bluetooth: hci_ldisc: Use rwlocking to avoid closing proto races
> 2nd Patchset 
>  Bluetooth: hci_ldisc: Add protocol check to hci_uart_send_frame()
>  Bluetooth: hci_ldisc: Add protocol check to hci_uart_dequeue()
>  Bluetooth: hci_ldisc: Add protocol check to hci_uart_tx_wakeup()
> 1st Patchset 
>  Bluetooth: hci_ldisc: Add missing return in hci_uart_init_work()
>  Bluetooth: hci_ldisc: Ensure hu->hdev set to NULL before freeing hdev
>  Bluetooth: hci_ldisc: Add missing clear HCI_UART_PROTO_READY
> 
> The test used btattach with BCSP which fails to establish a link because BCSP
> is not fully implemented in the kernel. However, this is ideal in forcing
> BCSP retransmissions. btattach was killed after 5 seconds of running to force
> hci_uart_tty_close() to execute.
> 
> Here is a snippet of a dmesg log with dynamic debug enabled:
> [ 1039.415914] hci_uart_tty_close: tty ffff9861c691d800
> [ 1039.415917] hci_uart_close: hdev ffff98625b422000
> [ 1039.415920] hci_uart_flush: hdev ffff98625b422000 tty ffff9861c691d800
> [ 1039.415922] bcsp_flush: hu ffff9861c27bd480
> [ 1039.415928] hci_unregister_dev: ffff98625b422000 name hci1 bus 3
> [ 1039.418767] bcsp_timed_event: hu ffff9861c27bd480 retransmitting 2 pkts
> [ 1039.674810] Bluetooth: hci1 command 0x1001 tx timeout
> [ 1039.674827] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 1
> [ 1039.674832] hci_send_frame: hci1 type 1 len 3
> [ 1039.674836] hci_uart_send_frame: hci1: type 1 len 3
> [ 1039.674838] Bluetooth: hci1 sending frame failed (-49)
> [ 1041.722773] Bluetooth: hci1 command 0x1009 tx timeout
> [ 1041.722811] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 0
> [ 1045.818847] hci_uart_close: hdev ffff98625b422000
> [ 1045.818850] hci_uart_flush: hdev ffff98625b422000 tty ffff9861c691d800
> [ 1045.818868] hci_dev_do_close: hci1 ffff98625b422000
> [ 1045.819031] hci_conn_params_clear_all: All LE connection parameters were removed
> [ 1045.819036] bcsp_close: hu ffff9861c27bd480
> 
> The following can be observed from the results:
> 
> 1. hci_unregister_dev() ran which means HCI_UART_PROTO_READY has been cleared.
> 
> 2. bcsp_timed_event() runs after HCI_UART_PROTO_READY has been cleared and
>   so hci_uart_tx_wakeup() no longer appears in the log. This inhibits further
>   BCSP retransmission attempts.
> 
> 3. Bluetooth: hci1 command 0x1001 tx timeout
>   This occurs because BCSP was unable to successfully transfer the HCI
>   command. BCSP fails because the data plane to and from the UART driver is no
>   longer working.
> 
> 4. Bluetooth: hci1 sending frame failed (-49)
>   This occurs because HCI_UART_PROTO_READY has been cleared which inhibits
>   a new HCI command being sent from hci_cmd_work().
> 
> 5. Bluetooth: hci1 command 0x1009 tx timeout
>   BCSP never took the HCI command but the HCI core still performs the HCI
>   command timeout. This means the thread gets blocked for 2 seconds for doing
>   nothing.
> 
> 6. hci_uart_close() runs twice which seems to be unnecessary. This causes
>   hci_uart_flush() to run twice but notice the 2nd execution of
>   hci_uart_flush() does not run bcsp_flush() because hdev->flush gets set to
>   NULL in hci_uart_close(). This suggests a workaround was done in the past
>   to prevent a kernel crash.
> 
> 
> Outline of potential further work
> =================================
> 
> With this patchset, the design flaw in hci_uart_tty_close() is further exposed.
> 
> It can be seen that hci_uart_tty_close() is systematically preventing
> communications with the Bluetooth Radio Module. But hci_uart_tty_close() calls
> hci_unregister_dev() which calls hci_dev_do_close(). If the flag HCI_UP is in
> the set state then hci_dev_do_close() assumes that good communications with the
> Bluetooth Radio Module are still possible. Therefore, there is a conflict in the
> design because the design should not be closing the link and trying to use the
> link to send HCI commands at the same time. This can trigger the protocol error
> handling procedures to occur which are unsuccessful.
> 
> In addition, it is suspected that before hci_uart_tty_close() runs, the TTY
> layer has cut the data plane between the Data Link protocol and the UART driver.
> This means that if hci_uart_tty_close() is rearranged to allow
> HCI_UART_PROTO_READY to be cleared after hci_unregister_dev() runs then any HCI
> command will still fail to reach the Bluetooth Radio Module.
> 
> There are other issues in hci_ldisc.c: hci_uart_tty_close()
> 
> 	/* Detach from the tty */
> 	tty->disc_data = NULL;
> 
> 	if (!hu)
> 		return;
> 
> 	hdev = hu->hdev;
> 	if (hdev)
> 		hci_uart_close(hdev);
> 
> It can be seen that tty->disc_data is set to NULL. This causes some of the HCI
> UART LDISC functions to not perform any operation such as the functions:
> 
> hci_uart_tty_wakeup()
> hci_uart_tty_receive()
> hci_uart_tty_ioctl()
> 
> Doing "tty->disc_data = NULL" sabotages the communications with the Bluetooth
> Radio Module. Therefore, this statement needs to be moved to after the
> communications are going to be cleanly closed such as after hci_unregister_dev()
> has completed.
> 
> Calling hci_uart_close(hdev) is likely to interfere with the communications with
> the Bluetooth Radio Module. Therefore, this statement needs to be moved to after
> the communications are going to be cleanly closed such as after
> hci_unregister_dev() has completed.
> 
> 
> Potential solution to the design flaw
> =====================================
> 
> Suspect that a solution would be to get hci_uart_tty_close() to run
> hci_unregister_dev() before tearing down the resources needed by the
> Data Link protocol layer.
> 
> 
> The patch was rebased onto the bluetooth-next master branch HEAD commit:
> 2da711a Bluetooth: Skip vendor diagnostic configuration for HCI User Channel
> 
> Dean Jenkins (1):
>  Bluetooth: hci_ldisc: Use rwlocking to avoid closing proto races
> 
> drivers/bluetooth/hci_ldisc.c | 40 +++++++++++++++++++++++++++++++++++-----
> drivers/bluetooth/hci_uart.h  |  1 +
> 2 files changed, 36 insertions(+), 5 deletions(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2017-05-05 15:55 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-05 15:27 [PATCH V1 0/1] hci_ldisc: Use rwlocking to avoid closing proto races Dean Jenkins
2017-05-05 15:27 ` [PATCH V1 1/1] Bluetooth: " Dean Jenkins
2017-05-05 15:55 ` [PATCH V1 0/1] " Marcel Holtmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.