Hi Lee,

On Wed, Jul 6, 2022 at 3:53 AM Lee Jones wrote:
>
> On Tue, 05 Jul 2022, Luiz Augusto von Dentz wrote:
>
> > Hi Lee,
> >
> > On Wed, Jun 29, 2022 at 8:28 AM Lee Jones wrote:
> > >
> > > On Tue, 28 Jun 2022, Luiz Augusto von Dentz wrote:
> > >
> > > > Hi Eric, Lee,
> > > >
> > > > On Mon, Jun 27, 2022 at 4:39 PM Luiz Augusto von Dentz
> > > > wrote:
> > > > >
> > > > > Hi Eric, Lee,
> > > > >
> > > > > On Mon, Jun 27, 2022 at 7:41 AM Eric Dumazet wrote:
> > > > > >
> > > > > > On Wed, Jun 22, 2022 at 10:27 AM Lee Jones wrote:
> > > > > > >
> > > > > > > This change prevents a use-after-free caused by one of the worker
> > > > > > > threads starting up (see below) *after* the final channel reference
> > > > > > > has been put() during sock_close() but *before* the references to the
> > > > > > > channel have been destroyed.
> > > > > > >
> > > > > > >   refcount_t: increment on 0; use-after-free.
> > > > > > >   BUG: KASAN: use-after-free in refcount_dec_and_test+0x20/0xd0
> > > > > > >   Read of size 4 at addr ffffffc114f5bf18 by task kworker/u17:14/705
> > > > > > >
> > > > > > >   CPU: 4 PID: 705 Comm: kworker/u17:14 Tainted: G S W 4.14.234-00003-g1fb6d0bd49a4-dirty #28
> > > > > > >   Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM sm8150 Flame DVT (DT)
> > > > > > >   Workqueue: hci0 hci_rx_work
> > > > > > >   Call trace:
> > > > > > >    dump_backtrace+0x0/0x378
> > > > > > >    show_stack+0x20/0x2c
> > > > > > >    dump_stack+0x124/0x148
> > > > > > >    print_address_description+0x80/0x2e8
> > > > > > >    __kasan_report+0x168/0x188
> > > > > > >    kasan_report+0x10/0x18
> > > > > > >    __asan_load4+0x84/0x8c
> > > > > > >    refcount_dec_and_test+0x20/0xd0
> > > > > > >    l2cap_chan_put+0x48/0x12c
> > > > > > >    l2cap_recv_frame+0x4770/0x6550
> > > > > > >    l2cap_recv_acldata+0x44c/0x7a4
> > > > > > >    hci_acldata_packet+0x100/0x188
> > > > > > >    hci_rx_work+0x178/0x23c
> > > > > > >    process_one_work+0x35c/0x95c
> > > > > > >    worker_thread+0x4cc/0x960
> > > > > > >    kthread+0x1a8/0x1c4
> > > > > > >    ret_from_fork+0x10/0x18
> > > > > > >
> > > > > > > Cc: stable@kernel.org
> > > > > >
> > > > > > When was the bug added ? (Fixes: tag please)
> > > > > >
> > > > > > > Cc: Marcel Holtmann
> > > > > > > Cc: Johan Hedberg
> > > > > > > Cc: Luiz Augusto von Dentz
> > > > > > > Cc: "David S. Miller"
> > > > > > > Cc: Eric Dumazet
> > > > > > > Cc: Jakub Kicinski
> > > > > > > Cc: Paolo Abeni
> > > > > > > Cc: linux-bluetooth@vger.kernel.org
> > > > > > > Cc: netdev@vger.kernel.org
> > > > > > > Signed-off-by: Lee Jones
> > > > > > > ---
> > > > > > >  net/bluetooth/l2cap_core.c | 4 ++--
> > > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
> > > > > > > index ae78490ecd3d4..82279c5919fd8 100644
> > > > > > > --- a/net/bluetooth/l2cap_core.c
> > > > > > > +++ b/net/bluetooth/l2cap_core.c
> > > > > > > @@ -483,9 +483,7 @@ static void l2cap_chan_destroy(struct kref *kref)
> > > > > > >
> > > > > > >         BT_DBG("chan %p", chan);
> > > > > > >
> > > > > > > -       write_lock(&chan_list_lock);
> > > > > > >         list_del(&chan->global_l);
> > > > > > > -       write_unlock(&chan_list_lock);
> > > > > > >
> > > > > > >         kfree(chan);
> > > > > > >  }
> > > > > > > @@ -501,7 +499,9 @@ void l2cap_chan_put(struct l2cap_chan *c)
> > > > > > >  {
> > > > > > >         BT_DBG("chan %p orig refcnt %u", c, kref_read(&c->kref));
> > > > > > >
> > > > > > > +       write_lock(&chan_list_lock);
> > > > > > >         kref_put(&c->kref, l2cap_chan_destroy);
> > > > > > > +       write_unlock(&chan_list_lock);
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(l2cap_chan_put);
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > I do not think this patch is correct.
> > > > > >
> > > > > > a kref does not need to be protected by a write lock.
> > > > > >
> > > > > > This might shuffle things enough to work around a particular repro you have.
> > > > > >
> > > > > > If the patch was correct why not protect kref_get() sides ?
> > > > > >
> > > > > > Before the &hdev->rx_work is scheduled (queue_work(hdev->workqueue,
> > > > > > &hdev->rx_work),
> > > > > > a reference must be taken.
> > > > > >
> > > > > > Then this reference must be released at the end of hci_rx_work() or
> > > > > > when hdev->workqueue
> > > > > > is canceled.
> > > > > >
> > > > > > This refcount is not needed _if_ the workqueue is properly canceled at
> > > > > > device dismantle,
> > > > > > in a synchronous way.
> > > > > >
> > > > > > I do not see this hdev->rx_work being canceled, maybe this is the real issue.
> > > > > >
> > > > > > There is a call to drain_workqueue() but this is not enough I think,
> > > > > > because hci_recv_frame()
> > > > > > can re-arm
> > > > > > queue_work(hdev->workqueue, &hdev->rx_work);
> > > > >
> > > > > I suspect this likely a refcount problem, we do l2cap_get_chan_by_scid:
> > > > >
> > > > > /* Find channel with given SCID.
> > > > >  * Returns locked channel. */
> > > > > static struct l2cap_chan *l2cap_get_chan_by_scid(struct l2cap_conn
> > > > > *conn, u16 cid)
> > > > >
> > > > > So we return a locked channel but that doesn't prevent another thread
> > > > > to call l2cap_chan_put which doesn't care about l2cap_chan_lock so
> > > > > perhaps we actually need to host a reference while we have the lock,
> > > > > at least we do something like that on l2cap_sock.c:
> > > > >
> > > > > l2cap_chan_hold(chan);
> > > > > l2cap_chan_lock(chan);
> > > > >
> > > > > __clear_chan_timer(chan);
> > > > > l2cap_chan_close(chan, ECONNRESET);
> > > > > l2cap_sock_kill(sk);
> > > > >
> > > > > l2cap_chan_unlock(chan);
> > > > > l2cap_chan_put(chan);
> > > >
> > > > Perhaps something like this:
> > >
> > > I'm struggling to apply this for test:
> > >
> > > "error: corrupt patch at line 6"
> >
> > Check with the attached patch.
>
> With the patch applied:
>
> [ 188.825418][ T75] refcount_t: addition on 0; use-after-free.
> [ 188.825418][ T75] refcount_t: addition on 0; use-after-free.

Looks like the changes just make the issue more visible, since we are
now trying to take a reference when the refcount is already 0. This
shows the design is not quite right: the channel is removed from the
list only when it is destroyed, while we probably need to remove it
before that.

How about we use kref_get_unless_zero, as it appears to have been
introduced exactly for such cases (patch attached).

Luiz Augusto von Dentz
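
To make the idea concrete, here is a minimal userspace sketch of a
lookup that refuses to hand out a channel whose refcount has already
dropped to 0. It is not the attached patch and not kernel code: struct
chan, chan_hold_unless_zero(), chan_put() and chan_lookup() are
hypothetical stand-ins, and C11 atomics are assumed in place of
kref/refcount_t.

```c
/* Userspace model only: not the kernel API and not the attached patch. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct l2cap_chan and its kref. */
struct chan {
	atomic_int refcnt;
	int scid;
};

/* Models kref_get_unless_zero(): take a reference only if the count is non-zero. */
static bool chan_hold_unless_zero(struct chan *c)
{
	int old = atomic_load(&c->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&c->refcnt, &old, old + 1))
			return true;
	}
	return false;	/* the final put has already happened */
}

/* Models the put side; returns true when the last reference was dropped. */
static bool chan_put(struct chan *c)
{
	return atomic_fetch_sub(&c->refcnt, 1) == 1;
}

/* Hypothetical lookup: a channel that is still listed but dying is treated as not found. */
static struct chan *chan_lookup(struct chan **list, size_t n, int scid)
{
	for (size_t i = 0; i < n; i++) {
		if (list[i]->scid == scid)
			return chan_hold_unless_zero(list[i]) ? list[i] : NULL;
	}
	return NULL;
}
```

In the kernel the equivalent check would presumably be a
kref_get_unless_zero(&chan->kref) call in the lookup path, with the
caller treating a failed get as "channel not found" and later dropping
the reference with l2cap_chan_put().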