linux-hams.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* unregister_netdevice: waiting for ax0 to become free
@ 2024-04-09 21:05 Lars Kellogg-Stedman
  2024-04-10  0:20 ` David Ranch
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Lars Kellogg-Stedman @ 2024-04-09 21:05 UTC (permalink / raw)
  To: linux-hams

First, apologies if this isn't the right place for this. I know this
is primarily a development list, but I'm trying to track down patches
for a kernel issue.

I'm running a kernel built from 2c71fdf02a9, which includes the recent
use-after-free patch from Duoming Zhou in fd819ad3ec. After bringing
up ax.25 (see below), If I make a single ax.25 connection to the host
(to an ax25d hosted service), it will subsequently fail to reboot; the
kernel gets stuck waiting for an ax.25 interface:

     3   [  217.717327] unregister_netdevice: waiting for ax0 to
become free. Usage count = 0
     2   [  227.957427] unregister_netdevice: waiting for ax0 to
become free. Usage count = 0

The kernel will get stuck here indefinitely (or until the watchdog
triggers). I know this isn't a new problem, but is it one for which
patches exist? There's a lot of older information out there, and the
website in the mailing list welcome message
(https://radio.linux.org.au/) appears to have gone away.

It looks there is a panic earlier in the shutdown process; the kernel
trace starts like this:

=== cut ===
[   32.072855] ------------[ cut here ]------------
[   32.074595] refcount_t: underflow; use-after-free.
[   32.075777] WARNING: CPU: 0 PID: 1354 at lib/refcount.c:28
refcount_warn_saturate+0xbe/0x110
[   32.077104] Modules linked in: mkiss ax25 ...
[   32.088624] CPU: 0 PID: 1354 Comm: socat Not tainted 6.9.0-rc3+ #3
=== cut ===

I'm running a test environment with a pair of AXUDP interfaces; I'm
configuring things like this:

=== cut ===
#!/bin/sh

mkdir -p /run/radio
socat \
        pty,link=/run/radio/ptyudp0 \
        pty,link=/run/radio/kissudp0 &
socat \
        pty,link=/run/radio/ptyudp1 \
        pty,link=/run/radio/kissudp1 &

while ! [ -c /run/radio/ptyudp0 ]; do sleep 1; done
while ! [ -c /run/radio/ptyudp1 ]; do sleep 1; done

ax25ipd -c /etc/ax25/ax25ipd-udp0.conf -d /run/radio/ptyudp0
ax25ipd -c /etc/ax25/ax25ipd-udp1.conf -d /run/radio/ptyudp1

kissattach -l $(readlink /run/radio/kissudp0) udp0
kissattach -l $(readlink /run/radio/kissudp1) udp1

kissparms -p udp0 -c 1
kissparms -p udp1 -c 1

ax25d -l
=== cut ===

-- 
Lars Kellogg-Stedman <lars@oddbit.com> | larsks@{twitter, irc, github,
...} | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-09 21:05 unregister_netdevice: waiting for ax0 to become free Lars Kellogg-Stedman
@ 2024-04-10  0:20 ` David Ranch
  2024-04-10  1:21   ` Lars Kellogg-Stedman
  2024-04-10 12:01 ` Dan Carpenter
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: David Ranch @ 2024-04-10  0:20 UTC (permalink / raw)
  To: Lars Kellogg-Stedman, linux-hams


Hello Lars,

> First, apologies if this isn't the right place for this. I know this
> is primarily a development list, but I'm trying to track down patches
> for a kernel issue.
This is the right place for this Linux AX.25 kernel bug


> I'm running a kernel built from 2c71fdf02a9, which includes the recent
> use-after-free patch from Duoming Zhou in fd819ad3ec.

Few questions:

   - Which specific kernel version does that commit go into?
   - What distro and version of distro
   - Which ax.25 lib/apps/tools are you using (distro provided, 
official, ve7fet, other) and which versions?


> After bringing
> up ax.25 (see below), If I make a single ax.25 connection to the host
> (to an ax25d hosted service), it will subsequently fail to reboot; the
> kernel gets stuck waiting for an ax.25 interface:
>
>       3   [  217.717327] unregister_netdevice: waiting for ax0 to
> become free. Usage count = 0

Yup.. I've seen that and if you have people making connections to you 
that later time out, you will probably see stale sessions per "netstat 
-A ax25 -an".  Pay attention to the DESTINATION column.  I know Ubuntu 
20.04 with both it's standard and ESM kernels being impacted.  I 
recently started testing with the Raspberry Pi Bookworm 6.6.20 kernel 
but have only been running it for a week.  The reality here is that 
unless you know exactly what commits have been applied to a specific 
vendor's kernel, the version number alone cannot tell you if the fix is 
present.


> The kernel will get stuck here indefinitely (or until the watchdog
> triggers). I know this isn't a new problem, but is it one for which
> patches exist? There's a lot of older information out there

There are various threads about this very issue on various lists be it 
this one, Pat (winlink client), etc.   Here is a sample of my running 
list of issues:
--
Last good known kernel version
-----------------------------------------------------------
02/04/24:

I know I am jumping in late here, but the AX.25 stuff mostly works on 
kernel version 4.19, and does ok on 5.15 with some patches. Version 6 
just seems to explode.  I am running my linux/AX.25 on an old version of 
Slackware (13.37) that runs kernel version 2.6, and it is running 
perfectly for all the stuff I am using.  It is kind of sad that this has 
been neglected for that long.

-Chris KQ6UP


Raspberry Pi OS specific:
    - Bullseye, which should have the 5.15.xx kernel.

    - Buster has a release of 4.19 kernel I'll test also.


2/2/24: ve1jot@eastlink.ca
Ahh, well, there's been work done since kernel 5.15, I think it was 
kernel 6.2 I was trying, and that was for a different reason than 
crashing..there was an issue with axip/axudp links going dead and I 
found that kernel 6.2 seemed to resolve this issue, but I'm not sure if 
any of the recent patches address all problems, or just a few hi...

2/5/23:  jon-bousselot@pacbell.net
If I'm seeing this correctly, the stable kernel tree for rpi does NOT 
have the ax25 patches applied.
The ax25 patches do appear in 5.15.y and the 6.1.y.  The only two I 
checked.



Patches
-----------------------------------------------------------

Detailed 5.10.x fixes and a good kernel compile howto from Chris Madness:
https://groups.io/g/KM4ACK-Pi/topic/howto_patch_the_kernel_for/95904470?p=,,,20,0,0,0::recentpostdate/sticky,,,20,0,0,95904470,previd%3D0,nextid%3D1691495389076951633&previd=0&nextid=1691495389076951633


https://blog.habets.se/2021/11/AX25-user-space.html
    - known issues
    - proposed userspace solution

1/8/24: known kernel issues
https://groups.io/g/RaspberryPi-4-HamRadio/message/15430

12/23/24: Other reports of kernel panics, bugs:
https://groups.io/g/RaspberryPi-4-HamRadio/topic/103374879#15603

5/22/22:  stale sessions
https://github.com/la5nta/pat/issues/352

--


> website in the mailing list welcome message
> (https://radio.linux.org.au/) appears to have gone away.

Where did you see that?  That site has been gone for years and whatever 
is giving you that old link needs to be updated.


> I'm running a test environment with a pair of AXUDP interfaces; I'm
> configuring things like this:

While I don't see why it isn't legal, I've never seen people use socat 
to link together kissttach'ed link.

--David
KI6ZHD

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-10  0:20 ` David Ranch
@ 2024-04-10  1:21   ` Lars Kellogg-Stedman
  0 siblings, 0 replies; 10+ messages in thread
From: Lars Kellogg-Stedman @ 2024-04-10  1:21 UTC (permalink / raw)
  To: linux-hams; +Cc: linux-hams

David,

Thanks for the reply; glad to know I'm in the right place.

> > I'm running a kernel built from 2c71fdf02a9, which includes the recent
> > use-after-free patch from Duoming Zhou in fd819ad3ec.
>
> Few questions:
>
>    - Which specific kernel version does that commit go into?

That's the head of `master` today; the Makefile builds 6.9.0-rc3.

>    - What distro and version of distro

This is Fedora 39.

>    - Which ax.25 lib/apps/tools are you using (distro provided,
> official, ve7fet, other) and which versions?

I'm using the distro-provided packages:

    [root@node1 ~]# rpm -qa '*ax25*'
    libax25-1.1.1-8.fc39.x86_64
    ax25-apps-2.0.0-10.fc39.x86_64
    ax25-tools-1.0.4-8.fc39.x86_64

Which appear to all come from https://github.com/ve7fet/linuxax25.git.

> The reality here is that unless you know exactly what commits have been
> applied to a specific vendor's kernel, the version number alone cannot tell
> you if the fix is present.

In case it wasn't clear from my last email, I'm running an upstream kernel for
exactly this reason.

> Here is a sample of my running list of issues:

Thanks; I'll take a look through those. I know I've come across a few of those
links before (in particular, I've seen
https://blog.habets.se/2021/11/AX25-user-space.html several times in
my search results).

> > website in the mailing list welcome message
> > (https://radio.linux.org.au/) appears to have gone away.
>
> Where did you see that?  That site has been gone for years and whatever
> is giving you that old link needs to be updated.

That's in the welcome message for this mailing list. I joined in Oct. 2023, so
not that long ago, and the welcome message (from majordomo@vger.kernel.org)
includes this information:

> Welcome to the linux-hams mailing list, hosted on vger.kernel.org.
> This is for discussions and development of Radio Amateur protocol
> and interface subsystems.
>
> Archives:
>   http://hes.iki.fi/archive/linux-hams/
>   http://www.spinics.net/lists/linux-hams/
>   http://marc.info/?l=linux-hams
> Broken archives:
>   http://groups.google.com/groups?group=mlist.linux.hams
>   http://www.geocrawler.com/lists/3/Linux/
> Ham software index for Linux:
>   http://radio.linux.org.au/

It looks like several of those links are broken.

> While I don't see why it isn't legal, I've never seen people use socat
> to link together kissttach'ed link.

socat is useful for this because it can automatically create symlinks to the
pty devices, so you have deterministic names you can use in your scripts (and
in any case it shouldn't break the kernel).

-- 
Lars Kellogg-Stedman <lars@oddbit.com> | larsks@{twitter, irc, github,
...} | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-09 21:05 unregister_netdevice: waiting for ax0 to become free Lars Kellogg-Stedman
  2024-04-10  0:20 ` David Ranch
@ 2024-04-10 12:01 ` Dan Carpenter
  2024-04-11  2:42   ` Lars Kellogg-Stedman
  2024-04-10 12:02 ` Dan Carpenter
  2024-04-20  1:40 ` Lars Kellogg-Stedman
  3 siblings, 1 reply; 10+ messages in thread
From: Dan Carpenter @ 2024-04-10 12:01 UTC (permalink / raw)
  To: Lars Kellogg-Stedman; +Cc: linux-hams, Duoming Zhou, Simon Horman

You say that this isn't a new problem...  Why would it not be a new
problem?  Could you try reverting commit fd819ad3ecf6 ("ax25: fix
use-after-free bugs caused by ax25_ds_del_timer") and see if that fixes
the problem?

Just do: `git show fd819ad3ecf6 | patch -p1 -R` and rebuild.

regards,
dan carpenter

On Tue, Apr 09, 2024 at 05:05:40PM -0400, Lars Kellogg-Stedman wrote:
> First, apologies if this isn't the right place for this. I know this
> is primarily a development list, but I'm trying to track down patches
> for a kernel issue.
> 
> I'm running a kernel built from 2c71fdf02a9, which includes the recent
> use-after-free patch from Duoming Zhou in fd819ad3ec. After bringing
> up ax.25 (see below), If I make a single ax.25 connection to the host
> (to an ax25d hosted service), it will subsequently fail to reboot; the
> kernel gets stuck waiting for an ax.25 interface:
> 
>      3   [  217.717327] unregister_netdevice: waiting for ax0 to
> become free. Usage count = 0
>      2   [  227.957427] unregister_netdevice: waiting for ax0 to
> become free. Usage count = 0
> 
> The kernel will get stuck here indefinitely (or until the watchdog
> triggers). I know this isn't a new problem, but is it one for which
> patches exist? There's a lot of older information out there, and the
> website in the mailing list welcome message
> (https://radio.linux.org.au/) appears to have gone away.
> 
> It looks there is a panic earlier in the shutdown process; the kernel
> trace starts like this:
> 
> === cut ===
> [   32.072855] ------------[ cut here ]------------
> [   32.074595] refcount_t: underflow; use-after-free.
> [   32.075777] WARNING: CPU: 0 PID: 1354 at lib/refcount.c:28
> refcount_warn_saturate+0xbe/0x110
> [   32.077104] Modules linked in: mkiss ax25 ...
> [   32.088624] CPU: 0 PID: 1354 Comm: socat Not tainted 6.9.0-rc3+ #3
> === cut ===
> 
> I'm running a test environment with a pair of AXUDP interfaces; I'm
> configuring things like this:
> 
> === cut ===
> #!/bin/sh
> 
> mkdir -p /run/radio
> socat \
>         pty,link=/run/radio/ptyudp0 \
>         pty,link=/run/radio/kissudp0 &
> socat \
>         pty,link=/run/radio/ptyudp1 \
>         pty,link=/run/radio/kissudp1 &
> 
> while ! [ -c /run/radio/ptyudp0 ]; do sleep 1; done
> while ! [ -c /run/radio/ptyudp1 ]; do sleep 1; done
> 
> ax25ipd -c /etc/ax25/ax25ipd-udp0.conf -d /run/radio/ptyudp0
> ax25ipd -c /etc/ax25/ax25ipd-udp1.conf -d /run/radio/ptyudp1
> 
> kissattach -l $(readlink /run/radio/kissudp0) udp0
> kissattach -l $(readlink /run/radio/kissudp1) udp1
> 
> kissparms -p udp0 -c 1
> kissparms -p udp1 -c 1
> 
> ax25d -l
> === cut ===
> 
> -- 
> Lars Kellogg-Stedman <lars@oddbit.com> | larsks@{twitter, irc, github,
> ...} | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-09 21:05 unregister_netdevice: waiting for ax0 to become free Lars Kellogg-Stedman
  2024-04-10  0:20 ` David Ranch
  2024-04-10 12:01 ` Dan Carpenter
@ 2024-04-10 12:02 ` Dan Carpenter
  2024-04-20  1:40 ` Lars Kellogg-Stedman
  3 siblings, 0 replies; 10+ messages in thread
From: Dan Carpenter @ 2024-04-10 12:02 UTC (permalink / raw)
  To: Lars Kellogg-Stedman; +Cc: linux-hams

On Tue, Apr 09, 2024 at 05:05:40PM -0400, Lars Kellogg-Stedman wrote:
> It looks there is a panic earlier in the shutdown process; the kernel
> trace starts like this:
> 
> === cut ===
> [   32.072855] ------------[ cut here ]------------
> [   32.074595] refcount_t: underflow; use-after-free.
> [   32.075777] WARNING: CPU: 0 PID: 1354 at lib/refcount.c:28
> refcount_warn_saturate+0xbe/0x110
> [   32.077104] Modules linked in: mkiss ax25 ...
> [   32.088624] CPU: 0 PID: 1354 Comm: socat Not tainted 6.9.0-rc3+ #3
> === cut ===
> 

Could you provide the whole dmesg?

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-10 12:01 ` Dan Carpenter
@ 2024-04-11  2:42   ` Lars Kellogg-Stedman
  2024-04-11 13:47     ` Lars Kellogg-Stedman
  0 siblings, 1 reply; 10+ messages in thread
From: Lars Kellogg-Stedman @ 2024-04-11  2:42 UTC (permalink / raw)
  To: Dan Carpenter, linux-hams

On Wed, Apr 10, 2024 at 8:01 AM Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> You say that this isn't a new problem...  Why would it not be a new
> problem?

I know that the same behavior reproduces with the Fedora 39 stock
kernel (6.5.6 +
patches, released 10/2023, predating fd819ad3ecf6 by several months).
I've also seen the same thing on Raspbian kernels dating back
before that.

> Could you try reverting commit fd819ad3ecf6 ("ax25: fix
> use-after-free bugs caused by ax25_ds_del_timer") and see if that fixes
> the problem?
>
> Just do: `git show fd819ad3ecf6 | patch -p1 -R` and rebuild.

The behavior is identical with that patch removed.

> Could you provide the whole dmesg?

Sure. Included below:

[   43.941039] ------------[ cut here ]------------er root...
[   43.941327] refcount_t: underflow; use-after-free.
[   43.941654] WARNING: CPU: 0 PID: 992 at lib/refcount.c:28
refcount_warn_saturate+0xbe/0x110
[   43.942186] Modules linked in: mkiss rfkill binfmt_misc vfat fat
intel_rapl_msr intel_rapl_common intel_uncore_frequency_common
intel_pmc_core snd_hda_codec_generic intel_vsec pmt_telemetry
pmt_class snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_intel
snd_hda_codec snd_hda_core kvm snd_hwdep snd_pcm iTCO_wdt
intel_pmc_bxt iTCO_vendor_support irqbypass rapl pktcdvd snd_timer
i2c_i801 snd i2c_smbus pcspkr lpc_ich soundcore virtio_net
net_failover virtio_balloon failover virtio_gpu virtio_dma_buf joydev
netrom ax25 ip6_tables ip_tables fuse loop zram crct10dif_pclmul
crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
ghash_clmulni_intel sha512_ssse3 sha256_ssse3 virtio_console
sha1_ssse3 virtio_blk serio_raw qemu_fw_cfg
[   43.946568] CPU: 0 PID: 992 Comm: ax25ipd Not tainted
6.8.4-200.fc39.x86_64 #1
[   43.947132] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.16.3-1.fc39 04/01/2014
[   43.947692] RIP: 0010:refcount_warn_saturate+0xbe/0x110
[   43.948031] Code: 01 01 e8 95 43 8e ff 0f 0b c3 cc cc cc cc 80 3d
6b 4e ea 01 00 75 85 48 c7 c7 00 8a b7 98 c6 05 5b 4e ea 01 01 e8 72
43 8e ff <0f> 0b c3 cc cc cc cc 80 3d 49 4e ea 01 00 0f 85 5e ff ff ff
48 c7
[   43.949211] RSP: 0018:ffffa714c24a3b28 EFLAGS: 00010286

[   43.949563] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
[   43.950163] RDX: ffff90817bc218c8 RSI: 0000000000000001 RDI: ffff90817bc218c0
[   43.950165] RBP: ffff908113a18000 R08: 0000000000000000 R09: ffffa714c24a39b8
[   43.950166] R10: ffffa714c24a39b0 R11: 0000000000000003 R12: ffff908106270e40
[   43.950167] R13: ffffa714c24a3b98 R14: 00000000ffffffe5 R15: 0000000000000000
[   43.950168] FS:  0000000000000000(0000) GS:ffff90817bc00000(0000)
knlGS:0000000000000000
[   43.950169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.950170] CR2: 00007f3f99972080 CR3: 00000001068ca005 CR4: 0000000000770ef0
[   43.950173] PKRU: 55555554
[   43.950174] Call Trace:
[   43.950176]  <TASK>
[   43.950176]  ? refcount_warn_saturate+0xbe/0x110
[   43.950180]  ? __warn+0x81/0x130
[   43.950184]  ? refcount_warn_saturate+0xbe/0x110
[   43.950186]  ? report_bug+0x171/0x1a0
[   43.950189]  ? prb_read_valid+0x1b/0x30
[   43.950191]  ? handle_bug+0x3c/0x80
[   43.950194]  ? exc_invalid_op+0x17/0x70
[   43.950196]  ? asm_exc_invalid_op+0x1a/0x20
[   43.950199]  ? refcount_warn_saturate+0xbe/0x110
[   43.950201]  ? refcount_warn_saturate+0xbe/0x110
[   43.950203]  ax25_device_event+0x1b1/0x240 [ax25]
[   43.950213]  notifier_call_chain+0x5a/0xd0
[   43.950217]  dev_close_many+0x11e/0x180
[   43.950221]  unregister_netdevice_many_notify+0x199/0x880
[   43.950224]  unregister_netdevice_queue+0xf7/0x140
[   43.950226]  unregister_netdev+0x1c/0x30
[   43.950230]  mkiss_close+0x76/0xb0 [mkiss]
[   43.950234]  tty_ldisc_hangup+0xfd/0x230
[   43.950238]  __tty_hangup.part.0+0x1f3/0x370
[   43.950240]  tty_release+0xee/0x600
[   43.950242]  __fput+0x97/0x2c0
[   43.950245]  task_work_run+0x5a/0x90
[   43.950248]  do_exit+0x360/0xaf0
[   43.950250]  ? __handle_mm_fault+0xb2c/0xe50
[   43.950254]  do_group_exit+0x31/0x80
[   43.950257]  __x64_sys_exit_group+0x18/0x20
[   43.950258]  do_syscall_64+0x86/0x170
[   43.950261]  ? __count_memcg_events+0x4d/0xc0
[   43.950263]  ? count_memcg_events.constprop.0+0x1a/0x30
[   43.950265]  ? handle_mm_fault+0xa2/0x360
[   43.950267]  ? do_user_addr_fault+0x304/0x670
[   43.950270]  ? exc_page_fault+0x7f/0x180
[   43.950273]  entry_SYSCALL_64_after_hwframe+0x73/0x7b
[   43.950275] RIP: 0033:0x7f9d1e85091d
[   43.950288] Code: Unable to access opcode bytes at 0x7f9d1e8508f3.
[   43.950289] RSP: 002b:00007ffcee06aff8 EFLAGS: 00000202 ORIG_RAX:
00000000000000e7
[   43.950291] RAX: ffffffffffffffda RBX: 00007f9d1e94cfa8 RCX: 00007f9d1e85091d
[   43.950292] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000001
[   43.950293] RBP: 00007ffcee06b050 R08: 00007ffcee06af98 R09: 00007ffcee06af1f
[   43.950294] R10: 00007ffcee06ae90 R11: 0000000000000202 R12: 0000000000000001
[   43.950294] R13: 0000000000000000 R14: 0000000000000001 R15: 00007f9d1e94cfc0
[   43.950297]  </TASK>
[   43.950297] ---[ end trace 0000000000000000 ]---


--
Lars Kellogg-Stedman <lars@oddbit.com> | larsks@{twitter, irc, github,
...} | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-11  2:42   ` Lars Kellogg-Stedman
@ 2024-04-11 13:47     ` Lars Kellogg-Stedman
  0 siblings, 0 replies; 10+ messages in thread
From: Lars Kellogg-Stedman @ 2024-04-11 13:47 UTC (permalink / raw)
  To: Dan Carpenter, linux-hams

It turns out there is also an *earlier* trace that occurs after the
first ax25 connection is closed. The trace does not repeat for
subsequent connections:

[   32.034626] ------------[ cut here ]------------
[   32.035856] refcount_t: decrement hit 0; leaking memory.
[   32.036887] WARNING: CPU: 1 PID: 1014 at lib/refcount.c:31
refcount_warn_saturate+0xff/0x110
[   32.037344] Modules linked in: mkiss rfkill binfmt_misc vfat fat
intel_rapl_msr intel_rapl_common intel_uncore_frequency_common
kvm_intel kvm snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg
snd_intel_sdw_acpi snd_hda_codec snd_hda_core rapl iTCO_wdt snd_hwdep
snd_pcm iTCO_vendor_support snd_timer virtio_gpu snd virtio_net
i2c_i801 pktcdvd net_failover pcspkr i2c_smbus soundcore lpc_ich
virtio_dma_buf virtio_balloon failover drm_shmem_helper joydev netrom
ax25 loop zram zsmalloc crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel sha512_ssse3 sha256_ssse3 virtio_blk sha1_ssse3
virtio_console serio_raw ip6_tables ip_tables fuse qemu_fw_cfg
[   32.040112] CPU: 1 PID: 1014 Comm: axwrapper Not tainted 6.9.0-rc3-radio+ #9
[   32.040506] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.16.3-1.fc39 04/01/2014
[   32.040944] RIP: 0010:refcount_warn_saturate+0xff/0x110
[   32.041203] Code: f0 13 83 82 c6 05 34 08 4e 01 01 e8 fb bd 91 ff
0f 0b c3 cc cc cc cc 48 c7 c7 48 14 83 82 c6 05 18 08 4e 01 01 e8 e1
bd 91 ff <0f> 0b c3 cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90
90 90
[   32.042170] RSP: 0018:ffffc9000080bd28 EFLAGS: 00010282
[   32.042474] RAX: 0000000000000000 RBX: ffff88813f48e570 RCX: 0000000000000000
[   32.042853] RDX: ffff88817bd2f1c0 RSI: ffff88817bd21880 RDI: ffff88817bd21880
[   32.043189] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000003
[   32.043635] R10: ffffc9000080bbc8 R11: ffffffff82b3e548 R12: ffff88810e42f900
[   32.043994] R13: ffff88813f11d400 R14: ffff88813f56b880 R15: ffff88813f48e000
[   32.044369] FS:  0000000000000000(0000) GS:ffff88817bd00000(0000)
knlGS:0000000000000000
[   32.044773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.045055] CR2: 00007efc9c3a7680 CR3: 0000000002a22006 CR4: 0000000000770ef0
[   32.045436] PKRU: 55555554
[   32.045577] Call Trace:
[   32.045702]  <TASK>
[   32.045812]  ? __warn+0x80/0x120
[   32.045975]  ? refcount_warn_saturate+0xff/0x110
[   32.046211]  ? report_bug+0x164/0x190
[   32.046416]  ? handle_bug+0x3c/0x80
[   32.046592]  ? exc_invalid_op+0x17/0x70
[   32.046801]  ? asm_exc_invalid_op+0x1a/0x20
[   32.047015]  ? refcount_warn_saturate+0xff/0x110
[   32.047257]  ? refcount_warn_saturate+0xff/0x110
[   32.047570]  ref_tracker_free+0x206/0x210
[   32.047797]  ? lock_timer_base+0x76/0xa0
[   32.048037]  ? _raw_spin_unlock_irqrestore+0xe/0x40
[   32.048372]  ? __mod_timer+0x10d/0x310
[   32.048591]  ax25_release+0xf3/0x360 [ax25]
[   32.048828]  __sock_release+0x3a/0xc0
[   32.049047]  sock_close+0x15/0x20
[   32.049444]  __fput+0x97/0x2c0
[   32.049810]  task_work_run+0x59/0x90
[   32.050314]  do_exit+0x311/0xac0
[   32.050651]  ? irqtime_account_irq+0x45/0xc0
[   32.051050]  do_group_exit+0x30/0x80
[   32.051469]  __x64_sys_exit_group+0x18/0x20
[   32.051875]  do_syscall_64+0x64/0x170
[   32.052271]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   32.052765] RIP: 0033:0x7efc9c2ac91d
[   32.053139] Code: Unable to access opcode bytes at 0x7efc9c2ac8f3.
[   32.053669] RSP: 002b:00007ffdd1d984d8 EFLAGS: 00000202 ORIG_RAX:
00000000000000e7
[   32.054281] RAX: ffffffffffffffda RBX: 00007efc9c3a8fa8 RCX: 00007efc9c2ac91d
[   32.054888] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000000
[   32.055470] RBP: 00007ffdd1d98530 R08: 00007ffdd1d98480 R09: 00007ffdd1d9840f
[   32.055957] R10: 00007ffdd1d98390 R11: 0000000000000202 R12: 0000000000000001
[   32.056476] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efc9c3a8fc0
[   32.056978]  </TASK>
[   32.057231] ---[ end trace 0000000000000000 ]---


--
Lars Kellogg-Stedman <lars@oddbit.com> | larsks@{twitter, irc, github,
...} | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-09 21:05 unregister_netdevice: waiting for ax0 to become free Lars Kellogg-Stedman
                   ` (2 preceding siblings ...)
  2024-04-10 12:02 ` Dan Carpenter
@ 2024-04-20  1:40 ` Lars Kellogg-Stedman
  2024-04-24  2:06   ` Lars Kellogg-Stedman
  3 siblings, 1 reply; 10+ messages in thread
From: Lars Kellogg-Stedman @ 2024-04-20  1:40 UTC (permalink / raw)
  To: linux-hams

On Tue, Apr 09, 2024 at 05:05:40PM -0400, Lars Kellogg-Stedman wrote:
> If I make a single ax.25 connection to the host
> (to an ax25d hosted service), it will subsequently fail to reboot; the
> kernel gets stuck waiting for an ax.25 interface:
>
>      3   [  217.717327] unregister_netdevice: waiting for ax0 to
>          become free. Usage count = 0

I'm still trying to track down these issues. I've built a minimal kernel
for testing (from 2668e3ae2ef) with ax.25 support compiled in (that is,
no modules), and that seems to have changed the nature of the tracebacks
a bit.

If bring up a couple of axudp ports and a minimal ax25d configuration
and make a single connection and then attempt to reboot, I see:

    refcount_t: underflow; use-after-free.

The call trace looks like this:

    Call Trace:
    <TASK>
    ? show_regs+0x5f/0x70
    ? __warn+0x6c/0xd0
    ? refcount_warn_saturate+0xb2/0x100
    ? report_bug+0x166/0x1b0
    ? prb_read_valid+0x1b/0x20
    ? handle_bug+0x3e/0x70
    ? exc_invalid_op+0x1c/0x70
    ? asm_exc_invalid_op+0x1f/0x30
    ? refcount_warn_saturate+0xb2/0x100
    ? refcount_warn_saturate+0xb2/0x100
    ax25_dev_device_down+0x16e/0x190
    ax25_device_event+0x1a6/0x240
    raw_notifier_call_chain+0x45/0x60
    call_netdevice_notifiers_info+0x4f/0x90
    dev_close_many+0xe1/0x140
    unregister_netdevice_many_notify+0x139/0x7f0
    ? __call_rcu_common.constprop.0+0x8a/0x220
    unregister_netdevice_queue+0x8d/0xd0
    unregister_netdev+0x1b/0x30
    mkiss_close+0x74/0xb0
    tty_ldisc_close+0x26/0x30
    tty_ldisc_hangup+0xf5/0x210
    __tty_hangup.part.0+0x1b2/0x2f0
    tty_vhangup+0x14/0x20
    pty_close+0x122/0x160
    tty_release+0xe0/0x4d0
    __fput+0x8c/0x2a0
    ____fput+0xd/0x10
    task_work_run+0x59/0x80
    do_exit+0x2aa/0x8e0
    ? handle_mm_fault+0xb5/0x300
    do_group_exit+0x33/0x90
    __x64_sys_exit_group+0x17/0x20
    x64_sys_call+0x18c9/0x1aa0
    do_syscall_64+0x55/0x110
    entry_SYSCALL_64_after_hwframe+0x4b/0x53

Stepping through ax25_dev_device_down with gdb, I noticed that there are
two calls to ax25_dev_put (in net/ax25/ax25_dev.c):

    unlock_put:
      spin_unlock_bh(&ax25_dev_lock);
      ax25_dev_put(ax25_dev);
      dev->ax25_ptr = NULL;
      netdev_put(dev, &ax25_dev->dev_tracker);
      ax25_dev_put(ax25_dev);

Is that valid? By the time the code reaches the second call to
ax25_dev_put, ax25_dev->refcount is already 0. If I remove the second
call to ax25_dev_put, that gets rid of the "refcount_t: underflow;
use-after-free." error, but doesn't prevent the kernel from getting
stuck at:

    unregister_netdevice: waiting for ax0 to become free. Usage count = 0

I've include the complete trace below:

------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 269 at lib/refcount.c:28 refcount_warn_saturate+0xb2/0x100
CPU: 0 PID: 269 Comm: ax25ipd Not tainted 6.9.0-rc4-ax25+ #36
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:refcount_warn_saturate+0xb2/0x100
Code: d1 a5 9d 00 01 e8 5e 6b b4 ff 0f 0b 5d c3 80 3d be a5 9d 00 00 75 91 48 c7 c7 08 a1 dc 81 c6 05 ae a5 9d 00 01 e8 3e 6b b4 ff <0f> 0b 5d c3 80 3d 9f a5 9d 00 00 0f 85 6d ff ff ff 48 c7 c7 d8 a0
RSP: 0018:ffffc9000182fb48 EFLAGS: 00010286
RAX: 0000000000000026 RBX: ffff88810a1359c0 RCX: 0000000000000027
RDX: ffff88817bc1bf48 RSI: 0000000000000001 RDI: ffff88817bc1bf40
RBP: ffffc9000182fb48 R08: 00000000ffffefff R09: 00000000ffffefff
R10: 0000000000000001 R11: ffffffff81e3a5e0 R12: ffff888109b6a000
R13: ffff88810a135a50 R14: ffff88810372a400 R15: ffff88810372a460
FS:  0000000000000000(0000) GS:ffff88817bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005580c08ca8a0 CR3: 00000001017a0006 CR4: 0000000000770eb0
PKRU: 55555554
Call Trace:
 <TASK>
 ? show_regs+0x5f/0x70
 ? __warn+0x6c/0xd0
 ? refcount_warn_saturate+0xb2/0x100
 ? report_bug+0x166/0x1b0
 ? prb_read_valid+0x1b/0x20
 ? handle_bug+0x3e/0x70
 ? exc_invalid_op+0x1c/0x70
 ? asm_exc_invalid_op+0x1f/0x30
 ? refcount_warn_saturate+0xb2/0x100
 ? refcount_warn_saturate+0xb2/0x100
 ax25_dev_device_down+0x16e/0x190
 ax25_device_event+0x1a6/0x240
 raw_notifier_call_chain+0x45/0x60
 call_netdevice_notifiers_info+0x4f/0x90
 dev_close_many+0xe1/0x140
 unregister_netdevice_many_notify+0x139/0x7f0
 ? __call_rcu_common.constprop.0+0x8a/0x220
 unregister_netdevice_queue+0x8d/0xd0
 unregister_netdev+0x1b/0x30
 mkiss_close+0x74/0xb0
 tty_ldisc_close+0x26/0x30
 tty_ldisc_hangup+0xf5/0x210
 __tty_hangup.part.0+0x1b2/0x2f0
 tty_vhangup+0x14/0x20
 pty_close+0x122/0x160
 tty_release+0xe0/0x4d0
 __fput+0x8c/0x2a0
 ____fput+0xd/0x10
 task_work_run+0x59/0x80
 do_exit+0x2aa/0x8e0
 do_group_exit+0x33/0x90
 __x64_sys_exit_group+0x17/0x20
 x64_sys_call+0x18c9/0x1aa0
 do_syscall_64+0x55/0x110
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f140d27491d
Code: Unable to access opcode bytes at 0x7f140d2748f3.
RSP: 002b:00007ffc66356c28 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f140d370fa8 RCX: 00007f140d27491d
RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000002
RBP: 00007ffc66356c80 R08: 00007ffc66356bc8 R09: 00007ffc66356b4f
R10: 00007ffc66356ac0 R11: 0000000000000202 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000002 R15: 00007f140d370fc0
 </TASK>
---[ end trace 0000000000000000 ]---

-- 
Lars Kellogg-Stedman <lars@oddbit.com> | larsks @ {irc,twitter,github}
http://blog.oddbit.com/                | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-20  1:40 ` Lars Kellogg-Stedman
@ 2024-04-24  2:06   ` Lars Kellogg-Stedman
  2024-04-25 12:14     ` Lars Kellogg-Stedman
  0 siblings, 1 reply; 10+ messages in thread
From: Lars Kellogg-Stedman @ 2024-04-24  2:06 UTC (permalink / raw)
  To: linux-hams

To help figure out what's going on, I wrote an `ax-sockets` command [1]
for gdb that iterates over the `ax25_list` variable to show active
sockets. The output looks something like this:

    0xffff888109fe9800 LISTEN       ax1 src:TEST-1     dst:<none>     cb:02 dev:03
    0xffff888100860600 LISTEN       ax0 src:TEST-0     dst:<none>     cb:02 dev:03

The two fields at the end (labeled `cb:` and `dev:`) show the value of
the `refcount` attribute of the `ax25_cb` struct and the corresponding
`ax25_cb->ax25_dev` struct.

When the system first boots up and we create some listening sockets
with ax25d, each ax.25 interface has a refcount of 3 (that's what we see
in the above output). Every time we initiate a connection between the
sockets...

    axcall udp0 TEST-1

...the refcount on the listening device drops by one when the connection closes.
That is, after making one connection, we have:

    0xffff888109fe9800 LISTEN       ax1 src:TEST-1     dst:<none>     cb:02 dev:02
    0xffff888100860600 LISTEN       ax0 src:TEST-0     dst:<none>     cb:02 dev:03

And after another connection:

    0xffff888109fe9800 LISTEN       ax1 src:TEST-1     dst:<none>     cb:02 dev:01
    0xffff888100860600 LISTEN       ax0 src:TEST-0     dst:<none>     cb:02 dev:03

After a third connection:

    0xffff888109fe9800 LISTEN       ax1 src:TEST-1     dst:<none>     cb:02 dev:00
    0xffff888100860600 LISTEN       ax0 src:TEST-0     dst:<none>     cb:02 dev:03

And then after a fourth connection, we get a kernel traceback with
"refcount_t: underflow; use-after-free." and the following backtrace:

    Call Trace:
    <TASK>
    ? show_regs+0x64/0x70
    ? __warn+0x83/0x120
    ? refcount_warn_saturate+0xb2/0x100
    ? report_bug+0x158/0x190
    ? prb_read_valid+0x20/0x30
    ? handle_bug+0x3e/0x70
    ? exc_invalid_op+0x1c/0x70
    ? asm_exc_invalid_op+0x1f/0x30
    ? refcount_warn_saturate+0xb2/0x100
    ax25_dev_put+0x37/0x40
    ax25_release+0x377/0x3b0
    __sock_release+0x35/0xa0
    sock_close+0x19/0x20
    __fput+0x91/0x2a0
    ____fput+0x12/0x20
    task_work_run+0x5e/0x90
    do_exit+0x2f5/0xa00
    ? handle_mm_fault+0xba/0x300
    do_group_exit+0x38/0x90
    __x64_sys_exit_group+0x1c/0x20
    x64_sys_call+0x18ce/0x1aa0
    do_syscall_64+0x55/0x110
    entry_SYSCALL_64_after_hwframe+0x4b/0x53

If we look at things when a connection is established -- e.g., at the
top of ax25_accept -- we see that refcount for the *source* interface
(udp0/ax0 in this example) has incremented:

    0xffff888103eb3a00 ESTABLISHED  ax1 src:TEST-1     dst:TEST-0     cb:02 dev:03
    0xffff888103ea4400 SABM_SENT    ax0 src:TEST-0     dst:TEST-1     cb:02 dev:04
    0xffff888103eb3600 LISTEN       ax1 src:TEST-1     dst:<none>     cb:02 dev:03
    0xffff888103eb3400 LISTEN       ax0 src:TEST-0     dst:<none>     cb:02 dev:04

It looks like we may be failing to increment the refcount on the
*listening* socket when a connection is established...but we're
decrementing it when the connection is closed, leading to problems.

[1]: https://github.com/larsks/ax25-debugging

-- 
Lars Kellogg-Stedman <lars@oddbit.com> | larsks @ {irc,twitter,github}
http://blog.oddbit.com/                | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unregister_netdevice: waiting for ax0 to become free
  2024-04-24  2:06   ` Lars Kellogg-Stedman
@ 2024-04-25 12:14     ` Lars Kellogg-Stedman
  0 siblings, 0 replies; 10+ messages in thread
From: Lars Kellogg-Stedman @ 2024-04-25 12:14 UTC (permalink / raw)
  To: linux-hams

On Tue, Apr 23, 2024 at 10:06:44PM -0400, Lars Kellogg-Stedman wrote:
> It looks like we may be failing to increment the refcount on the
> *listening* socket when a connection is established...but we're
> decrementing it when the connection is closed, leading to problems.

The above problem can be resolved by adding a call to ax25_dev_hold() in
ax25_accept(), but it looks like a similar situation exists in the
device refcount (ax25_dev->dev->dev_refcnt), which also decrements for
every connection and ultimately experiences an underflow (1 -> 0 -> -1073741824).

-- 
Lars Kellogg-Stedman <lars@oddbit.com> | larsks @ {irc,twitter,github}
http://blog.oddbit.com/                | N1LKS

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-04-25 15:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-09 21:05 unregister_netdevice: waiting for ax0 to become free Lars Kellogg-Stedman
2024-04-10  0:20 ` David Ranch
2024-04-10  1:21   ` Lars Kellogg-Stedman
2024-04-10 12:01 ` Dan Carpenter
2024-04-11  2:42   ` Lars Kellogg-Stedman
2024-04-11 13:47     ` Lars Kellogg-Stedman
2024-04-10 12:02 ` Dan Carpenter
2024-04-20  1:40 ` Lars Kellogg-Stedman
2024-04-24  2:06   ` Lars Kellogg-Stedman
2024-04-25 12:14     ` Lars Kellogg-Stedman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).