All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression in "ax25: Fix UAF bugs in ax25 timers", patch  82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
@ 2022-05-23  5:46 Thomas Osterried
  2022-05-23 13:52 ` Dan Carpenter
  2022-05-23 14:07 ` duoming
  0 siblings, 2 replies; 12+ messages in thread
From: Thomas Osterried @ 2022-05-23  5:46 UTC (permalink / raw)
  To: linux-hams, Duoming Zhou; +Cc: David S. Miller, Paolo Abeni


Regression due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 "ax25: Fix UAF bugs in ax25 timers",
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/af_ax25.c?id=82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
and discussion about necessarity of
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/ax25_subr.c?id=fc6d01ff9ef03b66d4a3a23b46fc3c3d8cf92009
(because it addresses the same problem).


Due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc, any ax25 session is cleaned up anymore on disconnect.
-> netstat --ax25 shows the ex-connected session it in state "LISTEN". Makes ax25 unusable in a production environment.
=> Regression.


This is my proposal for a fix.
I'd appreciate you test and give feedback.

For more details, see below.

$ diff -Naur af_ax25.c.orig af_ax25.c.dl9sau  
--- af_ax25.c.orig      2022-05-04 12:16:35.229499063 +0200
+++ af_ax25.c.dl9sau    2022-05-23 06:48:06.268171116 +0200
@@ -86,11 +86,18 @@
 again:
        ax25_for_each(s, &ax25_list) {
                if (s->ax25_dev == ax25_dev) {
+                       // Stop timers. Really needed? ax25_disconnect(), that we call with arg ENETUNREACH will also stop the timers.
+                       del_timer_sync(&s->timer);
+                       del_timer_sync(&s->t1timer);
+                       del_timer_sync(&s->t2timer);
+                       del_timer_sync(&s->t3timer);
+                       del_timer_sync(&s->idletimer);
                        sk = s->sk;
                        if (!sk) {
                                spin_unlock_bh(&ax25_list_lock);
                                ax25_disconnect(s, ENETUNREACH);
                                s->ax25_dev = NULL;
+                               ax25_cb_del(s);
                                spin_lock_bh(&ax25_list_lock);
                                goto again;
                        }
@@ -104,6 +111,7 @@
                                ax25_dev_put(ax25_dev);
                        }
                        release_sock(sk);
+                       ax25_cb_del(s);
                        spin_lock_bh(&ax25_list_lock);
                        sock_put(sk);
                        /* The entry could have been deleted from the
@@ -995,9 +1003,11 @@
        if (sk->sk_type == SOCK_SEQPACKET) {
                switch (ax25->state) {
                case AX25_STATE_0:
-                       release_sock(sk);
-                       ax25_disconnect(ax25, 0);
-                       lock_sock(sk);
+                       if (!sock_flag(ax25->sk, SOCK_DEAD)) {
+                               release_sock(sk);
+                               ax25_disconnect(ax25, 0);
+                               lock_sock(sk);
+                       }
                        ax25_destroy_socket(ax25);
                        break;
 
@@ -1052,12 +1062,8 @@
                sk->sk_state_change(sk);
                ax25_destroy_socket(ax25);
        }
+
        if (ax25_dev) {
-               del_timer_sync(&ax25->timer);
-               del_timer_sync(&ax25->t1timer);
-               del_timer_sync(&ax25->t2timer);
-               del_timer_sync(&ax25->t3timer);
-               del_timer_sync(&ax25->idletimer);
                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
                ax25_dev_put(ax25_dev);
        }

Argumentation:

Author Duoming Zhou <duoming@zju.edu.cn> described his patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 , that lead to the regession, with:

===
ax25: Fix UAF bugs in ax25 timers
There are race conditions that may lead to UAF bugs in
ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(),
ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we call
ax25_release() to deallocate ax25_dev.

One of the UAF bugs caused by ax25_release() is shown below:

      (Thread 1)                    |      (Thread 2)
ax25_dev_device_up() //(1)          |
...                                 | ax25_kill_by_device()
ax25_bind()          //(2)          |
ax25_connect()                      | ...
 ax25_std_establish_data_link()     |
  ax25_start_t1timer()              | ax25_dev_device_down() //(3)
   mod_timer(&ax25->t1timer,..)     |
                                    | ax25_release()
   (wait a time)                    |  ...
                                    |  ax25_dev_put(ax25_dev) //(4)FREE
   ax25_t1timer_expiry()            |
    ax25->ax25_dev->values[..] //USE|  ...
     ...                            |

We increase dev in position (1) and (2), and
decrease the refcount of ax25_dev in position (3) and (4).
The ax25_dev will be freed in position (4) and be used in
ax25_t1timer_expiry().
===


Imho, Duoming has overseen when he fixed the ifconfig device down issue, that ax25_release() is called every time a userspace program closes his socket.
(.release        = ax25_release)

Timers are for good, not for bad. ax25 timers are important for correct behavior of the AX25_STATE machine. And the "heartbeat" timer is used internally for correct session cleanup.
I.e., if user closes his AX25-sessioon, kernel should send DISC upon either UA or DM are received, or until t1 expiry. The session and it's state needs still to be known for some time on normal session close. The only case where the session sate not needed anymore if you set ifcoonfig iface down.


Duoming's approach lead to
if (ax25_dev) {  // existence of ax25_dev: normal situation for a regularly closed ax25 session
              del_timer_sync(&ax25->timer);   // this is the "heartbeat" timer
              del_timer_sync(&ax25->t1timer); // .. and these are the others
              del_timer_sync(&ax25->t2timer);
              del_timer_sync(&ax25->t3timer);
              del_timer_sync(&ax25->idletimer);


The assurance that timers are stopped should to go to ax25_kill_by_device() [this is for the situation, that the SysOp sets ifconfig ax25iface down], from where ax25_release() is called. See my patch suggestion.

I argue, that the stopping of the timer in ax25_kill_by_device() is not really necessary, because it calls
ax25_disconnect(s, ENETUNREACH);
and ax25_disconnect() does:
        if (reason == ENETUNREACH) {
                del_timer_sync(&ax25->timer);
                del_timer_sync(&ax25->t1timer);
                del_timer_sync(&ax25->t2timer);
                del_timer_sync(&ax25->t3timer);
                del_timer_sync(&ax25->idletimer);

-> Our timers get stopped anyway.

But on the other hand, Duoming had the "BUG: KASAN: use-after-free in ax25_t1timer_expiry+0x1c/0x60" issue, that inspired him for his patch..

But on another hand. the reason == ENETUNREACH check in ax5_disconnect is quite new: It came by Duoming's "ax25: Fix NULL pointer dereferences in ax25 timers" patch to ax25_subr.c, ten days before (commit ID fc6d01ff). If ax25_disconnect() deletes the timers correctly, it may not be needed as an insurance for ax25_kill_by_device().
Both of Duoming's patches try to solve the same problem.
It should be investigated if the del_timer_sync()-assurance has to be at so many different points: we stop them, and call a function that also stops them. If this is really needed, than the problem we try to resolve with this lays somewhere else.


ax25_destroy_socket() is called on heartbeat expiry if AX25_STATE is 0. In the case we stop the timers in ax25_kill_by_device(), ax25_destroy_socket() is not called.
ax25_cb_del() is normaly done by ax25_destroy_socket(). 

If we don't call ax25_cb_del() directly, the situation is this, for a current incoming connection in state ESTABLISHED and after ifdown bpq0:
DL9AU-3    DL9SAU-1   ???     LISTENING    000/000  0       0
-> No timer cleans this up.

If I'd try to call ax25_destroy_socket() instead of ax25_cb_del(), this would cause a refcount-underflow OOPS due to the dev_put_track-stuff.

ax25_destroy_socket() would do ax25_cb_del(), and also stop timers. But it also may start timers again (ax25_start_heartbeat(sax25), or by timer_setup(&ax25->dtimer, ax25_destroy_timer, 0) ) -- which is not in our interest in this the case the device is set to down.

That's why assure ax25_cb_del() in my approach.


In the ax25_disconnect() del_timer_sync in "reason == ENETUNREACH", I suggest to revert. This was also introduced to solve the ax25_kill_by_device() problem.
The need for enforced stopping the timers I only see in the special case when a device is set to down.


Cc goes to David Miller and Paolo Abeni, because they committed the both patches I discussed above.


vy 73,
	- Thomas  dl9sau




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-23  5:46 Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 Thomas Osterried
@ 2022-05-23 13:52 ` Dan Carpenter
  2022-05-24  9:25   ` Thomas Osterried
  2022-05-23 14:07 ` duoming
  1 sibling, 1 reply; 12+ messages in thread
From: Dan Carpenter @ 2022-05-23 13:52 UTC (permalink / raw)
  To: Thomas Osterried; +Cc: linux-hams, Duoming Zhou, David S. Miller, Paolo Abeni

On Mon, May 23, 2022 at 07:46:57AM +0200, Thomas Osterried wrote:
> 
> Regression due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 "ax25: Fix UAF bugs in ax25 timers",
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/af_ax25.c?id=82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
> and discussion about necessarity of
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/ax25_subr.c?id=fc6d01ff9ef03b66d4a3a23b46fc3c3d8cf92009
> (because it addresses the same problem).
> 
> 
> Due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc, any ax25 session is cleaned up anymore on disconnect.
> -> netstat --ax25 shows the ex-connected session it in state "LISTEN". Makes ax25 unusable in a production environment.
> => Regression.

Please line wrap your emails at 75 characters.

I tried to apply your patch but the format is wrong...  :/

Hopefully Duoming will read this message but the sad truth is that no
one likes to read RFC patches so you'd be better off sending this as
a real patch instead.

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-23  5:46 Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 Thomas Osterried
  2022-05-23 13:52 ` Dan Carpenter
@ 2022-05-23 14:07 ` duoming
  2022-05-23 16:05   ` Roland Schwarz
  2022-05-23 19:14   ` Thomas Osterried
  1 sibling, 2 replies; 12+ messages in thread
From: duoming @ 2022-05-23 14:07 UTC (permalink / raw)
  To: Thomas Osterried; +Cc: linux-hams, David S. Miller, Paolo Abeni

Hello,

On Mon, 23 May 2022 07:46:57 +0200 Thomas wrote:

> Regression due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 "ax25: Fix UAF bugs in ax25 timers",
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/af_ax25.c?id=82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
> and discussion about necessarity of
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/ax25_subr.c?id=fc6d01ff9ef03b66d4a3a23b46fc3c3d8cf92009
> (because it addresses the same problem).
> 
> 
> Due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc, any ax25 session is cleaned up anymore on disconnect.
> -> netstat --ax25 shows the ex-connected session it in state "LISTEN". Makes ax25 unusable in a production environment.
> => Regression.

Is there any ways to reproduce this problem? 

> This is my proposal for a fix.
> I'd appreciate you test and give feedback.

I will test this solution and give feedback.

Best regards,
Duoming Zhou

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-23 14:07 ` duoming
@ 2022-05-23 16:05   ` Roland Schwarz
  2022-05-23 18:53     ` Thomas Osterried
  2022-05-23 19:14   ` Thomas Osterried
  1 sibling, 1 reply; 12+ messages in thread
From: Roland Schwarz @ 2022-05-23 16:05 UTC (permalink / raw)
  To: duoming, Thomas Osterried; +Cc: linux-hams, David S. Miller, Paolo Abeni


[-- Attachment #1.1: Type: text/plain, Size: 489 bytes --]



On 23.05.22 at 16:07 wrote duoming@zju.edu.cn:
...>
> Is there any ways to reproduce this problem?
> 

As I understand it, the stuck LISTEN state is the infamous long-standing 
"stuck sockets" bug which Dave was pointing at in his mail "Status of 
the stuck sockets bugs" from 2021-06-29.

73 de Roland
-- 
__________________________________________
   _  _  | Roland Schwarz
  |_)(_  |
  | \__) | mailto:roland.schwarz@blackspace.at
________| http://www.blackspace.at

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-23 16:05   ` Roland Schwarz
@ 2022-05-23 18:53     ` Thomas Osterried
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas Osterried @ 2022-05-23 18:53 UTC (permalink / raw)
  To: Roland Schwarz; +Cc: duoming, Thomas Osterried, linux-hams

On Mon, May 23, 2022 at 06:05:47PM +0200, Roland Schwarz wrote:
> 
> 
> On 23.05.22 at 16:07 wrote duoming@zju.edu.cn:
> ...>
> > Is there any ways to reproduce this problem?
> > 
> 
> As I understand it, the stuck LISTEN state is the
> infamous long-standing "stuck sockets" bug which
> Dave was pointing at in his mail "Status of the
> stuck sockets bugs" from 2021-06-29.

No, this is a new cause, resulting to the same
symtom (listening state ex-connected sessions).

As said, of you stop timers in cases you don't need
to, nothing tidies up (and nothing honors
the ax25 state machine).


There had been cases observed (since a few years).
I've seen such a case and have a fix I'll present
later. But let's resolve one after another,
not in parallel.

vy 73,
  - Thomas  dl9sau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-23 14:07 ` duoming
  2022-05-23 16:05   ` Roland Schwarz
@ 2022-05-23 19:14   ` Thomas Osterried
  1 sibling, 0 replies; 12+ messages in thread
From: Thomas Osterried @ 2022-05-23 19:14 UTC (permalink / raw)
  To: duoming; +Cc: Thomas Osterried, linux-hams

> > Due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc, any ax25 session is cleaned up anymore on disconnect.
> > -> netstat --ax25 shows the ex-connected session it in state "LISTEN". Makes ax25 unusable in a production environment.
> > => Regression.
> 
> Is there any ways to reproduce this problem? 

I found no ax25 connection that cleaned up anymore.

It's useful to test many (best: all) of these cases:
incoming connection from remote. and initiate disconnect from remote.
  [In this case, we have no socket to userspace]. and send terminate it from there.
outgoing connection initiated by sending an ip packet to remote (iface configured for mode VC).
  [In this case, we have no so socket to userspace]
Incoming connection fron remote and local started ax25d which listens 
for pid=test (-> socket is allocated).
outgoing connection initiated by userspace connect(). and terminate it.

And for each one, kill hard the remote site and 
- see if after idle timer expiry everything goes the correct way.
- send a packet (-> retries until retry timer expiry) and see if after timer expiry everything goes the correct way.

I tested many of these combinations, but not all.

All these cases can happen normaly in the real world.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-23 13:52 ` Dan Carpenter
@ 2022-05-24  9:25   ` Thomas Osterried
  2022-05-24 10:47     ` duoming
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Osterried @ 2022-05-24  9:25 UTC (permalink / raw)
  To: Dan Carpenter, duoming; +Cc: linux-hams, David S. Miller, Paolo Abeni


> Am 23.05.2022 um 15:52 schrieb Dan Carpenter <dan.carpenter@oracle.com>:
> 
> 
[..]
> 
> I tried to apply your patch but the format is wrong...  :/

I thought it would help us to discuss the problem on the mailinglist
at the code fragment.
But I also asked for testing, and then you need a working patch
Was my fault. Sorry for that.


diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index 363d47f94532..498e92fb43b7 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -86,11 +86,17 @@ static void ax25_kill_by_device(struct net_device *dev)
 again:
        ax25_for_each(s, &ax25_list) {
                if (s->ax25_dev == ax25_dev) {
+                       del_timer_sync(&s->timer);
+                       del_timer_sync(&s->t1timer);
+                       del_timer_sync(&s->t2timer);
+                       del_timer_sync(&s->t3timer);
+                       del_timer_sync(&s->idletimer);
                        sk = s->sk;
                        if (!sk) {
                                spin_unlock_bh(&ax25_list_lock);
                                ax25_disconnect(s, ENETUNREACH);
                                s->ax25_dev = NULL;
+                               ax25_cb_del(s);
                                spin_lock_bh(&ax25_list_lock);
                                goto again;
                        }
@@ -104,6 +110,7 @@ static void ax25_kill_by_device(struct net_device *dev)
                                ax25_dev_put(ax25_dev);
                        }
                        release_sock(sk);
+                       ax25_cb_del(s);
                        spin_lock_bh(&ax25_list_lock);
                        sock_put(sk);
                        /* The entry could have been deleted from the
@@ -1052,12 +1059,8 @@ static int ax25_release(struct socket *sock)
                sk->sk_state_change(sk);
                ax25_destroy_socket(ax25);
        }
+
        if (ax25_dev) {
-               del_timer_sync(&ax25->timer);
-               del_timer_sync(&ax25->t1timer);
-               del_timer_sync(&ax25->t2timer);
-               del_timer_sync(&ax25->t3timer);
-               del_timer_sync(&ax25->idletimer);
                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
                ax25_dev_put(ax25_dev);
        }


In that context, I ask to revert (as I argued in my initial mail):
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/ax25_subr.c?id=fc6d01ff9ef03b66d4a3a23b46fc3c3d8cf92009



In my diff from yesterday, I also had a part around "case AX25_STATE_0:").
This fixes another problem.
I removed it from my patch above: it has to discussed later as separate topic.


Btw, Douming, you introduced del_timer_sync().
ax25_stop_heartbeat(), ax25_stop_t1timer(), ax25_stop_t2timer(), ax25_stop_t3timer(),
ax25_stop_idletimer() from ax25_timer.c use del_timer().

Why do you use del_timer_sync() directly?
And if del_timer_sync() is needed, is it only needed in a special case,
or isn't it cleaner to change the ax25_stop_xxx() timer functions?


vy 73,
	- Thomas  dl9sau





^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-24  9:25   ` Thomas Osterried
@ 2022-05-24 10:47     ` duoming
  2022-05-24 21:52       ` Thomas Osterried
  0 siblings, 1 reply; 12+ messages in thread
From: duoming @ 2022-05-24 10:47 UTC (permalink / raw)
  To: Thomas Osterried; +Cc: Dan Carpenter, linux-hams, David S. Miller, Paolo Abeni

Hello,

On Tue, 24 May 2022 11:25:28 +0200 Thomas wrote:

> > I tried to apply your patch but the format is wrong...  :/
> 
> I thought it would help us to discuss the problem on the mailinglist
> at the code fragment.
> But I also asked for testing, and then you need a working patch
> Was my fault. Sorry for that.
> 
> 
> diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
> index 363d47f94532..498e92fb43b7 100644
> --- a/net/ax25/af_ax25.c
> +++ b/net/ax25/af_ax25.c
> @@ -86,11 +86,17 @@ static void ax25_kill_by_device(struct net_device *dev)
>  again:
>         ax25_for_each(s, &ax25_list) {
>                 if (s->ax25_dev == ax25_dev) {
> +                       del_timer_sync(&s->timer);
> +                       del_timer_sync(&s->t1timer);
> +                       del_timer_sync(&s->t2timer);
> +                       del_timer_sync(&s->t3timer);
> +                       del_timer_sync(&s->idletimer);

These del_timer_sync() functions are unnecessary, Because when the device is detaching, 
the reason parameter of ax25_disconnect() equals to ENETUNREACH, the del_timer_sync()
in ax25_disconnect() will execute.

>                         sk = s->sk;
>                         if (!sk) {
>                                 spin_unlock_bh(&ax25_list_lock);
>                                 ax25_disconnect(s, ENETUNREACH);
>                                 s->ax25_dev = NULL;
> +                               ax25_cb_del(s);
>                                 spin_lock_bh(&ax25_list_lock);
>                                 goto again;
>                         }
> @@ -104,6 +110,7 @@ static void ax25_kill_by_device(struct net_device *dev)
>                                 ax25_dev_put(ax25_dev);
>                         }
>                         release_sock(sk);
> +                       ax25_cb_del(s);
>                         spin_lock_bh(&ax25_list_lock);
>                         sock_put(sk);
>                         /* The entry could have been deleted from the
> @@ -1052,12 +1059,8 @@ static int ax25_release(struct socket *sock)
>                 sk->sk_state_change(sk);
>                 ax25_destroy_socket(ax25);
>         }
> +
>         if (ax25_dev) {
> -               del_timer_sync(&ax25->timer);
> -               del_timer_sync(&ax25->t1timer);
> -               del_timer_sync(&ax25->t2timer);
> -               del_timer_sync(&ax25->t3timer);
> -               del_timer_sync(&ax25->idletimer);
>                 dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
>                 ax25_dev_put(ax25_dev);
>         }

I think these del_timer_sync() functions could not be removed, otherwise the UAF bugs caused by
timer handler could not be mitigated.

      (Thread 1)                       |      (Thread 2)
ax25_dev_device_up() //(1)             |
...                                    | ax25_kill_by_device()
ax25_bind()          //(2)             |
ax25_connect()                         | ...
 ax25_std_establish_data_link()        |
  ax25_start_t1timer()                 | ax25_dev_device_down() //(3)
   mod_timer(&ax25->t1timer,..)        |
                                       | ax25_release()
   (wait a time)                       |  ...
                                       |  ax25_dev_put(ax25_dev) //(4)FREE
   ax25_t1timer_expiry()               |
    ax25->ax25_dev->values[..] //(5)USE|  ...
     ...                               |

The ax25_dev is deallocated in position (4) and use in position (5).

> In that context, I ask to revert (as I argued in my initial mail):
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/ax25_subr.c?id=fc6d01ff9ef03b66d4a3a23b46fc3c3d8cf92009

This patch could not be revert. Because when the device is detaching, the timer handler could be running.
If we do not wait the timer handler to finish, the NULL pointer dereference bug will happen.


      (Thread 1)                    |      (Thread 2)
ax25_connect()                      |
 ax25_std_establish_data_link()     |
  ax25_start_t1timer()              |
   mod_timer(&ax25->t1timer,..)     |
                                    | ax25_kill_by_device()
   (wait a time)                    |  ...
                                    |  s->ax25_dev = NULL; //(1)
   ax25_t1timer_expiry()            |
    ax25->ax25_dev->values[..] //(2)|  ...
     ...                            |

We set null to ax25_cb->ax25_dev in position (1) and dereference
the null pointer in position (2).

> In my diff from yesterday, I also had a part around "case AX25_STATE_0:").
> This fixes another problem.
> I removed it from my patch above: it has to discussed later as separate topic.
> 
> 
> Btw, Douming, you introduced del_timer_sync().
> ax25_stop_heartbeat(), ax25_stop_t1timer(), ax25_stop_t2timer(), ax25_stop_t3timer(),
> ax25_stop_idletimer() from ax25_timer.c use del_timer().
> 
> Why do you use del_timer_sync() directly?
> And if del_timer_sync() is needed, is it only needed in a special case,
> or isn't it cleaner to change the ax25_stop_xxx() timer functions?

We could not change del_timer() in ax25_stop_xxx() to del_timer_sync(). Because
the ax25_stop_xxx() is called by many functions such as ax25_ds_state1_machine(),
ax25_ctl_ioctl(), ax25_release() and so on. If we block and wait for timer handler
to finish in these functions, the system will hang in some cases.

Only the reason parameter of ax25_disconnect() equals to ENETUNREACH which means
the device is detaching or the ax25 socket has been destroyed should we use
del_timer_sync() to wait the timer handler to finish. 

Best regards,
Duoming Zhou

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-24 10:47     ` duoming
@ 2022-05-24 21:52       ` Thomas Osterried
  2022-05-25  2:34         ` duoming
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Osterried @ 2022-05-24 21:52 UTC (permalink / raw)
  To: duoming; +Cc: Dan Carpenter, linux-hams, David S. Miller, Paolo Abeni



> Am 24.05.2022 um 12:47 schrieb duoming@zju.edu.cn:
> 
> Hello,
> 
> On Tue, 24 May 2022 11:25:28 +0200 Thomas wrote:
> 
>>> I tried to apply your patch but the format is wrong...  :/
>> 
>> I thought it would help us to discuss the problem on the mailinglist
>> at the code fragment.
>> But I also asked for testing, and then you need a working patch
>> Was my fault. Sorry for that.
>> 
>> 
>> diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
>> index 363d47f94532..498e92fb43b7 100644
>> --- a/net/ax25/af_ax25.c
>> +++ b/net/ax25/af_ax25.c
>> @@ -86,11 +86,17 @@ static void ax25_kill_by_device(struct net_device *dev)
>> again:
>>        ax25_for_each(s, &ax25_list) {
>>                if (s->ax25_dev == ax25_dev) {
>> +                       del_timer_sync(&s->timer);
>> +                       del_timer_sync(&s->t1timer);
>> +                       del_timer_sync(&s->t2timer);
>> +                       del_timer_sync(&s->t3timer);
>> +                       del_timer_sync(&s->idletimer);
> 
> These del_timer_sync() functions are unnecessary, Because when the device is detaching, 
> the reason parameter of ax25_disconnect() equals to ENETUNREACH, the del_timer_sync()
> in ax25_disconnect() will execute.

On the other hand, we have the ENETUNREACH case only in exact this scenario:
device goes down -> ax25_kill_by_device -> ax25_disconnect().

That's why argued it's better here to stop the timers.

> 
>>                        sk = s->sk;
>>                        if (!sk) {
>>                                spin_unlock_bh(&ax25_list_lock);
>>                                ax25_disconnect(s, ENETUNREACH);
>>                                s->ax25_dev = NULL;
>> +                               ax25_cb_del(s);
>>                                spin_lock_bh(&ax25_list_lock);
>>                                goto again;
>>                        }
>> @@ -104,6 +110,7 @@ static void ax25_kill_by_device(struct net_device *dev)
>>                                ax25_dev_put(ax25_dev);
>>                        }
>>                        release_sock(sk);
>> +                       ax25_cb_del(s);
>>                        spin_lock_bh(&ax25_list_lock);
>>                        sock_put(sk);
>>                        /* The entry could have been deleted from the
>> @@ -1052,12 +1059,8 @@ static int ax25_release(struct socket *sock)
>>                sk->sk_state_change(sk);
>>                ax25_destroy_socket(ax25);
>>        }
>> +
>>        if (ax25_dev) {
>> -               del_timer_sync(&ax25->timer);
>> -               del_timer_sync(&ax25->t1timer);
>> -               del_timer_sync(&ax25->t2timer);
>> -               del_timer_sync(&ax25->t3timer);
>> -               del_timer_sync(&ax25->idletimer);
>>                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
>>                ax25_dev_put(ax25_dev);
>>        }
> 
> I think these del_timer_sync() functions could not be removed, otherwise the UAF bugs caused by
> timer handler could not be mitigated.

I disagree. See below:

> 
> 
>      (Thread 1)                       |      (Thread 2)
> ax25_dev_device_up() //(1)             |
> ...                                    | ax25_kill_by_device()
> ax25_bind()          //(2)             |
> ax25_connect()                         | ...
> ax25_std_establish_data_link()        |
>  ax25_start_t1timer()                 | ax25_dev_device_down() //(3)
>   mod_timer(&ax25->t1timer,..)        |
>                                       | ax25_release()
>   (wait a time)                       |  ...
>                                       |  ax25_dev_put(ax25_dev) //(4)FREE
>   ax25_t1timer_expiry()               |
>    ax25->ax25_dev->values[..] //(5)USE|  ...
>     ...                               |
> 
> The ax25_dev is deallocated in position (4) and use in position (5).

I see the problem.


ax25_release() is the wrong place.
ax25_release() is called on normal socket close() from userspace
(i.E. termination of "call" session to a remote host).
The (running) timers are needed for a normal session cleanup (ending DISC); 
that's the sense of these timers.
You see this in the
  case AX25_STATE_3:
  case AX25_STATE_4:
conditions, where t1 is started.
=> The ax25_dev != NULL condition is normal for a normal session close condition.


If we like to delete the timers here because we are in ax25_release due to the
ax25_kill_by_device/ax25_dev_device_down event, do we have another option to see who
called us, and for being able to handle the case correctly?


Or better - I already asked this question:
ax25_kill_by_device calls ax25_disconnect(ENETUNREACH), and in
ax25_disconnect() the timers are stopped.
That was introduced by one of your patches.
After this, ax25_release() is called.
Since ax25_disconnect() already stopped the timers, there's no need to stop them
again. ->
  If ax25_release() is called by userspace socket_close(), timers are not stoped,
  and session shutdown with timers runs properly.


Furthermore:
directly after ax25_disconnect() which you call in ax25_kill_by_device(), 
struct ax25_cb *s s->ax25_dev is set to NULL.
-> If ax25_release() is called for the ax25_cb, the del_timer_assurances in
ax25_release() won't work anyway, because they are in the "if (ax25_dev) { ..." part.



This would lead to the following patch:

diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index 363d47f94532..07693eb13185 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1052,12 +1052,8 @@ static int ax25_release(struct socket *sock)
                sk->sk_state_change(sk);
                ax25_destroy_socket(ax25);
        }
+
        if (ax25_dev) {
-               del_timer_sync(&ax25->timer);
-               del_timer_sync(&ax25->t1timer);
-               del_timer_sync(&ax25->t2timer);
-               del_timer_sync(&ax25->t3timer);
-               del_timer_sync(&ax25->idletimer);
                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
                ax25_dev_put(ax25_dev);
        }


ax25_subr.c: unchanged



> 
>      (Thread 1)                    |      (Thread 2)
> ax25_connect()                      |
> ax25_std_establish_data_link()     |
>  ax25_start_t1timer()              |
>   mod_timer(&ax25->t1timer,..)     |
>                                    | ax25_kill_by_device()
>   (wait a time)                    |  ...
>                                    |  s->ax25_dev = NULL; //(1)
>   ax25_t1timer_expiry()            |
>    ax25->ax25_dev->values[..] //(2)|  ...
>     ...                            |
> 
> We set null to ax25_cb->ax25_dev in position (1) and dereference
> the null pointer in position (2).

Well, perhaps one solution is to check in ax25_xxx_timer_expiry() if ax25->ax25_dev is not NULL.

> 
>> In my diff from yesterday, I also had a part around "case AX25_STATE_0:").
>> This fixes another problem.
>> I removed it from my patch above: it has to discussed later as separate topic.
>> 
>> 
>> Btw, Douming, you introduced del_timer_sync().
>> ax25_stop_heartbeat(), ax25_stop_t1timer(), ax25_stop_t2timer(), ax25_stop_t3timer(),
>> ax25_stop_idletimer() from ax25_timer.c use del_timer().
>> 
>> Why do you use del_timer_sync() directly?
>> And if del_timer_sync() is needed, is it only needed in a special case,
>> or isn't it cleaner to change the ax25_stop_xxx() timer functions?
> 
> We could not change del_timer() in ax25_stop_xxx() to del_timer_sync(). Because
> the ax25_stop_xxx() is called by many functions such as ax25_ds_state1_machine(),
> ax25_ctl_ioctl(), ax25_release() and so on. If we block and wait for timer handler
> to finish in these functions, the system will hang in some cases.
> 
> Only the reason parameter of ax25_disconnect() equals to ENETUNREACH which means
> the device is detaching or the ax25 socket has been destroyed should we use
> del_timer_sync() to wait the timer handler to finish. 
> 
> Best regards,
> Duoming Zhou

vy 73,
	- Thomas  dl9sau


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-24 21:52       ` Thomas Osterried
@ 2022-05-25  2:34         ` duoming
  2022-05-26 22:56           ` Thomas Osterried
  0 siblings, 1 reply; 12+ messages in thread
From: duoming @ 2022-05-25  2:34 UTC (permalink / raw)
  To: Thomas Osterried; +Cc: Dan Carpenter, linux-hams, David S. Miller, Paolo Abeni


Hello,

On Tue, 24 May 2022 23:52:58 +0200 Thomas wrote:

> >>> I tried to apply your patch but the format is wrong...  :/
> >> 
> >> I thought it would help us to discuss the problem on the mailinglist
> >> at the code fragment.
> >> But I also asked for testing, and then you need a working patch
> >> Was my fault. Sorry for that.
> >> 
> >> 
> >> diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
> >> index 363d47f94532..498e92fb43b7 100644
> >> --- a/net/ax25/af_ax25.c
> >> +++ b/net/ax25/af_ax25.c
> >> @@ -86,11 +86,17 @@ static void ax25_kill_by_device(struct net_device *dev)
> >> again:
> >>        ax25_for_each(s, &ax25_list) {
> >>                if (s->ax25_dev == ax25_dev) {
> >> +                       del_timer_sync(&s->timer);
> >> +                       del_timer_sync(&s->t1timer);
> >> +                       del_timer_sync(&s->t2timer);
> >> +                       del_timer_sync(&s->t3timer);
> >> +                       del_timer_sync(&s->idletimer);
> > 
> > These del_timer_sync() functions are unnecessary, Because when the device is detaching, 
> > the reason parameter of ax25_disconnect() equals to ENETUNREACH, the del_timer_sync()
> > in ax25_disconnect() will execute.
> 
> On the other hand, we have the ENETUNREACH case only in exact this scenario:
> device goes down -> ax25_kill_by_device -> ax25_disconnect().
> 
> That's why argued it's better here to stop the timers.

I think you miss one situation:
If we don't call ax25_bind() before ax25_kill_by_device(), there is no s->ax25_dev
equals to ax25_dev when ax25_kill_by_device() is executing and the del_timer_sync()
will not execute.

static void ax25_kill_by_device(struct net_device *dev)
{
...
	ax25_for_each(s, &ax25_list) {
		if (s->ax25_dev == ax25_dev) { //no s->ax25_dev equals to ax25_dev
+                       del_timer_sync(&s->timer);  //will not execute and the same below
+                       del_timer_sync(&s->t1timer); 
+                       del_timer_sync(&s->t2timer);
+                       del_timer_sync(&s->t3timer);
+                       del_timer_sync(&s->idletimer);

Even if ax25_kill_by_device() has been executed, the timers could not be stopped as well. 
As a result, the following UAF bug will happen:

      (Thread 1)                       |      (Thread 2)
ax25_dev_device_up()                   |
...                                    | ax25_kill_by_device()
ax25_bind()                            |
ax25_connect()                         | ...
 ax25_std_establish_data_link()        |
  ax25_start_t1timer()                 | ax25_dev_device_down()
   mod_timer(&ax25->t1timer,..)        |
                                       | ax25_release()
   (wait a time)                       |  ...
                                       |  ax25_dev_put(ax25_dev) //FREE
   ax25_t1timer_expiry()               |
    ax25->ax25_dev->values[..] //USE   |  ...
     ...                               |

Your new patch could not mitigate this problem.

> >>                        sk = s->sk;
> >>                        if (!sk) {
> >>                                spin_unlock_bh(&ax25_list_lock);
> >>                                ax25_disconnect(s, ENETUNREACH);
> >>                                s->ax25_dev = NULL;
> >> +                               ax25_cb_del(s);
> >>                                spin_lock_bh(&ax25_list_lock);
> >>                                goto again;
> >>                        }
> >> @@ -104,6 +110,7 @@ static void ax25_kill_by_device(struct net_device *dev)
> >>                                ax25_dev_put(ax25_dev);
> >>                        }
> >>                        release_sock(sk);
> >> +                       ax25_cb_del(s);
> >>                        spin_lock_bh(&ax25_list_lock);
> >>                        sock_put(sk);
> >>                        /* The entry could have been deleted from the
> >> @@ -1052,12 +1059,8 @@ static int ax25_release(struct socket *sock)
> >>                sk->sk_state_change(sk);
> >>                ax25_destroy_socket(ax25);
> >>        }
> >> +
> >>        if (ax25_dev) {
> >> -               del_timer_sync(&ax25->timer);
> >> -               del_timer_sync(&ax25->t1timer);
> >> -               del_timer_sync(&ax25->t2timer);
> >> -               del_timer_sync(&ax25->t3timer);
> >> -               del_timer_sync(&ax25->idletimer);
> >>                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
> >>                ax25_dev_put(ax25_dev);
> >>        }
> > 
> > I think these del_timer_sync() functions could not be removed, otherwise the UAF bugs caused by
> > timer handler could not be mitigated.
> 
> I disagree. See below:
> 
> > 
> > 
> >      (Thread 1)                       |      (Thread 2)
> > ax25_dev_device_up() //(1)             |
> > ...                                    | ax25_kill_by_device()
> > ax25_bind()          //(2)             |
> > ax25_connect()                         | ...
> > ax25_std_establish_data_link()        |
> >  ax25_start_t1timer()                 | ax25_dev_device_down() //(3)
> >   mod_timer(&ax25->t1timer,..)        |
> >                                       | ax25_release()
> >   (wait a time)                       |  ...
> >                                       |  ax25_dev_put(ax25_dev) //(4)FREE
> >   ax25_t1timer_expiry()               |
> >    ax25->ax25_dev->values[..] //(5)USE|  ...
> >     ...                               |
> > 
> > The ax25_dev is deallocated in position (4) and use in position (5).
> 
> I see the problem.
> 
> 
> ax25_release() is the wrong place.
> ax25_release() is called on normal socket close() from userspace
> (i.E. termination of "call" session to a remote host).
> The (running) timers are needed for a normal session cleanup (ending DISC); 
> that's the sense of these timers.
> You see this in the
>   case AX25_STATE_3:
>   case AX25_STATE_4:
> conditions, where t1 is started.
> => The ax25_dev != NULL condition is normal for a normal session close condition.
> 
> 
> If we like to delete the timers here because we are in ax25_release due to the
> ax25_kill_by_device/ax25_dev_device_down event, do we have another option to see who
> called us, and for being able to handle the case correctly?
> 
> 
> Or better - I already asked this question:
> ax25_kill_by_device calls ax25_disconnect(ENETUNREACH), and in
> ax25_disconnect() the timers are stopped.
> That was introduced by one of your patches.
> After this, ax25_release() is called.
> Since ax25_disconnect() already stopped the timers, there's no need to stop them
> again. ->

If we don't call ax25_bind() before ax25_kill_by_device(), there is no s->ax25_dev
equals to ax25_dev when ax25_kill_by_device is executing and the ax25_disconnect()
will not execute. Even if ax25_kill_by_device() has been executed, the timers could
not be stopped as well. So we need to stop them again in ax25_release(). 

>   If ax25_release() is called by userspace socket_close(), timers are not stoped,
>   and session shutdown with timers runs properly.

I think adding a check in ax25_release() which is used to judge whether ax25_kill_by_device()
has been executed before ax25_release() is better. The detail is shown below:

diff --git a/include/net/ax25.h b/include/net/ax25.h
index 0f9790c455b..a427a05672e 100644
--- a/include/net/ax25.h
+++ b/include/net/ax25.h
@@ -228,6 +228,7 @@ typedef struct ax25_dev {
        ax25_dama_info          dama;
 #endif
        refcount_t              refcount;
+       bool device_up;
 } ax25_dev;

 typedef struct ax25_cb {
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index 363d47f9453..47ce6b630cc 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -81,6 +81,7 @@ static void ax25_kill_by_device(struct net_device *dev)

        if ((ax25_dev = ax25_dev_ax25dev(dev)) == NULL)
                return;
+       ax25_dev->device_up = false;

        spin_lock_bh(&ax25_list_lock);
 again:
@@ -1053,11 +1054,13 @@ static int ax25_release(struct socket *sock)
                ax25_destroy_socket(ax25);
        }
        if (ax25_dev) {
-               del_timer_sync(&ax25->timer);
-               del_timer_sync(&ax25->t1timer);
-               del_timer_sync(&ax25->t2timer);
-               del_timer_sync(&ax25->t3timer);
-               del_timer_sync(&ax25->idletimer);
+               if (!ax25_dev->device_up) {
+                       del_timer_sync(&ax25->timer);
+                       del_timer_sync(&ax25->t1timer);
+                       del_timer_sync(&ax25->t2timer);
+                       del_timer_sync(&ax25->t3timer);
+                       del_timer_sync(&ax25->idletimer);
+               }
                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
                ax25_dev_put(ax25_dev);
        }
diff --git a/net/ax25/ax25_dev.c b/net/ax25/ax25_dev.c
index d2a244e1c26..5451be15e07 100644
--- a/net/ax25/ax25_dev.c
+++ b/net/ax25/ax25_dev.c
@@ -62,6 +62,7 @@ void ax25_dev_device_up(struct net_device *dev)
        ax25_dev->dev     = dev;
        dev_hold_track(dev, &ax25_dev->dev_tracker, GFP_ATOMIC);
        ax25_dev->forward = NULL;
+       ax25_dev->device_up = true;

        ax25_dev->values[AX25_VALUES_IPDEFMODE] = AX25_DEF_IPDEFMODE;
        ax25_dev->values[AX25_VALUES_AXDEFMODE] = AX25_DEF_AXDEFMODE;

> Furthermore:
> directly after ax25_disconnect() which you call in ax25_kill_by_device(), 
> struct ax25_cb *s s->ax25_dev is set to NULL.
> -> If ax25_release() is called for the ax25_cb, the del_timer_assurances in
> ax25_release() won't work anyway, because they are in the "if (ax25_dev) { ..." part.

We only need to stop timers once. If the timers could be stopped in ax25_kill_by_device(),
there is no need to stop them in ax25_release(). In other words, if the timers could not
be stopped in ax25_kill_by_device(), we need to stop them in ax25_release().

Best regards,
Duoming Zhou


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-25  2:34         ` duoming
@ 2022-05-26 22:56           ` Thomas Osterried
  2022-05-27  4:48             ` duoming
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Osterried @ 2022-05-26 22:56 UTC (permalink / raw)
  To: Duoming Zhou; +Cc: Dan Carpenter, linux-hams, David S. Miller, Paolo Abeni

Hello,

first of all:
Your approach with
   if (!ax25_dev->device_up) { ...
looks good and I will test it.



I'm not so happy with adding new states to the ax25_dev struct.
When I asked "If we like to delete the timers here because we are in ax25_release due
  to the ax25_kill_by_device/ax25_dev_device_down event, do we have another option to
  see what called us, and for being able to handle the case correctly?",
I had in mind if we could see this situation in the current state inside
the ax25_cb, like ax25->state or ax25->sk->..

But in case of ax25->state, ax25_connect() changes the state when executed in
a parallel thread after ax25_kill_by_device() -> that's no option


Another one I thought about was an assurance in ax25_start_xxx_timer() functions,
that timer should not start if ax25_cb->ax25_dev == NULL,
but your examples showed, that
  mod_timer(&ax25->t1timer,..)
is also used directly to modify running timer.


An other option may be assurances in ax25_xxx_expiry() to immediately return if
ax25_cb->ax25_dev == NULL.

Indeed, I see in ax25_heartbeat_expiry(struct timer_list *t) this assurance:
          if (ax25->ax25_dev)
                proto = ax25->ax25_dev->values[AX25_VALUES_PROTOCOL];
Unfortunately, those checks are missing in the other timer_expiry functions and
one of them is ax25_t1timer_expiry, for which you showed in your kernel panic
backtrace.

If I see it correctly, this would solve the issue you had with the running
timers, without forced-stopping them in ax25_release().
But on the other hand: if the interface is really down, there's no need
for the timers to keep running, and keeping things running where parts
may already be freed is dangerous. The session has to been cleaned up immediately.
-> This clearly speaks for your latest approach.

Perhaps it's a good idea anyway to add the assurance (like ax25_heartbeat_expiry()
does) to the other ax25_xxx_expiry() functions. 



An other approach I had in mind was a variable in struct ax25_cb, that the cb
is in progress of being cleaned up.
This way, we could prohibit starting or modifying timers.
Unfortunately, there might be many checks in various functions necessary.
-> This clearly speaks for your latest approach.

Perhaps it's a good idea to also test in
ax25_setsockopt SO_BINDTODEVICE
and in
ax25_bind_to_device
if ax25_ dev->device_up flag is 0, and if so, return in  -ENODEV;


Off-topic-question: about the quite-new dev_hold_track():
while looking to 
ax25_setsockopt() and ax25_bind() for SO_BINDTODEVICE,
I se only that dev_hold_track() is called in ax25_bind(),
but not in ax25_setsockopt() for SO_BINDTODEVICE.
Is it correct that ax25_setsockopt() for SO_BINDTODEVICE
does not need to trigger the tracker?



About your answer (comments inline):


> Am 25.05.2022 um 04:34 schrieb duoming@zju.edu.cn:
> 
> 
> Hello,
> 
> On Tue, 24 May 2022 23:52:58 +0200 Thomas wrote:
> 
>>>>> I tried to apply your patch but the format is wrong...  :/
>>>> 
>>>> I thought it would help us to discuss the problem on the mailinglist
>>>> at the code fragment.
>>>> But I also asked for testing, and then you need a working patch
>>>> Was my fault. Sorry for that.
>>>> 
>>>> 
>>>> diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
>>>> index 363d47f94532..498e92fb43b7 100644
>>>> --- a/net/ax25/af_ax25.c
>>>> +++ b/net/ax25/af_ax25.c
>>>> @@ -86,11 +86,17 @@ static void ax25_kill_by_device(struct net_device *dev)
>>>> again:
>>>>       ax25_for_each(s, &ax25_list) {
>>>>               if (s->ax25_dev == ax25_dev) {
>>>> +                       del_timer_sync(&s->timer);
>>>> +                       del_timer_sync(&s->t1timer);
>>>> +                       del_timer_sync(&s->t2timer);
>>>> +                       del_timer_sync(&s->t3timer);
>>>> +                       del_timer_sync(&s->idletimer);
>>> 
>>> These del_timer_sync() functions are unnecessary, Because when the device is detaching, 
>>> the reason parameter of ax25_disconnect() equals to ENETUNREACH, the del_timer_sync()
>>> in ax25_disconnect() will execute.
>> 
>> On the other hand, we have the ENETUNREACH case only in exact this scenario:
>> device goes down -> ax25_kill_by_device -> ax25_disconnect().
>> 
>> That's why argued it's better here to stop the timers.
> 
> I think you miss one situation:

You mixed this e-mail together with a next e-mail where we already talked about keeping
del_timer_sync() in ax25_release().

> If we don't call ax25_bind() before ax25_kill_by_device(), there is no s->ax25_dev
> equals to ax25_dev when ax25_kill_by_device() is executing and the del_timer_sync()
> will not execute.
>
> static void ax25_kill_by_device(struct net_device *dev)
> {
> ...
> 	ax25_for_each(s, &ax25_list) {
> 		if (s->ax25_dev == ax25_dev) { //no s->ax25_dev equals to ax25_dev
> +                       del_timer_sync(&s->timer);  //will not execute and the same below
> +                       del_timer_sync(&s->t1timer); 
> +                       del_timer_sync(&s->t2timer);
> +                       del_timer_sync(&s->t3timer);
> +                       del_timer_sync(&s->idletimer);
> 
> Even if ax25_kill_by_device() has been executed, the timers could not be stopped as well. 
> As a result, the following UAF bug will happen:
> 
>      (Thread 1)                       |      (Thread 2)
> ax25_dev_device_up()                   |
> ...                                    | ax25_kill_by_device()
> ax25_bind()                            |
> ax25_connect()                         | ...
> ax25_std_establish_data_link()        |
>  ax25_start_t1timer()                 | ax25_dev_device_down()
>   mod_timer(&ax25->t1timer,..)        |
>                                       | ax25_release()
>   (wait a time)                       |  ...
>                                       |  ax25_dev_put(ax25_dev) //FREE
>   ax25_t1timer_expiry()               |
>    ax25->ax25_dev->values[..] //USE   |  ...
>     ...                               |
> 
> Your new patch could not mitigate this problem.





>>>>                       sk = s->sk;
>>>>                       if (!sk) {
>>>>                               spin_unlock_bh(&ax25_list_lock);
>>>>                               ax25_disconnect(s, ENETUNREACH);
>>>>                               s->ax25_dev = NULL;
>>>> +                               ax25_cb_del(s);
>>>>                               spin_lock_bh(&ax25_list_lock);
>>>>                               goto again;
>>>>                       }
>>>> @@ -104,6 +110,7 @@ static void ax25_kill_by_device(struct net_device *dev)
>>>>                               ax25_dev_put(ax25_dev);
>>>>                       }
>>>>                       release_sock(sk);
>>>> +                       ax25_cb_del(s);
>>>>                       spin_lock_bh(&ax25_list_lock);
>>>>                       sock_put(sk);
>>>>                       /* The entry could have been deleted from the
>>>> @@ -1052,12 +1059,8 @@ static int ax25_release(struct socket *sock)
>>>>               sk->sk_state_change(sk);
>>>>               ax25_destroy_socket(ax25);
>>>>       }
>>>> +
>>>>       if (ax25_dev) {
>>>> -               del_timer_sync(&ax25->timer);
>>>> -               del_timer_sync(&ax25->t1timer);
>>>> -               del_timer_sync(&ax25->t2timer);
>>>> -               del_timer_sync(&ax25->t3timer);
>>>> -               del_timer_sync(&ax25->idletimer);
>>>>               dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
>>>>               ax25_dev_put(ax25_dev);
>>>>       }
>>> 
>>> I think these del_timer_sync() functions could not be removed, otherwise the UAF bugs caused by
>>> timer handler could not be mitigated.
>> 
>> I disagree. See below:
>> 
>>> 
>>> 
>>>     (Thread 1)                       |      (Thread 2)
>>> ax25_dev_device_up() //(1)             |
>>> ...                                    | ax25_kill_by_device()
>>> ax25_bind()          //(2)             |
>>> ax25_connect()                         | ...
>>> ax25_std_establish_data_link()        |
>>> ax25_start_t1timer()                 | ax25_dev_device_down() //(3)
>>>  mod_timer(&ax25->t1timer,..)        |
>>>                                      | ax25_release()
>>>  (wait a time)                       |  ...
>>>                                      |  ax25_dev_put(ax25_dev) //(4)FREE
>>>  ax25_t1timer_expiry()               |
>>>   ax25->ax25_dev->values[..] //(5)USE|  ...
>>>    ...                               |
>>> 
>>> The ax25_dev is deallocated in position (4) and use in position (5).
>> 
>> I see the problem.
>> 
>> 
>> ax25_release() is the wrong place.
>> ax25_release() is called on normal socket close() from userspace
>> (i.E. termination of "call" session to a remote host).
>> The (running) timers are needed for a normal session cleanup (ending DISC); 
>> that's the sense of these timers.
>> You see this in the
>>  case AX25_STATE_3:
>>  case AX25_STATE_4:
>> conditions, where t1 is started.
>> => The ax25_dev != NULL condition is normal for a normal session close condition.
>> 
>> 
>> If we like to delete the timers here because we are in ax25_release due to the
>> ax25_kill_by_device/ax25_dev_device_down event, do we have another option to see who
>> called us, and for being able to handle the case correctly?
>> 
>> 
>> Or better - I already asked this question:
>> ax25_kill_by_device calls ax25_disconnect(ENETUNREACH), and in
>> ax25_disconnect() the timers are stopped.
>> That was introduced by one of your patches.
>> After this, ax25_release() is called.
>> Since ax25_disconnect() already stopped the timers, there's no need to stop them
>> again. ->
> 
> If we don't call ax25_bind() before ax25_kill_by_device(), there is no s->ax25_dev
> equals to ax25_dev when ax25_kill_by_device is executing and the ax25_disconnect()
> will not execute. Even if ax25_kill_by_device() has been executed, the timers could
> not be stopped as well. So we need to stop them again in ax25_release(). 
> 
>>  If ax25_release() is called by userspace socket_close(), timers are not stoped,
>>  and session shutdown with timers runs properly.
> 
> I think adding a check in ax25_release() which is used to judge whether ax25_kill_by_device()
> has been executed before ax25_release() is better. The detail is shown below:
> 
> diff --git a/include/net/ax25.h b/include/net/ax25.h
> index 0f9790c455b..a427a05672e 100644
> --- a/include/net/ax25.h
> +++ b/include/net/ax25.h
> @@ -228,6 +228,7 @@ typedef struct ax25_dev {
>        ax25_dama_info          dama;
> #endif
>        refcount_t              refcount;
> +       bool device_up;
> } ax25_dev;
> 
> typedef struct ax25_cb {
> diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
> index 363d47f9453..47ce6b630cc 100644
> --- a/net/ax25/af_ax25.c
> +++ b/net/ax25/af_ax25.c
> @@ -81,6 +81,7 @@ static void ax25_kill_by_device(struct net_device *dev)
> 
>        if ((ax25_dev = ax25_dev_ax25dev(dev)) == NULL)
>                return;
> +       ax25_dev->device_up = false;
> 
>        spin_lock_bh(&ax25_list_lock);
> again:
> @@ -1053,11 +1054,13 @@ static int ax25_release(struct socket *sock)
>                ax25_destroy_socket(ax25);
>        }
>        if (ax25_dev) {
> -               del_timer_sync(&ax25->timer);
> -               del_timer_sync(&ax25->t1timer);
> -               del_timer_sync(&ax25->t2timer);
> -               del_timer_sync(&ax25->t3timer);
> -               del_timer_sync(&ax25->idletimer);
> +               if (!ax25_dev->device_up) {
> +                       del_timer_sync(&ax25->timer);
> +                       del_timer_sync(&ax25->t1timer);
> +                       del_timer_sync(&ax25->t2timer);
> +                       del_timer_sync(&ax25->t3timer);
> +                       del_timer_sync(&ax25->idletimer);
> +               }
>                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
>                ax25_dev_put(ax25_dev);
>        }
> diff --git a/net/ax25/ax25_dev.c b/net/ax25/ax25_dev.c
> index d2a244e1c26..5451be15e07 100644
> --- a/net/ax25/ax25_dev.c
> +++ b/net/ax25/ax25_dev.c
> @@ -62,6 +62,7 @@ void ax25_dev_device_up(struct net_device *dev)
>        ax25_dev->dev     = dev;
>        dev_hold_track(dev, &ax25_dev->dev_tracker, GFP_ATOMIC);
>        ax25_dev->forward = NULL;
> +       ax25_dev->device_up = true;
> 
>        ax25_dev->values[AX25_VALUES_IPDEFMODE] = AX25_DEF_IPDEFMODE;
>        ax25_dev->values[AX25_VALUES_AXDEFMODE] = AX25_DEF_AXDEFMODE;

ack

It's by far the easiest and clearest solution.

> 
>> Furthermore:
>> directly after ax25_disconnect() which you call in ax25_kill_by_device(), 
>> struct ax25_cb *s s->ax25_dev is set to NULL.
>> -> If ax25_release() is called for the ax25_cb, the del_timer_assurances in
>> ax25_release() won't work anyway, because they are in the "if (ax25_dev) { ..." part.
> 
> We only need to stop timers once. If the timers could be stopped in ax25_kill_by_device(),
> there is no need to stop them in ax25_release(). In other words, if the timers could not
> be stopped in ax25_kill_by_device(), we need to stop them in ax25_release().
> 
> Best regards,
> Duoming Zhou

vy 73,
	- Thomas  dl9sau


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
  2022-05-26 22:56           ` Thomas Osterried
@ 2022-05-27  4:48             ` duoming
  0 siblings, 0 replies; 12+ messages in thread
From: duoming @ 2022-05-27  4:48 UTC (permalink / raw)
  To: Thomas Osterried; +Cc: Dan Carpenter, linux-hams, David S. Miller, Paolo Abeni


Hello,

On Fri, 27 May 2022 00:56:37 +0200 thomas wrote:

> I'm not so happy with adding new states to the ax25_dev struct.
> When I asked "If we like to delete the timers here because we are in ax25_release due
>   to the ax25_kill_by_device/ax25_dev_device_down event, do we have another option to
>   see what called us, and for being able to handle the case correctly?",
> I had in mind if we could see this situation in the current state inside
> the ax25_cb, like ax25->state or ax25->sk->..
> 
> But in case of ax25->state, ax25_connect() changes the state when executed in
> a parallel thread after ax25_kill_by_device() -> that's no option

The "ax25->state" or "ax25->sk->.." inside ax25_cb could not be used, because these flags could
be changed in many places such as ax25_destroy_socket(), ax25_create_cb() or ax25_disconnect().
We need a flag that only represents the ax25_kill_by_device() has been executed.

> Another one I thought about was an assurance in ax25_start_xxx_timer() functions,
> that timer should not start if ax25_cb->ax25_dev == NULL,
> but your examples showed, that
>   mod_timer(&ax25->t1timer,..)
> is also used directly to modify running timer.
> 
> 
> An other option may be assurances in ax25_xxx_expiry() to immediately return if
> ax25_cb->ax25_dev == NULL.
> 
> Indeed, I see in ax25_heartbeat_expiry(struct timer_list *t) this assurance:
>           if (ax25->ax25_dev)
>                 proto = ax25->ax25_dev->values[AX25_VALUES_PROTOCOL];
> Unfortunately, those checks are missing in the other timer_expiry functions and
> one of them is ax25_t1timer_expiry, for which you showed in your kernel panic
> backtrace.

The "if(ax25->ax25_dev)" check in ax25_heartbeat_expiry() is useless, because there
is not any lock that could synchronize with ax25_kill_by_device(), ax25_dev_device_down()
or ax25_release(). Even if we add "if(ax25->ax25_dev)" check in other timer_expiry functions,
the race condition will also happen.

> If I see it correctly, this would solve the issue you had with the running
> timers, without forced-stopping them in ax25_release().
> But on the other hand: if the interface is really down, there's no need
> for the timers to keep running, and keeping things running where parts
> may already be freed is dangerous. The session has to been cleaned up immediately.
> -> This clearly speaks for your latest approach.

The timers will start when we call ax25_bind()-->ax25_connect() to establish a session or
during the period of session cleanup. 

If we only use ax25_dev_device_up() to turn the device on instead of using ax25_bind()-->ax25_connect()
to establish a session, no timers will start. So if the interface is down, there are no timers running.

In some cases, the timers could not be stopped in ax25_kill_by_device(). If you call ax25_kill_by_device()
before ax25_bind(), which is shown below. You could not stop timers in ax25_kill_by_device(), because it 
has already been executed.

      (Thread 1)                       |      (Thread 2)
ax25_dev_device_up()                   |
...                                    | ax25_kill_by_device()
ax25_bind()                            |
ax25_connect()  //! timers start       | ...
 ax25_std_establish_data_link()        |
  ax25_start_t1timer()                 | ax25_dev_device_down()
   mod_timer(&ax25->t1timer,..)        |
                                       | ax25_release()
   (wait a time)                       |  ...
                                       |  ax25_dev_put(ax25_dev) //FREE
   ax25_t1timer_expiry()               |
    ax25->ax25_dev->values[..] //USE   |  ...
     ...                               |

We have to use del_timer_sync() in ax25_release() to stop timers, because the device could
only be removed once in the real world. What's more, If we use ax25_bind()-->ax25_connect()
to establish a session, we need to call ax25_release() to close the session in normal case.
So putting the del_timer_sync() in ax25_release() is reasonable.

> Perhaps it's a good idea anyway to add the assurance (like ax25_heartbeat_expiry()
> does) to the other ax25_xxx_expiry() functions. 

I disagree, there is not any lock that could synchronize with other functions in 
ax25_xxx_expiry() functions.

> An other approach I had in mind was a variable in struct ax25_cb, that the cb
> is in progress of being cleaned up.
> This way, we could prohibit starting or modifying timers.
> Unfortunately, there might be many checks in various functions necessary.
> -> This clearly speaks for your latest approach.
> 
> Perhaps it's a good idea to also test in
> ax25_setsockopt SO_BINDTODEVICE
> and in
> ax25_bind_to_device
> if ax25_ dev->device_up flag is 0, and if so, return in  -ENODEV;

It is unnecessary to add ax25_dev->device_up flag in ax25_setsockopt SO_BINDTODEVICE
or ax25_bind(). Because there is no bug and we need to add extra locks in order to 
synchronize with ax25_dev->device_up in ax25_kill_by_device(), which is tedious.
 
> Off-topic-question: about the quite-new dev_hold_track():
> while looking to 
> ax25_setsockopt() and ax25_bind() for SO_BINDTODEVICE,
> I se only that dev_hold_track() is called in ax25_bind(),
> but not in ax25_setsockopt() for SO_BINDTODEVICE.
> Is it correct that ax25_setsockopt() for SO_BINDTODEVICE
> does not need to trigger the tracker?

There is no need to add refcount of net_device in SO_BINDTODEVICE, because there is
rtnl_lock() which is well synchronized in it.

Best regards,
Duoming Zhou

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-05-27  4:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-23  5:46 Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 Thomas Osterried
2022-05-23 13:52 ` Dan Carpenter
2022-05-24  9:25   ` Thomas Osterried
2022-05-24 10:47     ` duoming
2022-05-24 21:52       ` Thomas Osterried
2022-05-25  2:34         ` duoming
2022-05-26 22:56           ` Thomas Osterried
2022-05-27  4:48             ` duoming
2022-05-23 14:07 ` duoming
2022-05-23 16:05   ` Roland Schwarz
2022-05-23 18:53     ` Thomas Osterried
2022-05-23 19:14   ` Thomas Osterried

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.