Regression in "ax25: Fix UAF bugs in ax25 timers", patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0

* Regression in "ax25: Fix UAF bugs in ax25 timers", patch  82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
@ 2022-05-23  5:46 Thomas Osterried
  2022-05-23 13:52 ` Dan Carpenter
  2022-05-23 14:07 ` duoming
  0 siblings, 2 replies; 12+ messages in thread
From: Thomas Osterried @ 2022-05-23  5:46 UTC (permalink / raw)
  To: linux-hams, Duoming Zhou; +Cc: David S. Miller, Paolo Abeni


Regression due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 "ax25: Fix UAF bugs in ax25 timers",
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/af_ax25.c?id=82e31755e55fbcea6a9dfaae5fe4860ade17cbc0
and discussion about necessarity of
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/ax25/ax25_subr.c?id=fc6d01ff9ef03b66d4a3a23b46fc3c3d8cf92009
(because it addresses the same problem).


Due to patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc, any ax25 session is cleaned up anymore on disconnect.
-> netstat --ax25 shows the ex-connected session it in state "LISTEN". Makes ax25 unusable in a production environment.
=> Regression.


This is my proposal for a fix.
I'd appreciate you test and give feedback.

For more details, see below.

$ diff -Naur af_ax25.c.orig af_ax25.c.dl9sau

--- af_ax25.c.orig      2022-05-04 12:16:35.229499063 +0200
+++ af_ax25.c.dl9sau    2022-05-23 06:48:06.268171116 +0200
@@ -86,11 +86,18 @@
 again:
        ax25_for_each(s, &ax25_list) {
                if (s->ax25_dev == ax25_dev) {
+                       // Stop timers. Really needed? ax25_disconnect(), that we call with arg ENETUNREACH will also stop the timers.
+                       del_timer_sync(&s->timer);
+                       del_timer_sync(&s->t1timer);
+                       del_timer_sync(&s->t2timer);
+                       del_timer_sync(&s->t3timer);
+                       del_timer_sync(&s->idletimer);
                        sk = s->sk;
                        if (!sk) {
                                spin_unlock_bh(&ax25_list_lock);
                                ax25_disconnect(s, ENETUNREACH);
                                s->ax25_dev = NULL;
+                               ax25_cb_del(s);
                                spin_lock_bh(&ax25_list_lock);
                                goto again;
                        }
@@ -104,6 +111,7 @@
                                ax25_dev_put(ax25_dev);
                        }
                        release_sock(sk);
+                       ax25_cb_del(s);
                        spin_lock_bh(&ax25_list_lock);
                        sock_put(sk);
                        /* The entry could have been deleted from the
@@ -995,9 +1003,11 @@
        if (sk->sk_type == SOCK_SEQPACKET) {
                switch (ax25->state) {
                case AX25_STATE_0:
-                       release_sock(sk);
-                       ax25_disconnect(ax25, 0);
-                       lock_sock(sk);
+                       if (!sock_flag(ax25->sk, SOCK_DEAD)) {
+                               release_sock(sk);
+                               ax25_disconnect(ax25, 0);
+                               lock_sock(sk);
+                       }
                        ax25_destroy_socket(ax25);
                        break;
 
@@ -1052,12 +1062,8 @@
                sk->sk_state_change(sk);
                ax25_destroy_socket(ax25);
        }
+
        if (ax25_dev) {
-               del_timer_sync(&ax25->timer);
-               del_timer_sync(&ax25->t1timer);
-               del_timer_sync(&ax25->t2timer);
-               del_timer_sync(&ax25->t3timer);
-               del_timer_sync(&ax25->idletimer);
                dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker);
                ax25_dev_put(ax25_dev);
        }

Argumentation:

Author Duoming Zhou <duoming@zju.edu.cn> described his patch 82e31755e55fbcea6a9dfaae5fe4860ade17cbc0 , that lead to the regession, with:

===
ax25: Fix UAF bugs in ax25 timers
There are race conditions that may lead to UAF bugs in
ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(),
ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we call
ax25_release() to deallocate ax25_dev.

One of the UAF bugs caused by ax25_release() is shown below:

      (Thread 1)                    |      (Thread 2)
ax25_dev_device_up() //(1)          |
...                                 | ax25_kill_by_device()
ax25_bind()          //(2)          |
ax25_connect()                      | ...
 ax25_std_establish_data_link()     |
  ax25_start_t1timer()              | ax25_dev_device_down() //(3)
   mod_timer(&ax25->t1timer,..)     |
                                    | ax25_release()
   (wait a time)                    |  ...
                                    |  ax25_dev_put(ax25_dev) //(4)FREE
   ax25_t1timer_expiry()            |
    ax25->ax25_dev->values[..] //USE|  ...
     ...                            |

We increase dev in position (1) and (2), and
decrease the refcount of ax25_dev in position (3) and (4).
The ax25_dev will be freed in position (4) and be used in
ax25_t1timer_expiry().
===


Imho, Duoming has overseen when he fixed the ifconfig device down issue, that ax25_release() is called every time a userspace program closes his socket.
(.release        = ax25_release)

Timers are for good, not for bad. ax25 timers are important for correct behavior of the AX25_STATE machine. And the "heartbeat" timer is used internally for correct session cleanup.
I.e., if user closes his AX25-sessioon, kernel should send DISC upon either UA or DM are received, or until t1 expiry. The session and it's state needs still to be known for some time on normal session close. The only case where the session sate not needed anymore if you set ifcoonfig iface down.


Duoming's approach lead to
if (ax25_dev) {  // existence of ax25_dev: normal situation for a regularly closed ax25 session
              del_timer_sync(&ax25->timer);   // this is the "heartbeat" timer
              del_timer_sync(&ax25->t1timer); // .. and these are the others
              del_timer_sync(&ax25->t2timer);
              del_timer_sync(&ax25->t3timer);
              del_timer_sync(&ax25->idletimer);


The assurance that timers are stopped should to go to ax25_kill_by_device() [this is for the situation, that the SysOp sets ifconfig ax25iface down], from where ax25_release() is called. See my patch suggestion.

I argue, that the stopping of the timer in ax25_kill_by_device() is not really necessary, because it calls
ax25_disconnect(s, ENETUNREACH);
and ax25_disconnect() does:
        if (reason == ENETUNREACH) {
                del_timer_sync(&ax25->timer);
                del_timer_sync(&ax25->t1timer);
                del_timer_sync(&ax25->t2timer);
                del_timer_sync(&ax25->t3timer);
                del_timer_sync(&ax25->idletimer);

-> Our timers get stopped anyway.

But on the other hand, Duoming had the "BUG: KASAN: use-after-free in ax25_t1timer_expiry+0x1c/0x60" issue, that inspired him for his patch..

But on another hand. the reason == ENETUNREACH check in ax5_disconnect is quite new: It came by Duoming's "ax25: Fix NULL pointer dereferences in ax25 timers" patch to ax25_subr.c, ten days before (commit ID fc6d01ff). If ax25_disconnect() deletes the timers correctly, it may not be needed as an insurance for ax25_kill_by_device().
Both of Duoming's patches try to solve the same problem.
It should be investigated if the del_timer_sync()-assurance has to be at so many different points: we stop them, and call a function that also stops them. If this is really needed, than the problem we try to resolve with this lays somewhere else.


ax25_destroy_socket() is called on heartbeat expiry if AX25_STATE is 0. In the case we stop the timers in ax25_kill_by_device(), ax25_destroy_socket() is not called.
ax25_cb_del() is normaly done by ax25_destroy_socket(). 

If we don't call ax25_cb_del() directly, the situation is this, for a current incoming connection in state ESTABLISHED and after ifdown bpq0:
DL9AU-3    DL9SAU-1   ???     LISTENING    000/000  0       0
-> No timer cleans this up.

If I'd try to call ax25_destroy_socket() instead of ax25_cb_del(), this would cause a refcount-underflow OOPS due to the dev_put_track-stuff.

ax25_destroy_socket() would do ax25_cb_del(), and also stop timers. But it also may start timers again (ax25_start_heartbeat(sax25), or by timer_setup(&ax25->dtimer, ax25_destroy_timer, 0) ) -- which is not in our interest in this the case the device is set to down.

That's why assure ax25_cb_del() in my approach.


In the ax25_disconnect() del_timer_sync in "reason == ENETUNREACH", I suggest to revert. This was also introduced to solve the ax25_kill_by_device() problem.
The need for enforced stopping the timers I only see in the special case when a device is set to down.


Cc goes to David Miller and Paolo Abeni, because they committed the both patches I discussed above.


vy 73,
	- Thomas  dl9sau




^ permalink raw reply	[flat|nested] 12+ messages in thread