All of lore.kernel.org
 help / color / mirror / Atom feed
From: Paolo Abeni <pabeni@redhat.com>
To: Joanne Koong <joannelkoong@gmail.com>,
	netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Martin KaFai Lau <kafai@fb.com>, Jakub Kicinski <kuba@kernel.org>,
	David Miller <davem@davemloft.net>,
	syzbot <syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com>,
	Mat Martineau <mathew.j.martineau@linux.intel.com>
Subject: Re: [PATCH net-next v2 1/2] net: Update bhash2 when socket's rcv saddr changes
Date: Wed, 08 Jun 2022 09:35:20 +0200	[thread overview]
Message-ID: <a2cc40d282a28f6eef29274dda8b55f27bbf3a6f.camel@redhat.com> (raw)
In-Reply-To: <CAJnrk1Zv2ZYyRHdpW6ZZEVh+n=ryvJqmLTDbfa=_Y-A2gD8ANQ@mail.gmail.com>

On Tue, 2022-06-07 at 13:24 -0700, Joanne Koong wrote:
> On Tue, Jun 7, 2022 at 1:33 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > Hello,
> > 
> > On Thu, 2022-06-02 at 09:51 -0700, Joanne Koong wrote:
> > > Commit d5a42de8bdbe ("net: Add a second bind table hashed by port and
> > > address") added a second bind table, bhash2, that hashes by a socket's port
> > > and rcv address.
> > > 
> > > However, there are two cases where the socket's rcv saddr can change
> > > after it has been binded:
> > > 
> > > 1) The case where there is a bind() call on "::" (IPADDR_ANY) and then
> > > a connect() call. The kernel will assign the socket an address when it
> > > handles the connect()
> > > 
> > > 2) In inet_sk_reselect_saddr(), which is called when rerouting fails
> > > when rebuilding the sk header (invoked by inet_sk_rebuild_header)
> > > 
> > > In these two cases, we need to update the bhash2 table by removing the
> > > entry for the old address, and adding a new entry reflecting the updated
> > > address.
> > > 
> > > Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
> > > Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > Reviewed-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  include/net/inet_hashtables.h |  6 ++-
> > >  include/net/ipv6.h            |  2 +-
> > >  net/dccp/ipv4.c               | 10 +++--
> > >  net/dccp/ipv6.c               |  4 +-
> > >  net/ipv4/af_inet.c            |  7 +++-
> > >  net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++++---
> > >  net/ipv4/tcp_ipv4.c           |  8 +++-
> > >  net/ipv6/inet6_hashtables.c   |  4 +-
> > >  net/ipv6/tcp_ipv6.c           |  4 +-
> > >  9 files changed, 97 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > index a0887b70967b..2c331ce6ca73 100644
> > > --- a/include/net/inet_hashtables.h
> > > +++ b/include/net/inet_hashtables.h
> > > @@ -448,11 +448,13 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
> > >  }
> > > 
> > >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                     struct sock *sk, u64 port_offset,
> > > +                     struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> > >                       int (*check_established)(struct inet_timewait_death_row *,
> > >                                                struct sock *, __u16,
> > >                                                struct inet_timewait_sock **));
> > > 
> > >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                   struct sock *sk);
> > > +                   struct sock *sk, bool prev_inaddr_any);
> > > +
> > > +int inet_bhash2_update_saddr(struct sock *sk);
> > >  #endif /* _INET_HASHTABLES_H */
> > > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> > > index 5b38bf1a586b..6a50aca56d50 100644
> > > --- a/include/net/ipv6.h
> > > +++ b/include/net/ipv6.h
> > > @@ -1187,7 +1187,7 @@ int inet6_compat_ioctl(struct socket *sock, unsigned int cmd,
> > >               unsigned long arg);
> > > 
> > >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                           struct sock *sk);
> > > +                    struct sock *sk, bool prev_inaddr_any);
> > >  int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
> > >  int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> > >                 int flags);
> > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > index da6e3b20cd75..37a8bc3ee49e 100644
> > > --- a/net/dccp/ipv4.c
> > > +++ b/net/dccp/ipv4.c
> > > @@ -47,12 +47,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > >       struct inet_sock *inet = inet_sk(sk);
> > >       struct dccp_sock *dp = dccp_sk(sk);
> > > +     struct ip_options_rcu *inet_opt;
> > >       __be16 orig_sport, orig_dport;
> > > +     bool prev_inaddr_any = false;
> > >       __be32 daddr, nexthop;
> > >       struct flowi4 *fl4;
> > >       struct rtable *rt;
> > >       int err;
> > > -     struct ip_options_rcu *inet_opt;
> > > 
> > >       dp->dccps_role = DCCP_ROLE_CLIENT;
> > > 
> > > @@ -89,8 +90,11 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       if (inet_opt == NULL || !inet_opt->opt.srr)
> > >               daddr = fl4->daddr;
> > > 
> > > -     if (inet->inet_saddr == 0)
> > > +     if (inet->inet_saddr == 0) {
> > >               inet->inet_saddr = fl4->saddr;
> > > +             prev_inaddr_any = true;
> > > +     }
> > > +
> > >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > >       inet->inet_dport = usin->sin_port;
> > >       sk_daddr_set(sk, daddr);
> > > @@ -105,7 +109,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >        * complete initialization after this.
> > >        */
> > >       dccp_set_state(sk, DCCP_REQUESTING);
> > > -     err = inet_hash_connect(&dccp_death_row, sk);
> > > +     err = inet_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> > >       if (err != 0)
> > >               goto failure;
> > > 
> > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > index fd44638ec16b..03013522acab 100644
> > > --- a/net/dccp/ipv6.c
> > > +++ b/net/dccp/ipv6.c
> > > @@ -824,6 +824,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       struct ipv6_pinfo *np = inet6_sk(sk);
> > >       struct dccp_sock *dp = dccp_sk(sk);
> > >       struct in6_addr *saddr = NULL, *final_p, final;
> > > +     bool prev_inaddr_any = false;
> > >       struct ipv6_txoptions *opt;
> > >       struct flowi6 fl6;
> > >       struct dst_entry *dst;
> > > @@ -936,6 +937,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       if (saddr == NULL) {
> > >               saddr = &fl6.saddr;
> > >               sk->sk_v6_rcv_saddr = *saddr;
> > > +             prev_inaddr_any = true;
> > >       }
> > > 
> > >       /* set the source address */
> > > @@ -951,7 +953,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       inet->inet_dport = usin->sin6_port;
> > > 
> > >       dccp_set_state(sk, DCCP_REQUESTING);
> > > -     err = inet6_hash_connect(&dccp_death_row, sk);
> > > +     err = inet6_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> > >       if (err)
> > >               goto late_failure;
> > > 
> > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > index 93da9f783bec..ad627a99ff9d 100644
> > > --- a/net/ipv4/af_inet.c
> > > +++ b/net/ipv4/af_inet.c
> > > @@ -1221,10 +1221,11 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > >       struct inet_sock *inet = inet_sk(sk);
> > >       __be32 old_saddr = inet->inet_saddr;
> > >       __be32 daddr = inet->inet_daddr;
> > > +     struct ip_options_rcu *inet_opt;
> > >       struct flowi4 *fl4;
> > >       struct rtable *rt;
> > >       __be32 new_saddr;
> > > -     struct ip_options_rcu *inet_opt;
> > > +     int err;
> > > 
> > >       inet_opt = rcu_dereference_protected(inet->inet_opt,
> > >                                            lockdep_sock_is_held(sk));
> > > @@ -1253,6 +1254,10 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > 
> > >       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > 
> > > +     err = inet_bhash2_update_saddr(sk);
> > > +     if (err)
> > > +             return err;
> > > +
> > >       /*
> > >        * XXX The only one ugly spot where we need to
> > >        * XXX really change the sockets identity after
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index e8de5e699b3f..592b70663a3b 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -826,6 +826,55 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> > >       return bhash2;
> > >  }
> > > 
> > > +/* the lock for the socket's corresponding bhash entry must be held */
> > > +static int __inet_bhash2_update_saddr(struct sock *sk,
> > > +                                   struct inet_hashinfo *hinfo,
> > > +                                   struct net *net, int port, int l3mdev)
> > > +{
> > > +     struct inet_bind2_hashbucket *head2;
> > > +     struct inet_bind2_bucket *tb2;
> > > +
> > > +     tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk,
> > > +                                  &head2);
> > > +     if (!tb2) {
> > > +             tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
> > > +                                            net, head2, port, l3mdev, sk);
> > > +             if (!tb2)
> > > +                     return -ENOMEM;
> > > +     }
> > > +
> > > +     /* Remove the socket's old entry from bhash2 */
> > > +     __sk_del_bind2_node(sk);
> > > +
> > > +     sk_add_bind2_node(sk, &tb2->owners);
> > > +     inet_csk(sk)->icsk_bind2_hash = tb2;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +/* This should be called if/when a socket's rcv saddr changes after it has
> > > + * been binded.
> > > + */
> > > +int inet_bhash2_update_saddr(struct sock *sk)
> > > +{
> > > +     struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> > > +     int l3mdev = inet_sk_bound_l3mdev(sk);
> > > +     struct inet_bind_hashbucket *head;
> > > +     int port = inet_sk(sk)->inet_num;
> > > +     struct net *net = sock_net(sk);
> > > +     int err;
> > > +
> > > +     head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
> > > +
> > > +     spin_lock_bh(&head->lock);
> > > +
> > > +     err = __inet_bhash2_update_saddr(sk, hinfo, net, port, l3mdev);
> > > +
> > > +     spin_unlock_bh(&head->lock);
> > > +
> > > +     return err;
> > > +}
> > > +
> > >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> > >   * Note that we use 32bit integers (vs RFC 'short integers')
> > >   * because 2^16 is not a multiple of num_ephemeral and this
> > > @@ -840,7 +889,7 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> > >  static u32 *table_perturb;
> > > 
> > >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -             struct sock *sk, u64 port_offset,
> > > +             struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> > >               int (*check_established)(struct inet_timewait_death_row *,
> > >                       struct sock *, __u16, struct inet_timewait_sock **))
> > >  {
> > > @@ -858,11 +907,24 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > >       int l3mdev;
> > >       u32 index;
> > > 
> > > +     l3mdev = inet_sk_bound_l3mdev(sk);
> > > +
> > >       if (port) {
> > >               head = &hinfo->bhash[inet_bhashfn(net, port,
> > >                                                 hinfo->bhash_size)];
> > >               tb = inet_csk(sk)->icsk_bind_hash;
> > > +
> > >               spin_lock_bh(&head->lock);
> > > +
> > > +             if (prev_inaddr_any) {
> > > +                     ret = __inet_bhash2_update_saddr(sk, hinfo, net, port,
> > > +                                                      l3mdev);
> > > +                     if (ret) {
> > > +                             spin_unlock_bh(&head->lock);
> > > +                             return ret;
> > > +                     }
> > > +             }
> > > +
> > >               if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
> > >                       inet_ehash_nolisten(sk, NULL, NULL);
> > >                       spin_unlock_bh(&head->lock);
> > > @@ -875,8 +937,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > >               return ret;
> > >       }
> > > 
> > > -     l3mdev = inet_sk_bound_l3mdev(sk);
> > > -
> > >       inet_get_local_port_range(net, &low, &high);
> > >       high++; /* [32768, 60999] -> [32768, 61000[ */
> > >       remaining = high - low;
> > > @@ -987,13 +1047,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > >   * Bind a port for a connect operation and hash it.
> > >   */
> > >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                   struct sock *sk)
> > > +                   struct sock *sk, bool prev_inaddr_any)
> > >  {
> > >       u64 port_offset = 0;
> > > 
> > >       if (!inet_sk(sk)->inet_num)
> > >               port_offset = inet_sk_port_offset(sk);
> > > -     return __inet_hash_connect(death_row, sk, port_offset,
> > > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> > >                                  __inet_check_established);
> > >  }
> > >  EXPORT_SYMBOL_GPL(inet_hash_connect);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index dac2650f3863..adf8d750933d 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -203,6 +203,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       struct inet_sock *inet = inet_sk(sk);
> > >       struct tcp_sock *tp = tcp_sk(sk);
> > >       __be16 orig_sport, orig_dport;
> > > +     bool prev_inaddr_any = false;
> > >       __be32 daddr, nexthop;
> > >       struct flowi4 *fl4;
> > >       struct rtable *rt;
> > > @@ -246,8 +247,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       if (!inet_opt || !inet_opt->opt.srr)
> > >               daddr = fl4->daddr;
> > > 
> > > -     if (!inet->inet_saddr)
> > > +     if (!inet->inet_saddr) {
> > >               inet->inet_saddr = fl4->saddr;
> > > +             prev_inaddr_any = true;
> > > +     }
> > > +
> > >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > 
> > >       if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > @@ -273,7 +277,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >        * complete initialization after this.
> > >        */
> > >       tcp_set_state(sk, TCP_SYN_SENT);
> > > -     err = inet_hash_connect(tcp_death_row, sk);
> > > +     err = inet_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> > >       if (err)
> > >               goto failure;
> > > 
> > > diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> > > index 7d53d62783b1..c87c5933f3be 100644
> > > --- a/net/ipv6/inet6_hashtables.c
> > > +++ b/net/ipv6/inet6_hashtables.c
> > > @@ -317,13 +317,13 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
> > >  }
> > > 
> > >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                    struct sock *sk)
> > > +                    struct sock *sk, bool prev_inaddr_any)
> > >  {
> > >       u64 port_offset = 0;
> > > 
> > >       if (!inet_sk(sk)->inet_num)
> > >               port_offset = inet6_sk_port_offset(sk);
> > > -     return __inet_hash_connect(death_row, sk, port_offset,
> > > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> > >                                  __inet6_check_established);
> > >  }
> > >  EXPORT_SYMBOL_GPL(inet6_hash_connect);
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index f37dd4aa91c6..81e3312c2a97 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -152,6 +152,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       struct ipv6_pinfo *np = tcp_inet6_sk(sk);
> > >       struct tcp_sock *tp = tcp_sk(sk);
> > >       struct in6_addr *saddr = NULL, *final_p, final;
> > > +     bool prev_inaddr_any = false;
> > >       struct ipv6_txoptions *opt;
> > >       struct flowi6 fl6;
> > >       struct dst_entry *dst;
> > > @@ -289,6 +290,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       if (!saddr) {
> > >               saddr = &fl6.saddr;
> > >               sk->sk_v6_rcv_saddr = *saddr;
> > > +             prev_inaddr_any = true;
> > >       }
> > > 
> > >       /* set the source address */
> > > @@ -309,7 +311,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > 
> > >       tcp_set_state(sk, TCP_SYN_SENT);
> > >       tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
> > > -     err = inet6_hash_connect(tcp_death_row, sk);
> > > +     err = inet6_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> > >       if (err)
> > >               goto late_failure;
> > > 
> > 
> > I'm sorry for the late notice, but it looks like that the mptcp
> > syzkaller instance is still hitting the Warning in icsk_get_port on top
> > of the v1 of this series:
> > 
> > https://github.com/multipath-tcp/mptcp_net-next/issues/279
> > 
> > and the change in v2 should not address that. @Mat could you please
> > confirm the above?
> > 
> > Dumb question: I don't understand how the locking in bhash2 works.
> > Could you explain that?
> > 
> > What happens when 2 different processes bind different sockets on
> > different ports (with different bhash buckets) using different
> > addresses so that they hit the same bhash2 bucket? AFAICS each process
> > will use a different lock and access/modification to bhash2 could
> > happen simultaneusly?
> Hi Paolo. Yes, I think you are correct here that there could be a
> scenario where this happens. Unfortunately, I think this means the
> bhash2 table will need its own lock. I will submit a follow-up for
> this.
> 

I'm wondering if we could (and more importantly, if it would make any
sense) resort to use bhash2 usage only?

e.g. add a spinlock per row to bhash2, use it to protect row
manipulation, always hash into bhash2 and then drop bhash, similar to
commit cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120?

Cheers,

Paolo





  reply	other threads:[~2022-06-08  8:06 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-02 16:50 [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Joanne Koong
2022-06-02 16:51 ` [PATCH net-next v2 1/2] net: " Joanne Koong
2022-06-07  8:33   ` Paolo Abeni
2022-06-07 17:10     ` Mat Martineau
2022-06-07 20:24     ` Joanne Koong
2022-06-08  7:35       ` Paolo Abeni [this message]
2022-06-08 17:47         ` Joanne Koong
2022-06-02 16:51 ` [PATCH net-next v2 2/2] selftests/net: Add sk_bind_sendto_listen test Joanne Koong
2022-06-03 18:54 ` [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Mat Martineau
2022-06-04  0:38   ` Joanne Koong
2022-06-07  0:02     ` Joanne Koong
2022-06-07 17:31       ` Mat Martineau
2022-06-08 20:27         ` Joanne Koong
2022-06-09 23:52           ` Joanne Koong
2022-06-10  0:37             ` Mat Martineau
2022-06-10 18:09               ` Joanne Koong
2022-06-10 20:15                 ` Mat Martineau
2022-06-09 23:59           ` Mat Martineau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a2cc40d282a28f6eef29274dda8b55f27bbf3a6f.camel@redhat.com \
    --to=pabeni@redhat.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=joannelkoong@gmail.com \
    --cc=kafai@fb.com \
    --cc=kuba@kernel.org \
    --cc=mathew.j.martineau@linux.intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.