Netdev Archive on lore.kernel.org
 help / color / Atom feed
* possible race in __inet_lookup_established()
@ 2019-11-20  8:39 Michal Kubecek
       [not found] ` <CANn89iJYXh7AwK8_Aiz3wXqugG0icPNW6OPsPxwOvpH90kr+Ew@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Kubecek @ 2019-11-20  8:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Firo Yang

Hello Eric,

we are investigating a crash in socket lookup in a distribution kernel
based on v4.12 but the possible problem we found seems to also apply to
current mainline (or net) code.

The common pattern is:

- the crash always happens in __inet_lookup_established() in

	sk_nulls_for_each_rcu(sk, node, &head->chain) {
		if (sk->sk_hash != hash)     <-----------------
			continue;

  as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null
  so dereferencing sk->sk_hash faults

- the reason is that previous sk value pointed to a listening socket
  rather than an established one; as listening socket uses sk_node, end
  of the chain is marked by a null pointer which is not detected as
  a chain end by sk_nulls_for_each_rcu()

- there is no socket matching skb which is a TCP pure ACK having
  127.0.0.1 as both source and destination

- the chain pointed to by head variable is empty

Firo Yang came with the theory that this could be a race between socket
lookup and freing the socket and replacing it with a listening one:

1. CPU A gets a pointer to an established socket as sk in the
sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not
thake a reference to it.

2. CPU B frees the socket

3. Slab object pointed to by sk is reused for a new listening socket.
This socket has null sk->sk_node->next which uses the same spot as
sk->sk_nulls_node->next

4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and
follows the pointer, resulting in a fault dereferencing sk->sk_hash.

Unless we missed something, there is no protection against established
socket being freed and replaced by a new listening one while
__inet_lookup_established() has a pointer to it. The RCU loop only
prevents the slab object being reused for a different slab cache or
something completely different but as established and listening sockets
share the same slab cache, it does not protect us from switching from
established to listening.

As far as I can say, this kind of race could have happened for quite
long but before your commit 3b24d854cb35 ("tcp/dccp: do not touch
listener sk_refcnt under synflood"), the worst that could happen would
be switching to a chain in listener lookup table, following it to its
end and then (most likely) restarting the lookup or failing. Now that
established and listening sockets use different list types, replacing
one with the other can be deadly.

Do you agree that this race is possible or is there something we missed
that would prevent it?

Michal

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
       [not found] ` <CANn89iJYXh7AwK8_Aiz3wXqugG0icPNW6OPsPxwOvpH90kr+Ew@mail.gmail.com>
@ 2019-11-20 18:10   ` Michal Kubecek
  2019-11-20 19:13     ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Kubecek @ 2019-11-20 18:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Firo Yang

On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote:
> On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> 
> > Hello Eric,
> >
> > we are investigating a crash in socket lookup in a distribution kernel
> > based on v4.12 but the possible problem we found seems to also apply to
> > current mainline (or net) code.
> >
> > The common pattern is:
> >
> > - the crash always happens in __inet_lookup_established() in
> >
> >         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> >                 if (sk->sk_hash != hash)     <-----------------
> >                         continue;
> >
> >   as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null
> >   so dereferencing sk->sk_hash faults
> >
> > - the reason is that previous sk value pointed to a listening socket
> >   rather than an established one; as listening socket uses sk_node, end
> >   of the chain is marked by a null pointer which is not detected as
> >   a chain end by sk_nulls_for_each_rcu()
> >
> > - there is no socket matching skb which is a TCP pure ACK having
> >   127.0.0.1 as both source and destination
> >
> > - the chain pointed to by head variable is empty
> >
> > Firo Yang came with the theory that this could be a race between socket
> > lookup and freing the socket and replacing it with a listening one:
> >
> > 1. CPU A gets a pointer to an established socket as sk in the
> > sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not
> > thake a reference to it.
> >
> > 2. CPU B frees the socket
> >
> > 3. Slab object pointed to by sk is reused for a new listening socket.
> > This socket has null sk->sk_node->next which uses the same spot as
> > sk->sk_nulls_node->next
> >
> > 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and
> > follows the pointer, resulting in a fault dereferencing sk->sk_hash.
> >
> > Unless we missed something, there is no protection against established
> > socket being freed and replaced by a new listening one while
> > __inet_lookup_established() has a pointer to it. The RCU loop only
> > prevents the slab object being reused for a different slab cache or
> > something completely different but as established and listening sockets
> > share the same slab cache, it does not protect us from switching from
> > established to listening.
> >
> > As far as I can say, this kind of race could have happened for quite
> > long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch
> > listener sk_refcnt under synflood"), the worst that could happen would
> > be switching to a chain in listener lookup table, following it to its
> > end and then (most likely) restarting the lookup or failing. Now that
> > established and listening sockets use different list types, replacing
> > one with the other can be deadly.
> >
> > Do you agree that this race is possible or is there something we missed
> > that would prevent it?
> >
> A listener is hashed on icsk_listen_portaddr_node, so I do not see how a
> listener could be found in the establish chain ?

It is not really in the chain. What we suspect is that between sk is
assigned pointer to an established socket in __inet_lookup_established()
and using sk->sk_nulls_node->next to go to the next (or stop if it's odd
nulls value), this established socket could be freed and its slab object
reused for a listening socket. As listening sockets no longer use a
nulls hashlist but a normal hashlist, in the most common case where the
socket is last in the chain, sk->sk_node->next (which occupies the same
place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls()
does not recognize the chain end and the loop would go on to next socket
in the chain.

Michal

> 
> sock_copy() makes sure to not touch sk_node
> 
> sk_prot_clear_nulls() makes sure to not touch sk_node
> 
> So maybe you miss a backport or something ?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
  2019-11-20 18:10   ` Michal Kubecek
@ 2019-11-20 19:13     ` Eric Dumazet
  2019-11-20 19:52       ` Michal Kubecek
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2019-11-20 19:13 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: netdev, Firo Yang

On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote:
> > On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> >
> > > Hello Eric,
> > >
> > > we are investigating a crash in socket lookup in a distribution kernel
> > > based on v4.12 but the possible problem we found seems to also apply to
> > > current mainline (or net) code.
> > >
> > > The common pattern is:
> > >
> > > - the crash always happens in __inet_lookup_established() in
> > >
> > >         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > >                 if (sk->sk_hash != hash)     <-----------------
> > >                         continue;
> > >
> > >   as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null
> > >   so dereferencing sk->sk_hash faults
> > >
> > > - the reason is that previous sk value pointed to a listening socket
> > >   rather than an established one; as listening socket uses sk_node, end
> > >   of the chain is marked by a null pointer which is not detected as
> > >   a chain end by sk_nulls_for_each_rcu()
> > >
> > > - there is no socket matching skb which is a TCP pure ACK having
> > >   127.0.0.1 as both source and destination
> > >
> > > - the chain pointed to by head variable is empty
> > >
> > > Firo Yang came with the theory that this could be a race between socket
> > > lookup and freing the socket and replacing it with a listening one:
> > >
> > > 1. CPU A gets a pointer to an established socket as sk in the
> > > sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not
> > > thake a reference to it.
> > >
> > > 2. CPU B frees the socket
> > >
> > > 3. Slab object pointed to by sk is reused for a new listening socket.
> > > This socket has null sk->sk_node->next which uses the same spot as
> > > sk->sk_nulls_node->next
> > >
> > > 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and
> > > follows the pointer, resulting in a fault dereferencing sk->sk_hash.
> > >
> > > Unless we missed something, there is no protection against established
> > > socket being freed and replaced by a new listening one while
> > > __inet_lookup_established() has a pointer to it. The RCU loop only
> > > prevents the slab object being reused for a different slab cache or
> > > something completely different but as established and listening sockets
> > > share the same slab cache, it does not protect us from switching from
> > > established to listening.
> > >
> > > As far as I can say, this kind of race could have happened for quite
> > > long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch
> > > listener sk_refcnt under synflood"), the worst that could happen would
> > > be switching to a chain in listener lookup table, following it to its
> > > end and then (most likely) restarting the lookup or failing. Now that
> > > established and listening sockets use different list types, replacing
> > > one with the other can be deadly.
> > >
> > > Do you agree that this race is possible or is there something we missed
> > > that would prevent it?
> > >
> > A listener is hashed on icsk_listen_portaddr_node, so I do not see how a
> > listener could be found in the establish chain ?
>
> It is not really in the chain. What we suspect is that between sk is
> assigned pointer to an established socket in __inet_lookup_established()
> and using sk->sk_nulls_node->next to go to the next (or stop if it's odd
> nulls value), this established socket could be freed and its slab object
> reused for a listening socket. As listening sockets no longer use a
> nulls hashlist but a normal hashlist, in the most common case where the
> socket is last in the chain, sk->sk_node->next (which occupies the same
> place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls()
> does not recognize the chain end and the loop would go on to next socket
> in the chain.
>

I hear you, but where is the sk->sk_nulls_node->next would be set to
NULL exactly ?

> Michal
>
> >
> > sock_copy() makes sure to not touch sk_node
> >
> > sk_prot_clear_nulls() makes sure to not touch sk_node
> >
> > So maybe you miss a backport or something ?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
  2019-11-20 19:13     ` Eric Dumazet
@ 2019-11-20 19:52       ` Michal Kubecek
  2019-11-20 20:04         ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Kubecek @ 2019-11-20 19:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Firo Yang

On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote:
> On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> >
> > On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote:
> > > On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> > >
> > > > Hello Eric,
> > > >
> > > > we are investigating a crash in socket lookup in a distribution kernel
> > > > based on v4.12 but the possible problem we found seems to also apply to
> > > > current mainline (or net) code.
> > > >
> > > > The common pattern is:
> > > >
> > > > - the crash always happens in __inet_lookup_established() in
> > > >
> > > >         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > > >                 if (sk->sk_hash != hash)     <-----------------
> > > >                         continue;
> > > >
> > > >   as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null
> > > >   so dereferencing sk->sk_hash faults
> > > >
> > > > - the reason is that previous sk value pointed to a listening socket
> > > >   rather than an established one; as listening socket uses sk_node, end
> > > >   of the chain is marked by a null pointer which is not detected as
> > > >   a chain end by sk_nulls_for_each_rcu()
> > > >
> > > > - there is no socket matching skb which is a TCP pure ACK having
> > > >   127.0.0.1 as both source and destination
> > > >
> > > > - the chain pointed to by head variable is empty
> > > >
> > > > Firo Yang came with the theory that this could be a race between socket
> > > > lookup and freing the socket and replacing it with a listening one:
> > > >
> > > > 1. CPU A gets a pointer to an established socket as sk in the
> > > > sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not
> > > > thake a reference to it.
> > > >
> > > > 2. CPU B frees the socket
> > > >
> > > > 3. Slab object pointed to by sk is reused for a new listening socket.
> > > > This socket has null sk->sk_node->next which uses the same spot as
> > > > sk->sk_nulls_node->next
> > > >
> > > > 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and
> > > > follows the pointer, resulting in a fault dereferencing sk->sk_hash.
> > > >
> > > > Unless we missed something, there is no protection against established
> > > > socket being freed and replaced by a new listening one while
> > > > __inet_lookup_established() has a pointer to it. The RCU loop only
> > > > prevents the slab object being reused for a different slab cache or
> > > > something completely different but as established and listening sockets
> > > > share the same slab cache, it does not protect us from switching from
> > > > established to listening.
> > > >
> > > > As far as I can say, this kind of race could have happened for quite
> > > > long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch
> > > > listener sk_refcnt under synflood"), the worst that could happen would
> > > > be switching to a chain in listener lookup table, following it to its
> > > > end and then (most likely) restarting the lookup or failing. Now that
> > > > established and listening sockets use different list types, replacing
> > > > one with the other can be deadly.
> > > >
> > > > Do you agree that this race is possible or is there something we missed
> > > > that would prevent it?
> > > >
> > > A listener is hashed on icsk_listen_portaddr_node, so I do not see how a
> > > listener could be found in the establish chain ?
> >
> > It is not really in the chain. What we suspect is that between sk is
> > assigned pointer to an established socket in __inet_lookup_established()
> > and using sk->sk_nulls_node->next to go to the next (or stop if it's odd
> > nulls value), this established socket could be freed and its slab object
> > reused for a listening socket. As listening sockets no longer use a
> > nulls hashlist but a normal hashlist, in the most common case where the
> > socket is last in the chain, sk->sk_node->next (which occupies the same
> > place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls()
> > does not recognize the chain end and the loop would go on to next socket
> > in the chain.
> >
> 
> I hear you, but where is the sk->sk_nulls_node->next would be set to
> NULL exactly ?

In __inet_hash() when the new listening socket is inserted into the
listening hashtable:

	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
		sk->sk_family == AF_INET6)
		hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
	else
		hlist_add_head_rcu(&sk->sk_node, &ilb->head);

If the chain is empty, sk->sk_node->next will be set to NULL by either
branch. And even if it's not, the loop in __inet_lookup_established()
would follow the chain from listening hashtable and still get to the
NULL end marker eventually.

Michal

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
  2019-11-20 19:52       ` Michal Kubecek
@ 2019-11-20 20:04         ` Eric Dumazet
  2019-11-20 20:49           ` Michal Kubecek
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2019-11-20 20:04 UTC (permalink / raw)
  To: Michal Kubecek, Eric Dumazet; +Cc: netdev, Firo Yang



On 11/20/19 11:52 AM, Michal Kubecek wrote:
> On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote:
>> On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>>>
>>> On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote:
>>>> On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>>>>
>>>>> Hello Eric,
>>>>>
>>>>> we are investigating a crash in socket lookup in a distribution kernel
>>>>> based on v4.12 but the possible problem we found seems to also apply to
>>>>> current mainline (or net) code.
>>>>>
>>>>> The common pattern is:
>>>>>
>>>>> - the crash always happens in __inet_lookup_established() in
>>>>>
>>>>>         sk_nulls_for_each_rcu(sk, node, &head->chain) {
>>>>>                 if (sk->sk_hash != hash)     <-----------------
>>>>>                         continue;
>>>>>
>>>>>   as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null
>>>>>   so dereferencing sk->sk_hash faults
>>>>>
>>>>> - the reason is that previous sk value pointed to a listening socket
>>>>>   rather than an established one; as listening socket uses sk_node, end
>>>>>   of the chain is marked by a null pointer which is not detected as
>>>>>   a chain end by sk_nulls_for_each_rcu()
>>>>>
>>>>> - there is no socket matching skb which is a TCP pure ACK having
>>>>>   127.0.0.1 as both source and destination
>>>>>
>>>>> - the chain pointed to by head variable is empty
>>>>>
>>>>> Firo Yang came with the theory that this could be a race between socket
>>>>> lookup and freing the socket and replacing it with a listening one:
>>>>>
>>>>> 1. CPU A gets a pointer to an established socket as sk in the
>>>>> sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not
>>>>> thake a reference to it.
>>>>>
>>>>> 2. CPU B frees the socket
>>>>>
>>>>> 3. Slab object pointed to by sk is reused for a new listening socket.
>>>>> This socket has null sk->sk_node->next which uses the same spot as
>>>>> sk->sk_nulls_node->next
>>>>>
>>>>> 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and
>>>>> follows the pointer, resulting in a fault dereferencing sk->sk_hash.
>>>>>
>>>>> Unless we missed something, there is no protection against established
>>>>> socket being freed and replaced by a new listening one while
>>>>> __inet_lookup_established() has a pointer to it. The RCU loop only
>>>>> prevents the slab object being reused for a different slab cache or
>>>>> something completely different but as established and listening sockets
>>>>> share the same slab cache, it does not protect us from switching from
>>>>> established to listening.
>>>>>
>>>>> As far as I can say, this kind of race could have happened for quite
>>>>> long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch
>>>>> listener sk_refcnt under synflood"), the worst that could happen would
>>>>> be switching to a chain in listener lookup table, following it to its
>>>>> end and then (most likely) restarting the lookup or failing. Now that
>>>>> established and listening sockets use different list types, replacing
>>>>> one with the other can be deadly.
>>>>>
>>>>> Do you agree that this race is possible or is there something we missed
>>>>> that would prevent it?
>>>>>
>>>> A listener is hashed on icsk_listen_portaddr_node, so I do not see how a
>>>> listener could be found in the establish chain ?
>>>
>>> It is not really in the chain. What we suspect is that between sk is
>>> assigned pointer to an established socket in __inet_lookup_established()
>>> and using sk->sk_nulls_node->next to go to the next (or stop if it's odd
>>> nulls value), this established socket could be freed and its slab object
>>> reused for a listening socket. As listening sockets no longer use a
>>> nulls hashlist but a normal hashlist, in the most common case where the
>>> socket is last in the chain, sk->sk_node->next (which occupies the same
>>> place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls()
>>> does not recognize the chain end and the loop would go on to next socket
>>> in the chain.
>>>
>>
>> I hear you, but where is the sk->sk_nulls_node->next would be set to
>> NULL exactly ?
> 
> In __inet_hash() when the new listening socket is inserted into the
> listening hashtable:
> 
> 	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
> 		sk->sk_family == AF_INET6)
> 		hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
> 	else
> 		hlist_add_head_rcu(&sk->sk_node, &ilb->head);
> 
> If the chain is empty, sk->sk_node->next will be set to NULL by either
> branch. And even if it's not, the loop in __inet_lookup_established()
> would follow the chain from listening hashtable and still get to the
> NULL end marker eventually.


Oh right, I was confused by icsk_listen_portaddr_node, but listener use two
hashes...

Do you have a patch, or do you want me to work on a fix ?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
  2019-11-20 20:04         ` Eric Dumazet
@ 2019-11-20 20:49           ` Michal Kubecek
  2019-11-20 20:57             ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Kubecek @ 2019-11-20 20:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, netdev, Firo Yang

On Wed, Nov 20, 2019 at 12:04:53PM -0800, Eric Dumazet wrote:
> 
> 
> On 11/20/19 11:52 AM, Michal Kubecek wrote:
> > On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote:
> >> On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> >>>
> >>> On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote:
> >>>> On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> >>>>
> >>>>> Hello Eric,
> >>>>>
> >>>>> we are investigating a crash in socket lookup in a distribution kernel
> >>>>> based on v4.12 but the possible problem we found seems to also apply to
> >>>>> current mainline (or net) code.
> >>>>>
> >>>>> The common pattern is:
> >>>>>
> >>>>> - the crash always happens in __inet_lookup_established() in
> >>>>>
> >>>>>         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> >>>>>                 if (sk->sk_hash != hash)     <-----------------
> >>>>>                         continue;
> >>>>>
> >>>>>   as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null
> >>>>>   so dereferencing sk->sk_hash faults
> >>>>>
> >>>>> - the reason is that previous sk value pointed to a listening socket
> >>>>>   rather than an established one; as listening socket uses sk_node, end
> >>>>>   of the chain is marked by a null pointer which is not detected as
> >>>>>   a chain end by sk_nulls_for_each_rcu()
> >>>>>
> >>>>> - there is no socket matching skb which is a TCP pure ACK having
> >>>>>   127.0.0.1 as both source and destination
> >>>>>
> >>>>> - the chain pointed to by head variable is empty
> >>>>>
> >>>>> Firo Yang came with the theory that this could be a race between socket
> >>>>> lookup and freing the socket and replacing it with a listening one:
> >>>>>
> >>>>> 1. CPU A gets a pointer to an established socket as sk in the
> >>>>> sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not
> >>>>> thake a reference to it.
> >>>>>
> >>>>> 2. CPU B frees the socket
> >>>>>
> >>>>> 3. Slab object pointed to by sk is reused for a new listening socket.
> >>>>> This socket has null sk->sk_node->next which uses the same spot as
> >>>>> sk->sk_nulls_node->next
> >>>>>
> >>>>> 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and
> >>>>> follows the pointer, resulting in a fault dereferencing sk->sk_hash.
> >>>>>
> >>>>> Unless we missed something, there is no protection against established
> >>>>> socket being freed and replaced by a new listening one while
> >>>>> __inet_lookup_established() has a pointer to it. The RCU loop only
> >>>>> prevents the slab object being reused for a different slab cache or
> >>>>> something completely different but as established and listening sockets
> >>>>> share the same slab cache, it does not protect us from switching from
> >>>>> established to listening.
> >>>>>
> >>>>> As far as I can say, this kind of race could have happened for quite
> >>>>> long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch
> >>>>> listener sk_refcnt under synflood"), the worst that could happen would
> >>>>> be switching to a chain in listener lookup table, following it to its
> >>>>> end and then (most likely) restarting the lookup or failing. Now that
> >>>>> established and listening sockets use different list types, replacing
> >>>>> one with the other can be deadly.
> >>>>>
> >>>>> Do you agree that this race is possible or is there something we missed
> >>>>> that would prevent it?
> >>>>>
> >>>> A listener is hashed on icsk_listen_portaddr_node, so I do not see how a
> >>>> listener could be found in the establish chain ?
> >>>
> >>> It is not really in the chain. What we suspect is that between sk is
> >>> assigned pointer to an established socket in __inet_lookup_established()
> >>> and using sk->sk_nulls_node->next to go to the next (or stop if it's odd
> >>> nulls value), this established socket could be freed and its slab object
> >>> reused for a listening socket. As listening sockets no longer use a
> >>> nulls hashlist but a normal hashlist, in the most common case where the
> >>> socket is last in the chain, sk->sk_node->next (which occupies the same
> >>> place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls()
> >>> does not recognize the chain end and the loop would go on to next socket
> >>> in the chain.
> >>>
> >>
> >> I hear you, but where is the sk->sk_nulls_node->next would be set to
> >> NULL exactly ?
> > 
> > In __inet_hash() when the new listening socket is inserted into the
> > listening hashtable:
> > 
> > 	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
> > 		sk->sk_family == AF_INET6)
> > 		hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
> > 	else
> > 		hlist_add_head_rcu(&sk->sk_node, &ilb->head);
> > 
> > If the chain is empty, sk->sk_node->next will be set to NULL by either
> > branch. And even if it's not, the loop in __inet_lookup_established()
> > would follow the chain from listening hashtable and still get to the
> > NULL end marker eventually.
> 
> 
> Oh right, I was confused by icsk_listen_portaddr_node, but listener use two
> hashes...
> 
> Do you have a patch, or do you want me to work on a fix ?

Firo suggested something like

------------------------------------------------------------------------
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net,
 
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
+		if (unlikely(!node))
+			goto begin;
 		if (sk->sk_hash != hash)
 			continue;
 		if (likely(INET_MATCH(sk, net, acookie,
------------------------------------------------------------------------

It depends on implementation details but I believe it would work. It
would be nicer if we could detect the switch to a listening socket but
I don't see how to make such test race free without introducing
unacceptable performance penalty.

Michal

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
  2019-11-20 20:49           ` Michal Kubecek
@ 2019-11-20 20:57             ` Eric Dumazet
  2019-11-20 21:13               ` Michal Kubecek
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2019-11-20 20:57 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: Eric Dumazet, netdev, Firo Yang

On Wed, Nov 20, 2019 at 12:49 PM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Wed, Nov 20, 2019 at 12:04:53PM -0800, Eric Dumazet wrote:
> >
> >
> > On 11/20/19 11:52 AM, Michal Kubecek wrote:
> > > On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote:
> > >> On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> > >>>
> > >>> On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote:
> > >>>> On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> > >>>>
> > >>>>> Hello Eric,
> > >>>>>
> > >>>>> we are investigating a crash in socket lookup in a distribution kernel
> > >>>>> based on v4.12 but the possible problem we found seems to also apply to
> > >>>>> current mainline (or net) code.
> > >>>>>
> > >>>>> The common pattern is:
> > >>>>>
> > >>>>> - the crash always happens in __inet_lookup_established() in
> > >>>>>
> > >>>>>         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > >>>>>                 if (sk->sk_hash != hash)     <-----------------
> > >>>>>                         continue;
> > >>>>>
> > >>>>>   as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null
> > >>>>>   so dereferencing sk->sk_hash faults
> > >>>>>
> > >>>>> - the reason is that previous sk value pointed to a listening socket
> > >>>>>   rather than an established one; as listening socket uses sk_node, end
> > >>>>>   of the chain is marked by a null pointer which is not detected as
> > >>>>>   a chain end by sk_nulls_for_each_rcu()
> > >>>>>
> > >>>>> - there is no socket matching skb which is a TCP pure ACK having
> > >>>>>   127.0.0.1 as both source and destination
> > >>>>>
> > >>>>> - the chain pointed to by head variable is empty
> > >>>>>
> > >>>>> Firo Yang came with the theory that this could be a race between socket
> > >>>>> lookup and freing the socket and replacing it with a listening one:
> > >>>>>
> > >>>>> 1. CPU A gets a pointer to an established socket as sk in the
> > >>>>> sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not
> > >>>>> thake a reference to it.
> > >>>>>
> > >>>>> 2. CPU B frees the socket
> > >>>>>
> > >>>>> 3. Slab object pointed to by sk is reused for a new listening socket.
> > >>>>> This socket has null sk->sk_node->next which uses the same spot as
> > >>>>> sk->sk_nulls_node->next
> > >>>>>
> > >>>>> 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and
> > >>>>> follows the pointer, resulting in a fault dereferencing sk->sk_hash.
> > >>>>>
> > >>>>> Unless we missed something, there is no protection against established
> > >>>>> socket being freed and replaced by a new listening one while
> > >>>>> __inet_lookup_established() has a pointer to it. The RCU loop only
> > >>>>> prevents the slab object being reused for a different slab cache or
> > >>>>> something completely different but as established and listening sockets
> > >>>>> share the same slab cache, it does not protect us from switching from
> > >>>>> established to listening.
> > >>>>>
> > >>>>> As far as I can say, this kind of race could have happened for quite
> > >>>>> long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch
> > >>>>> listener sk_refcnt under synflood"), the worst that could happen would
> > >>>>> be switching to a chain in listener lookup table, following it to its
> > >>>>> end and then (most likely) restarting the lookup or failing. Now that
> > >>>>> established and listening sockets use different list types, replacing
> > >>>>> one with the other can be deadly.
> > >>>>>
> > >>>>> Do you agree that this race is possible or is there something we missed
> > >>>>> that would prevent it?
> > >>>>>
> > >>>> A listener is hashed on icsk_listen_portaddr_node, so I do not see how a
> > >>>> listener could be found in the establish chain ?
> > >>>
> > >>> It is not really in the chain. What we suspect is that between sk is
> > >>> assigned pointer to an established socket in __inet_lookup_established()
> > >>> and using sk->sk_nulls_node->next to go to the next (or stop if it's odd
> > >>> nulls value), this established socket could be freed and its slab object
> > >>> reused for a listening socket. As listening sockets no longer use a
> > >>> nulls hashlist but a normal hashlist, in the most common case where the
> > >>> socket is last in the chain, sk->sk_node->next (which occupies the same
> > >>> place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls()
> > >>> does not recognize the chain end and the loop would go on to next socket
> > >>> in the chain.
> > >>>
> > >>
> > >> I hear you, but where is the sk->sk_nulls_node->next would be set to
> > >> NULL exactly ?
> > >
> > > In __inet_hash() when the new listening socket is inserted into the
> > > listening hashtable:
> > >
> > >     if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
> > >             sk->sk_family == AF_INET6)
> > >             hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
> > >     else
> > >             hlist_add_head_rcu(&sk->sk_node, &ilb->head);
> > >
> > > If the chain is empty, sk->sk_node->next will be set to NULL by either
> > > branch. And even if it's not, the loop in __inet_lookup_established()
> > > would follow the chain from listening hashtable and still get to the
> > > NULL end marker eventually.
> >
> >
> > Oh right, I was confused by icsk_listen_portaddr_node, but listener use two
> > hashes...
> >
> > Do you have a patch, or do you want me to work on a fix ?
>
> Firo suggested something like
>
> ------------------------------------------------------------------------
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net,
>
>  begin:
>         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> +               if (unlikely(!node))
> +                       goto begin;
>                 if (sk->sk_hash != hash)
>                         continue;
>                 if (likely(INET_MATCH(sk, net, acookie,
> ------------------------------------------------------------------------
>
> It depends on implementation details but I believe it would work. It
> would be nicer if we could detect the switch to a listening socket but
> I don't see how to make such test race free without introducing
> unacceptable performance penalty.

No, we do not want to add more checks in the fast path really.

I was more thinking about not breaking the RCU invariants.

(ie : adding back the nulls stuff that I removed in 3b24d854cb35
("tcp/dccp: do not touch
listener sk_refcnt under synflood")

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
  2019-11-20 20:57             ` Eric Dumazet
@ 2019-11-20 21:13               ` Michal Kubecek
  2019-11-20 21:54                 ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Kubecek @ 2019-11-20 21:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, netdev, Firo Yang

On Wed, Nov 20, 2019 at 12:57:48PM -0800, Eric Dumazet wrote:
> On Wed, Nov 20, 2019 at 12:49 PM Michal Kubecek <mkubecek@suse.cz> wrote:
> > Firo suggested something like
> >
> > ------------------------------------------------------------------------
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net,
> >
> >  begin:
> >         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > +               if (unlikely(!node))
> > +                       goto begin;
> >                 if (sk->sk_hash != hash)
> >                         continue;
> >                 if (likely(INET_MATCH(sk, net, acookie,
> > ------------------------------------------------------------------------
> >
> > It depends on implementation details but I believe it would work. It
> > would be nicer if we could detect the switch to a listening socket but
> > I don't see how to make such test race free without introducing
> > unacceptable performance penalty.
> 
> No, we do not want to add more checks in the fast path really.
> 
> I was more thinking about not breaking the RCU invariants.
> 
> (ie : adding back the nulls stuff that I removed in 3b24d854cb35
> ("tcp/dccp: do not touch
> listener sk_refcnt under synflood")

Yes, that would do the trick. It would add some cycles to listener
lookup but that is less harm than slowing down established socket
lookup.

Michal

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: possible race in __inet_lookup_established()
  2019-11-20 21:13               ` Michal Kubecek
@ 2019-11-20 21:54                 ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2019-11-20 21:54 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: Eric Dumazet, netdev, Firo Yang

On Wed, Nov 20, 2019 at 1:13 PM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Wed, Nov 20, 2019 at 12:57:48PM -0800, Eric Dumazet wrote:
> > On Wed, Nov 20, 2019 at 12:49 PM Michal Kubecek <mkubecek@suse.cz> wrote:
> > > Firo suggested something like
> > >
> > > ------------------------------------------------------------------------
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net,
> > >
> > >  begin:
> > >         sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > > +               if (unlikely(!node))
> > > +                       goto begin;
> > >                 if (sk->sk_hash != hash)
> > >                         continue;
> > >                 if (likely(INET_MATCH(sk, net, acookie,
> > > ------------------------------------------------------------------------
> > >
> > > It depends on implementation details but I believe it would work. It
> > > would be nicer if we could detect the switch to a listening socket but
> > > I don't see how to make such test race free without introducing
> > > unacceptable performance penalty.
> >
> > No, we do not want to add more checks in the fast path really.
> >
> > I was more thinking about not breaking the RCU invariants.
> >
> > (ie : adding back the nulls stuff that I removed in 3b24d854cb35
> > ("tcp/dccp: do not touch
> > listener sk_refcnt under synflood")
>
> Yes, that would do the trick. It would add some cycles to listener
> lookup but that is less harm than slowing down established socket
> lookup.
>

It should not change cycles spent in listener lookup.

Only the test to check for the iteration end will not use NULL, that's about it.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-20  8:39 possible race in __inet_lookup_established() Michal Kubecek
     [not found] ` <CANn89iJYXh7AwK8_Aiz3wXqugG0icPNW6OPsPxwOvpH90kr+Ew@mail.gmail.com>
2019-11-20 18:10   ` Michal Kubecek
2019-11-20 19:13     ` Eric Dumazet
2019-11-20 19:52       ` Michal Kubecek
2019-11-20 20:04         ` Eric Dumazet
2019-11-20 20:49           ` Michal Kubecek
2019-11-20 20:57             ` Eric Dumazet
2019-11-20 21:13               ` Michal Kubecek
2019-11-20 21:54                 ` Eric Dumazet

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git