* possible race in __inet_lookup_established() @ 2019-11-20 8:39 Michal Kubecek [not found] ` <CANn89iJYXh7AwK8_Aiz3wXqugG0icPNW6OPsPxwOvpH90kr+Ew@mail.gmail.com> 0 siblings, 1 reply; 9+ messages in thread From: Michal Kubecek @ 2019-11-20 8:39 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev, Firo Yang Hello Eric, we are investigating a crash in socket lookup in a distribution kernel based on v4.12 but the possible problem we found seems to also apply to current mainline (or net) code. The common pattern is: - the crash always happens in __inet_lookup_established() in sk_nulls_for_each_rcu(sk, node, &head->chain) { if (sk->sk_hash != hash) <----------------- continue; as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null so dereferencing sk->sk_hash faults - the reason is that previous sk value pointed to a listening socket rather than an established one; as listening socket uses sk_node, end of the chain is marked by a null pointer which is not detected as a chain end by sk_nulls_for_each_rcu() - there is no socket matching skb which is a TCP pure ACK having 127.0.0.1 as both source and destination - the chain pointed to by head variable is empty Firo Yang came with the theory that this could be a race between socket lookup and freing the socket and replacing it with a listening one: 1. CPU A gets a pointer to an established socket as sk in the sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not thake a reference to it. 2. CPU B frees the socket 3. Slab object pointed to by sk is reused for a new listening socket. This socket has null sk->sk_node->next which uses the same spot as sk->sk_nulls_node->next 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and follows the pointer, resulting in a fault dereferencing sk->sk_hash. Unless we missed something, there is no protection against established socket being freed and replaced by a new listening one while __inet_lookup_established() has a pointer to it. The RCU loop only prevents the slab object being reused for a different slab cache or something completely different but as established and listening sockets share the same slab cache, it does not protect us from switching from established to listening. As far as I can say, this kind of race could have happened for quite long but before your commit 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood"), the worst that could happen would be switching to a chain in listener lookup table, following it to its end and then (most likely) restarting the lookup or failing. Now that established and listening sockets use different list types, replacing one with the other can be deadly. Do you agree that this race is possible or is there something we missed that would prevent it? Michal ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <CANn89iJYXh7AwK8_Aiz3wXqugG0icPNW6OPsPxwOvpH90kr+Ew@mail.gmail.com>]
* Re: possible race in __inet_lookup_established() [not found] ` <CANn89iJYXh7AwK8_Aiz3wXqugG0icPNW6OPsPxwOvpH90kr+Ew@mail.gmail.com> @ 2019-11-20 18:10 ` Michal Kubecek 2019-11-20 19:13 ` Eric Dumazet 0 siblings, 1 reply; 9+ messages in thread From: Michal Kubecek @ 2019-11-20 18:10 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev, Firo Yang On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote: > On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote: > > > Hello Eric, > > > > we are investigating a crash in socket lookup in a distribution kernel > > based on v4.12 but the possible problem we found seems to also apply to > > current mainline (or net) code. > > > > The common pattern is: > > > > - the crash always happens in __inet_lookup_established() in > > > > sk_nulls_for_each_rcu(sk, node, &head->chain) { > > if (sk->sk_hash != hash) <----------------- > > continue; > > > > as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null > > so dereferencing sk->sk_hash faults > > > > - the reason is that previous sk value pointed to a listening socket > > rather than an established one; as listening socket uses sk_node, end > > of the chain is marked by a null pointer which is not detected as > > a chain end by sk_nulls_for_each_rcu() > > > > - there is no socket matching skb which is a TCP pure ACK having > > 127.0.0.1 as both source and destination > > > > - the chain pointed to by head variable is empty > > > > Firo Yang came with the theory that this could be a race between socket > > lookup and freing the socket and replacing it with a listening one: > > > > 1. CPU A gets a pointer to an established socket as sk in the > > sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not > > thake a reference to it. > > > > 2. CPU B frees the socket > > > > 3. Slab object pointed to by sk is reused for a new listening socket. > > This socket has null sk->sk_node->next which uses the same spot as > > sk->sk_nulls_node->next > > > > 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and > > follows the pointer, resulting in a fault dereferencing sk->sk_hash. > > > > Unless we missed something, there is no protection against established > > socket being freed and replaced by a new listening one while > > __inet_lookup_established() has a pointer to it. The RCU loop only > > prevents the slab object being reused for a different slab cache or > > something completely different but as established and listening sockets > > share the same slab cache, it does not protect us from switching from > > established to listening. > > > > As far as I can say, this kind of race could have happened for quite > > long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch > > listener sk_refcnt under synflood"), the worst that could happen would > > be switching to a chain in listener lookup table, following it to its > > end and then (most likely) restarting the lookup or failing. Now that > > established and listening sockets use different list types, replacing > > one with the other can be deadly. > > > > Do you agree that this race is possible or is there something we missed > > that would prevent it? > > > A listener is hashed on icsk_listen_portaddr_node, so I do not see how a > listener could be found in the establish chain ? It is not really in the chain. What we suspect is that between sk is assigned pointer to an established socket in __inet_lookup_established() and using sk->sk_nulls_node->next to go to the next (or stop if it's odd nulls value), this established socket could be freed and its slab object reused for a listening socket. As listening sockets no longer use a nulls hashlist but a normal hashlist, in the most common case where the socket is last in the chain, sk->sk_node->next (which occupies the same place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls() does not recognize the chain end and the loop would go on to next socket in the chain. Michal > > sock_copy() makes sure to not touch sk_node > > sk_prot_clear_nulls() makes sure to not touch sk_node > > So maybe you miss a backport or something ? ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: possible race in __inet_lookup_established() 2019-11-20 18:10 ` Michal Kubecek @ 2019-11-20 19:13 ` Eric Dumazet 2019-11-20 19:52 ` Michal Kubecek 0 siblings, 1 reply; 9+ messages in thread From: Eric Dumazet @ 2019-11-20 19:13 UTC (permalink / raw) To: Michal Kubecek; +Cc: netdev, Firo Yang On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote: > > On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote: > > On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote: > > > > > Hello Eric, > > > > > > we are investigating a crash in socket lookup in a distribution kernel > > > based on v4.12 but the possible problem we found seems to also apply to > > > current mainline (or net) code. > > > > > > The common pattern is: > > > > > > - the crash always happens in __inet_lookup_established() in > > > > > > sk_nulls_for_each_rcu(sk, node, &head->chain) { > > > if (sk->sk_hash != hash) <----------------- > > > continue; > > > > > > as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null > > > so dereferencing sk->sk_hash faults > > > > > > - the reason is that previous sk value pointed to a listening socket > > > rather than an established one; as listening socket uses sk_node, end > > > of the chain is marked by a null pointer which is not detected as > > > a chain end by sk_nulls_for_each_rcu() > > > > > > - there is no socket matching skb which is a TCP pure ACK having > > > 127.0.0.1 as both source and destination > > > > > > - the chain pointed to by head variable is empty > > > > > > Firo Yang came with the theory that this could be a race between socket > > > lookup and freing the socket and replacing it with a listening one: > > > > > > 1. CPU A gets a pointer to an established socket as sk in the > > > sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not > > > thake a reference to it. > > > > > > 2. CPU B frees the socket > > > > > > 3. Slab object pointed to by sk is reused for a new listening socket. > > > This socket has null sk->sk_node->next which uses the same spot as > > > sk->sk_nulls_node->next > > > > > > 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and > > > follows the pointer, resulting in a fault dereferencing sk->sk_hash. > > > > > > Unless we missed something, there is no protection against established > > > socket being freed and replaced by a new listening one while > > > __inet_lookup_established() has a pointer to it. The RCU loop only > > > prevents the slab object being reused for a different slab cache or > > > something completely different but as established and listening sockets > > > share the same slab cache, it does not protect us from switching from > > > established to listening. > > > > > > As far as I can say, this kind of race could have happened for quite > > > long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch > > > listener sk_refcnt under synflood"), the worst that could happen would > > > be switching to a chain in listener lookup table, following it to its > > > end and then (most likely) restarting the lookup or failing. Now that > > > established and listening sockets use different list types, replacing > > > one with the other can be deadly. > > > > > > Do you agree that this race is possible or is there something we missed > > > that would prevent it? > > > > > A listener is hashed on icsk_listen_portaddr_node, so I do not see how a > > listener could be found in the establish chain ? > > It is not really in the chain. What we suspect is that between sk is > assigned pointer to an established socket in __inet_lookup_established() > and using sk->sk_nulls_node->next to go to the next (or stop if it's odd > nulls value), this established socket could be freed and its slab object > reused for a listening socket. As listening sockets no longer use a > nulls hashlist but a normal hashlist, in the most common case where the > socket is last in the chain, sk->sk_node->next (which occupies the same > place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls() > does not recognize the chain end and the loop would go on to next socket > in the chain. > I hear you, but where is the sk->sk_nulls_node->next would be set to NULL exactly ? > Michal > > > > > sock_copy() makes sure to not touch sk_node > > > > sk_prot_clear_nulls() makes sure to not touch sk_node > > > > So maybe you miss a backport or something ? ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: possible race in __inet_lookup_established() 2019-11-20 19:13 ` Eric Dumazet @ 2019-11-20 19:52 ` Michal Kubecek 2019-11-20 20:04 ` Eric Dumazet 0 siblings, 1 reply; 9+ messages in thread From: Michal Kubecek @ 2019-11-20 19:52 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev, Firo Yang On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote: > On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote: > > > > On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote: > > > On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote: > > > > > > > Hello Eric, > > > > > > > > we are investigating a crash in socket lookup in a distribution kernel > > > > based on v4.12 but the possible problem we found seems to also apply to > > > > current mainline (or net) code. > > > > > > > > The common pattern is: > > > > > > > > - the crash always happens in __inet_lookup_established() in > > > > > > > > sk_nulls_for_each_rcu(sk, node, &head->chain) { > > > > if (sk->sk_hash != hash) <----------------- > > > > continue; > > > > > > > > as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null > > > > so dereferencing sk->sk_hash faults > > > > > > > > - the reason is that previous sk value pointed to a listening socket > > > > rather than an established one; as listening socket uses sk_node, end > > > > of the chain is marked by a null pointer which is not detected as > > > > a chain end by sk_nulls_for_each_rcu() > > > > > > > > - there is no socket matching skb which is a TCP pure ACK having > > > > 127.0.0.1 as both source and destination > > > > > > > > - the chain pointed to by head variable is empty > > > > > > > > Firo Yang came with the theory that this could be a race between socket > > > > lookup and freing the socket and replacing it with a listening one: > > > > > > > > 1. CPU A gets a pointer to an established socket as sk in the > > > > sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not > > > > thake a reference to it. > > > > > > > > 2. CPU B frees the socket > > > > > > > > 3. Slab object pointed to by sk is reused for a new listening socket. > > > > This socket has null sk->sk_node->next which uses the same spot as > > > > sk->sk_nulls_node->next > > > > > > > > 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and > > > > follows the pointer, resulting in a fault dereferencing sk->sk_hash. > > > > > > > > Unless we missed something, there is no protection against established > > > > socket being freed and replaced by a new listening one while > > > > __inet_lookup_established() has a pointer to it. The RCU loop only > > > > prevents the slab object being reused for a different slab cache or > > > > something completely different but as established and listening sockets > > > > share the same slab cache, it does not protect us from switching from > > > > established to listening. > > > > > > > > As far as I can say, this kind of race could have happened for quite > > > > long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch > > > > listener sk_refcnt under synflood"), the worst that could happen would > > > > be switching to a chain in listener lookup table, following it to its > > > > end and then (most likely) restarting the lookup or failing. Now that > > > > established and listening sockets use different list types, replacing > > > > one with the other can be deadly. > > > > > > > > Do you agree that this race is possible or is there something we missed > > > > that would prevent it? > > > > > > > A listener is hashed on icsk_listen_portaddr_node, so I do not see how a > > > listener could be found in the establish chain ? > > > > It is not really in the chain. What we suspect is that between sk is > > assigned pointer to an established socket in __inet_lookup_established() > > and using sk->sk_nulls_node->next to go to the next (or stop if it's odd > > nulls value), this established socket could be freed and its slab object > > reused for a listening socket. As listening sockets no longer use a > > nulls hashlist but a normal hashlist, in the most common case where the > > socket is last in the chain, sk->sk_node->next (which occupies the same > > place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls() > > does not recognize the chain end and the loop would go on to next socket > > in the chain. > > > > I hear you, but where is the sk->sk_nulls_node->next would be set to > NULL exactly ? In __inet_hash() when the new listening socket is inserted into the listening hashtable: if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport && sk->sk_family == AF_INET6) hlist_add_tail_rcu(&sk->sk_node, &ilb->head); else hlist_add_head_rcu(&sk->sk_node, &ilb->head); If the chain is empty, sk->sk_node->next will be set to NULL by either branch. And even if it's not, the loop in __inet_lookup_established() would follow the chain from listening hashtable and still get to the NULL end marker eventually. Michal ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: possible race in __inet_lookup_established() 2019-11-20 19:52 ` Michal Kubecek @ 2019-11-20 20:04 ` Eric Dumazet 2019-11-20 20:49 ` Michal Kubecek 0 siblings, 1 reply; 9+ messages in thread From: Eric Dumazet @ 2019-11-20 20:04 UTC (permalink / raw) To: Michal Kubecek, Eric Dumazet; +Cc: netdev, Firo Yang On 11/20/19 11:52 AM, Michal Kubecek wrote: > On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote: >> On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote: >>> >>> On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote: >>>> On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote: >>>> >>>>> Hello Eric, >>>>> >>>>> we are investigating a crash in socket lookup in a distribution kernel >>>>> based on v4.12 but the possible problem we found seems to also apply to >>>>> current mainline (or net) code. >>>>> >>>>> The common pattern is: >>>>> >>>>> - the crash always happens in __inet_lookup_established() in >>>>> >>>>> sk_nulls_for_each_rcu(sk, node, &head->chain) { >>>>> if (sk->sk_hash != hash) <----------------- >>>>> continue; >>>>> >>>>> as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null >>>>> so dereferencing sk->sk_hash faults >>>>> >>>>> - the reason is that previous sk value pointed to a listening socket >>>>> rather than an established one; as listening socket uses sk_node, end >>>>> of the chain is marked by a null pointer which is not detected as >>>>> a chain end by sk_nulls_for_each_rcu() >>>>> >>>>> - there is no socket matching skb which is a TCP pure ACK having >>>>> 127.0.0.1 as both source and destination >>>>> >>>>> - the chain pointed to by head variable is empty >>>>> >>>>> Firo Yang came with the theory that this could be a race between socket >>>>> lookup and freing the socket and replacing it with a listening one: >>>>> >>>>> 1. CPU A gets a pointer to an established socket as sk in the >>>>> sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not >>>>> thake a reference to it. >>>>> >>>>> 2. CPU B frees the socket >>>>> >>>>> 3. Slab object pointed to by sk is reused for a new listening socket. >>>>> This socket has null sk->sk_node->next which uses the same spot as >>>>> sk->sk_nulls_node->next >>>>> >>>>> 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and >>>>> follows the pointer, resulting in a fault dereferencing sk->sk_hash. >>>>> >>>>> Unless we missed something, there is no protection against established >>>>> socket being freed and replaced by a new listening one while >>>>> __inet_lookup_established() has a pointer to it. The RCU loop only >>>>> prevents the slab object being reused for a different slab cache or >>>>> something completely different but as established and listening sockets >>>>> share the same slab cache, it does not protect us from switching from >>>>> established to listening. >>>>> >>>>> As far as I can say, this kind of race could have happened for quite >>>>> long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch >>>>> listener sk_refcnt under synflood"), the worst that could happen would >>>>> be switching to a chain in listener lookup table, following it to its >>>>> end and then (most likely) restarting the lookup or failing. Now that >>>>> established and listening sockets use different list types, replacing >>>>> one with the other can be deadly. >>>>> >>>>> Do you agree that this race is possible or is there something we missed >>>>> that would prevent it? >>>>> >>>> A listener is hashed on icsk_listen_portaddr_node, so I do not see how a >>>> listener could be found in the establish chain ? >>> >>> It is not really in the chain. What we suspect is that between sk is >>> assigned pointer to an established socket in __inet_lookup_established() >>> and using sk->sk_nulls_node->next to go to the next (or stop if it's odd >>> nulls value), this established socket could be freed and its slab object >>> reused for a listening socket. As listening sockets no longer use a >>> nulls hashlist but a normal hashlist, in the most common case where the >>> socket is last in the chain, sk->sk_node->next (which occupies the same >>> place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls() >>> does not recognize the chain end and the loop would go on to next socket >>> in the chain. >>> >> >> I hear you, but where is the sk->sk_nulls_node->next would be set to >> NULL exactly ? > > In __inet_hash() when the new listening socket is inserted into the > listening hashtable: > > if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport && > sk->sk_family == AF_INET6) > hlist_add_tail_rcu(&sk->sk_node, &ilb->head); > else > hlist_add_head_rcu(&sk->sk_node, &ilb->head); > > If the chain is empty, sk->sk_node->next will be set to NULL by either > branch. And even if it's not, the loop in __inet_lookup_established() > would follow the chain from listening hashtable and still get to the > NULL end marker eventually. Oh right, I was confused by icsk_listen_portaddr_node, but listener use two hashes... Do you have a patch, or do you want me to work on a fix ? ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: possible race in __inet_lookup_established() 2019-11-20 20:04 ` Eric Dumazet @ 2019-11-20 20:49 ` Michal Kubecek 2019-11-20 20:57 ` Eric Dumazet 0 siblings, 1 reply; 9+ messages in thread From: Michal Kubecek @ 2019-11-20 20:49 UTC (permalink / raw) To: Eric Dumazet; +Cc: Eric Dumazet, netdev, Firo Yang On Wed, Nov 20, 2019 at 12:04:53PM -0800, Eric Dumazet wrote: > > > On 11/20/19 11:52 AM, Michal Kubecek wrote: > > On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote: > >> On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote: > >>> > >>> On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote: > >>>> On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote: > >>>> > >>>>> Hello Eric, > >>>>> > >>>>> we are investigating a crash in socket lookup in a distribution kernel > >>>>> based on v4.12 but the possible problem we found seems to also apply to > >>>>> current mainline (or net) code. > >>>>> > >>>>> The common pattern is: > >>>>> > >>>>> - the crash always happens in __inet_lookup_established() in > >>>>> > >>>>> sk_nulls_for_each_rcu(sk, node, &head->chain) { > >>>>> if (sk->sk_hash != hash) <----------------- > >>>>> continue; > >>>>> > >>>>> as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null > >>>>> so dereferencing sk->sk_hash faults > >>>>> > >>>>> - the reason is that previous sk value pointed to a listening socket > >>>>> rather than an established one; as listening socket uses sk_node, end > >>>>> of the chain is marked by a null pointer which is not detected as > >>>>> a chain end by sk_nulls_for_each_rcu() > >>>>> > >>>>> - there is no socket matching skb which is a TCP pure ACK having > >>>>> 127.0.0.1 as both source and destination > >>>>> > >>>>> - the chain pointed to by head variable is empty > >>>>> > >>>>> Firo Yang came with the theory that this could be a race between socket > >>>>> lookup and freing the socket and replacing it with a listening one: > >>>>> > >>>>> 1. CPU A gets a pointer to an established socket as sk in the > >>>>> sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not > >>>>> thake a reference to it. > >>>>> > >>>>> 2. CPU B frees the socket > >>>>> > >>>>> 3. Slab object pointed to by sk is reused for a new listening socket. > >>>>> This socket has null sk->sk_node->next which uses the same spot as > >>>>> sk->sk_nulls_node->next > >>>>> > >>>>> 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and > >>>>> follows the pointer, resulting in a fault dereferencing sk->sk_hash. > >>>>> > >>>>> Unless we missed something, there is no protection against established > >>>>> socket being freed and replaced by a new listening one while > >>>>> __inet_lookup_established() has a pointer to it. The RCU loop only > >>>>> prevents the slab object being reused for a different slab cache or > >>>>> something completely different but as established and listening sockets > >>>>> share the same slab cache, it does not protect us from switching from > >>>>> established to listening. > >>>>> > >>>>> As far as I can say, this kind of race could have happened for quite > >>>>> long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch > >>>>> listener sk_refcnt under synflood"), the worst that could happen would > >>>>> be switching to a chain in listener lookup table, following it to its > >>>>> end and then (most likely) restarting the lookup or failing. Now that > >>>>> established and listening sockets use different list types, replacing > >>>>> one with the other can be deadly. > >>>>> > >>>>> Do you agree that this race is possible or is there something we missed > >>>>> that would prevent it? > >>>>> > >>>> A listener is hashed on icsk_listen_portaddr_node, so I do not see how a > >>>> listener could be found in the establish chain ? > >>> > >>> It is not really in the chain. What we suspect is that between sk is > >>> assigned pointer to an established socket in __inet_lookup_established() > >>> and using sk->sk_nulls_node->next to go to the next (or stop if it's odd > >>> nulls value), this established socket could be freed and its slab object > >>> reused for a listening socket. As listening sockets no longer use a > >>> nulls hashlist but a normal hashlist, in the most common case where the > >>> socket is last in the chain, sk->sk_node->next (which occupies the same > >>> place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls() > >>> does not recognize the chain end and the loop would go on to next socket > >>> in the chain. > >>> > >> > >> I hear you, but where is the sk->sk_nulls_node->next would be set to > >> NULL exactly ? > > > > In __inet_hash() when the new listening socket is inserted into the > > listening hashtable: > > > > if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport && > > sk->sk_family == AF_INET6) > > hlist_add_tail_rcu(&sk->sk_node, &ilb->head); > > else > > hlist_add_head_rcu(&sk->sk_node, &ilb->head); > > > > If the chain is empty, sk->sk_node->next will be set to NULL by either > > branch. And even if it's not, the loop in __inet_lookup_established() > > would follow the chain from listening hashtable and still get to the > > NULL end marker eventually. > > > Oh right, I was confused by icsk_listen_portaddr_node, but listener use two > hashes... > > Do you have a patch, or do you want me to work on a fix ? Firo suggested something like ------------------------------------------------------------------------ --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net, begin: sk_nulls_for_each_rcu(sk, node, &head->chain) { + if (unlikely(!node)) + goto begin; if (sk->sk_hash != hash) continue; if (likely(INET_MATCH(sk, net, acookie, ------------------------------------------------------------------------ It depends on implementation details but I believe it would work. It would be nicer if we could detect the switch to a listening socket but I don't see how to make such test race free without introducing unacceptable performance penalty. Michal ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: possible race in __inet_lookup_established() 2019-11-20 20:49 ` Michal Kubecek @ 2019-11-20 20:57 ` Eric Dumazet 2019-11-20 21:13 ` Michal Kubecek 0 siblings, 1 reply; 9+ messages in thread From: Eric Dumazet @ 2019-11-20 20:57 UTC (permalink / raw) To: Michal Kubecek; +Cc: Eric Dumazet, netdev, Firo Yang On Wed, Nov 20, 2019 at 12:49 PM Michal Kubecek <mkubecek@suse.cz> wrote: > > On Wed, Nov 20, 2019 at 12:04:53PM -0800, Eric Dumazet wrote: > > > > > > On 11/20/19 11:52 AM, Michal Kubecek wrote: > > > On Wed, Nov 20, 2019 at 11:13:09AM -0800, Eric Dumazet wrote: > > >> On Wed, Nov 20, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote: > > >>> > > >>> On Wed, Nov 20, 2019 at 08:12:10AM -0800, Eric Dumazet wrote: > > >>>> On Wed, Nov 20, 2019 at 12:39 AM Michal Kubecek <mkubecek@suse.cz> wrote: > > >>>> > > >>>>> Hello Eric, > > >>>>> > > >>>>> we are investigating a crash in socket lookup in a distribution kernel > > >>>>> based on v4.12 but the possible problem we found seems to also apply to > > >>>>> current mainline (or net) code. > > >>>>> > > >>>>> The common pattern is: > > >>>>> > > >>>>> - the crash always happens in __inet_lookup_established() in > > >>>>> > > >>>>> sk_nulls_for_each_rcu(sk, node, &head->chain) { > > >>>>> if (sk->sk_hash != hash) <----------------- > > >>>>> continue; > > >>>>> > > >>>>> as sk is an invalid pointer; in particular, &sk->sk_nulls_node is null > > >>>>> so dereferencing sk->sk_hash faults > > >>>>> > > >>>>> - the reason is that previous sk value pointed to a listening socket > > >>>>> rather than an established one; as listening socket uses sk_node, end > > >>>>> of the chain is marked by a null pointer which is not detected as > > >>>>> a chain end by sk_nulls_for_each_rcu() > > >>>>> > > >>>>> - there is no socket matching skb which is a TCP pure ACK having > > >>>>> 127.0.0.1 as both source and destination > > >>>>> > > >>>>> - the chain pointed to by head variable is empty > > >>>>> > > >>>>> Firo Yang came with the theory that this could be a race between socket > > >>>>> lookup and freing the socket and replacing it with a listening one: > > >>>>> > > >>>>> 1. CPU A gets a pointer to an established socket as sk in the > > >>>>> sk_nulls_for_each_rcu() loop in __inet_lookup_established() but does not > > >>>>> thake a reference to it. > > >>>>> > > >>>>> 2. CPU B frees the socket > > >>>>> > > >>>>> 3. Slab object pointed to by sk is reused for a new listening socket. > > >>>>> This socket has null sk->sk_node->next which uses the same spot as > > >>>>> sk->sk_nulls_node->next > > >>>>> > > >>>>> 4. CPU A tests sk->sk_nulls_node->next with is_a_nulls() (false) and > > >>>>> follows the pointer, resulting in a fault dereferencing sk->sk_hash. > > >>>>> > > >>>>> Unless we missed something, there is no protection against established > > >>>>> socket being freed and replaced by a new listening one while > > >>>>> __inet_lookup_established() has a pointer to it. The RCU loop only > > >>>>> prevents the slab object being reused for a different slab cache or > > >>>>> something completely different but as established and listening sockets > > >>>>> share the same slab cache, it does not protect us from switching from > > >>>>> established to listening. > > >>>>> > > >>>>> As far as I can say, this kind of race could have happened for quite > > >>>>> long but before your commit ou3b24d854cb35 ("tcp/dccp: do not touch > > >>>>> listener sk_refcnt under synflood"), the worst that could happen would > > >>>>> be switching to a chain in listener lookup table, following it to its > > >>>>> end and then (most likely) restarting the lookup or failing. Now that > > >>>>> established and listening sockets use different list types, replacing > > >>>>> one with the other can be deadly. > > >>>>> > > >>>>> Do you agree that this race is possible or is there something we missed > > >>>>> that would prevent it? > > >>>>> > > >>>> A listener is hashed on icsk_listen_portaddr_node, so I do not see how a > > >>>> listener could be found in the establish chain ? > > >>> > > >>> It is not really in the chain. What we suspect is that between sk is > > >>> assigned pointer to an established socket in __inet_lookup_established() > > >>> and using sk->sk_nulls_node->next to go to the next (or stop if it's odd > > >>> nulls value), this established socket could be freed and its slab object > > >>> reused for a listening socket. As listening sockets no longer use a > > >>> nulls hashlist but a normal hashlist, in the most common case where the > > >>> socket is last in the chain, sk->sk_node->next (which occupies the same > > >>> place as sk->sk_nulls_node->next) would be NULL so that is_a_nulls() > > >>> does not recognize the chain end and the loop would go on to next socket > > >>> in the chain. > > >>> > > >> > > >> I hear you, but where is the sk->sk_nulls_node->next would be set to > > >> NULL exactly ? > > > > > > In __inet_hash() when the new listening socket is inserted into the > > > listening hashtable: > > > > > > if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport && > > > sk->sk_family == AF_INET6) > > > hlist_add_tail_rcu(&sk->sk_node, &ilb->head); > > > else > > > hlist_add_head_rcu(&sk->sk_node, &ilb->head); > > > > > > If the chain is empty, sk->sk_node->next will be set to NULL by either > > > branch. And even if it's not, the loop in __inet_lookup_established() > > > would follow the chain from listening hashtable and still get to the > > > NULL end marker eventually. > > > > > > Oh right, I was confused by icsk_listen_portaddr_node, but listener use two > > hashes... > > > > Do you have a patch, or do you want me to work on a fix ? > > Firo suggested something like > > ------------------------------------------------------------------------ > --- a/net/ipv4/inet_hashtables.c > +++ b/net/ipv4/inet_hashtables.c > @@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net, > > begin: > sk_nulls_for_each_rcu(sk, node, &head->chain) { > + if (unlikely(!node)) > + goto begin; > if (sk->sk_hash != hash) > continue; > if (likely(INET_MATCH(sk, net, acookie, > ------------------------------------------------------------------------ > > It depends on implementation details but I believe it would work. It > would be nicer if we could detect the switch to a listening socket but > I don't see how to make such test race free without introducing > unacceptable performance penalty. No, we do not want to add more checks in the fast path really. I was more thinking about not breaking the RCU invariants. (ie : adding back the nulls stuff that I removed in 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: possible race in __inet_lookup_established() 2019-11-20 20:57 ` Eric Dumazet @ 2019-11-20 21:13 ` Michal Kubecek 2019-11-20 21:54 ` Eric Dumazet 0 siblings, 1 reply; 9+ messages in thread From: Michal Kubecek @ 2019-11-20 21:13 UTC (permalink / raw) To: Eric Dumazet; +Cc: Eric Dumazet, netdev, Firo Yang On Wed, Nov 20, 2019 at 12:57:48PM -0800, Eric Dumazet wrote: > On Wed, Nov 20, 2019 at 12:49 PM Michal Kubecek <mkubecek@suse.cz> wrote: > > Firo suggested something like > > > > ------------------------------------------------------------------------ > > --- a/net/ipv4/inet_hashtables.c > > +++ b/net/ipv4/inet_hashtables.c > > @@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net, > > > > begin: > > sk_nulls_for_each_rcu(sk, node, &head->chain) { > > + if (unlikely(!node)) > > + goto begin; > > if (sk->sk_hash != hash) > > continue; > > if (likely(INET_MATCH(sk, net, acookie, > > ------------------------------------------------------------------------ > > > > It depends on implementation details but I believe it would work. It > > would be nicer if we could detect the switch to a listening socket but > > I don't see how to make such test race free without introducing > > unacceptable performance penalty. > > No, we do not want to add more checks in the fast path really. > > I was more thinking about not breaking the RCU invariants. > > (ie : adding back the nulls stuff that I removed in 3b24d854cb35 > ("tcp/dccp: do not touch > listener sk_refcnt under synflood") Yes, that would do the trick. It would add some cycles to listener lookup but that is less harm than slowing down established socket lookup. Michal ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: possible race in __inet_lookup_established() 2019-11-20 21:13 ` Michal Kubecek @ 2019-11-20 21:54 ` Eric Dumazet 0 siblings, 0 replies; 9+ messages in thread From: Eric Dumazet @ 2019-11-20 21:54 UTC (permalink / raw) To: Michal Kubecek; +Cc: Eric Dumazet, netdev, Firo Yang On Wed, Nov 20, 2019 at 1:13 PM Michal Kubecek <mkubecek@suse.cz> wrote: > > On Wed, Nov 20, 2019 at 12:57:48PM -0800, Eric Dumazet wrote: > > On Wed, Nov 20, 2019 at 12:49 PM Michal Kubecek <mkubecek@suse.cz> wrote: > > > Firo suggested something like > > > > > > ------------------------------------------------------------------------ > > > --- a/net/ipv4/inet_hashtables.c > > > +++ b/net/ipv4/inet_hashtables.c > > > @@ -362,6 +362,8 @@ struct sock *__inet_lookup_established(struct net *net, > > > > > > begin: > > > sk_nulls_for_each_rcu(sk, node, &head->chain) { > > > + if (unlikely(!node)) > > > + goto begin; > > > if (sk->sk_hash != hash) > > > continue; > > > if (likely(INET_MATCH(sk, net, acookie, > > > ------------------------------------------------------------------------ > > > > > > It depends on implementation details but I believe it would work. It > > > would be nicer if we could detect the switch to a listening socket but > > > I don't see how to make such test race free without introducing > > > unacceptable performance penalty. > > > > No, we do not want to add more checks in the fast path really. > > > > I was more thinking about not breaking the RCU invariants. > > > > (ie : adding back the nulls stuff that I removed in 3b24d854cb35 > > ("tcp/dccp: do not touch > > listener sk_refcnt under synflood") > > Yes, that would do the trick. It would add some cycles to listener > lookup but that is less harm than slowing down established socket > lookup. > It should not change cycles spent in listener lookup. Only the test to check for the iteration end will not use NULL, that's about it. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2019-11-20 21:54 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-11-20 8:39 possible race in __inet_lookup_established() Michal Kubecek [not found] ` <CANn89iJYXh7AwK8_Aiz3wXqugG0icPNW6OPsPxwOvpH90kr+Ew@mail.gmail.com> 2019-11-20 18:10 ` Michal Kubecek 2019-11-20 19:13 ` Eric Dumazet 2019-11-20 19:52 ` Michal Kubecek 2019-11-20 20:04 ` Eric Dumazet 2019-11-20 20:49 ` Michal Kubecek 2019-11-20 20:57 ` Eric Dumazet 2019-11-20 21:13 ` Michal Kubecek 2019-11-20 21:54 ` Eric Dumazet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).