netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] net: fix for a race condition in the inet frag code
@ 2014-03-03 14:05 Nikolay Aleksandrov
  2014-03-03 14:40 ` Florian Westphal
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2014-03-03 14:05 UTC (permalink / raw)
  To: netdev; +Cc: Nikolay Aleksandrov, Jesper Dangaard Brouer, David S. Miller

I stumbled upon this very serious bug while hunting for another one.
It's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
the users of inet_frag_kill/inet_frag_put).
What happens is that after a fragment has been added to the hash chain,
but before it's been added to the lru_list (inet_frag_lru_add), it may get
deleted (either by an expired timer if the system load is high or the
timer sufficiently low, or by the frag_queue function for various
reasons). After it then gets added to the lru_list, it's only a matter of
time before the evictor reaches a piece of memory that has already been
freed, leading to a number of different bugs depending on what's left
there. I've been able to trigger this on both IPv4 and IPv6 (which is
expected, as the frag code is shared), but it's much more difficult to
trigger on IPv4 due to the protocol differences in how fragments are
treated. The setup I used to reproduce this is:
2 machines with 4 x 10G NICs bonded in a round-robin bond, so the same
flow can be seen on multiple cards at the same time. Then I used multiple
instances of ping/ping6 to generate fragmented packets and flood the
machines with them while running other processes to load the attacked
machine.
*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes; if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.

The fix is simple: just move the lru_add under the hash-chain locked
region, so when a removing function is called it will have to wait for the
fragment to be added to the lru_list before removing it (this works
because the hash chain removal is done before the lru_list removal, and
there's no window between the two list adds in which the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.

Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of rwlock")

CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
I'm new to this code, so I'm not sure if this is the best approach to fix
the issue and am open to other suggestions. Since I consider the issue
quite serious (remotely triggerable), I'll be closely monitoring this thread
to get it fixed asap.

 net/ipv4/inet_fragment.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index bb075fc9a14f..322dcebfc588 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 
 	atomic_inc(&qp->refcnt);
 	hlist_add_head(&qp->list, &hb->chain);
+	inet_frag_lru_add(nf, qp);
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
-	inet_frag_lru_add(nf, qp);
+
 	return qp;
 }
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 14:05 [PATCH] net: fix for a race condition in the inet frag code Nikolay Aleksandrov
@ 2014-03-03 14:40 ` Florian Westphal
  2014-03-03 14:43   ` Nikolay Aleksandrov
  2014-03-03 14:49   ` Nikolay Aleksandrov
  2014-03-03 21:34 ` David Miller
  2014-03-03 22:19 ` [PATCH v2] " Nikolay Aleksandrov
  2 siblings, 2 replies; 13+ messages in thread
From: Florian Westphal @ 2014-03-03 14:40 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, Jesper Dangaard Brouer, David S. Miller

Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> I stumbled upon this very serious bug while hunting for another one,
> it's a very subtle race condition between inet_frag_evictor,
> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> the users of inet_frag_kill/inet_frag_put).
> What happens is that after a fragment has been added to the hash chain but
> before it's been added to the lru_list (inet_frag_lru_add), it may get
> deleted (either by an expired timer if the system load is high or the
> timer sufficiently low, or by the frag_queue function for different
> reasons) before it's added to the lru_list

Sorry.  Not following here, see below.

> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index bb075fc9a14f..322dcebfc588 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>  
>  	atomic_inc(&qp->refcnt);
>  	hlist_add_head(&qp->list, &hb->chain);
> +	inet_frag_lru_add(nf, qp);
>  	spin_unlock(&hb->chain_lock);
>  	read_unlock(&f->lock);

If I understand correctly you're saying that qp can be freed on
another CPU (by the timer) right after dropping the locks.  But how is
that possible?

->refcnt is bumped above when arming the timer (before dropping chain
lock), so even if the frag_expire timer fires instantly it should not
free qp.

What am I missing?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 14:40 ` Florian Westphal
@ 2014-03-03 14:43   ` Nikolay Aleksandrov
  2014-03-03 17:13     ` Jesper Dangaard Brouer
  2014-03-03 14:49   ` Nikolay Aleksandrov
  1 sibling, 1 reply; 13+ messages in thread
From: Nikolay Aleksandrov @ 2014-03-03 14:43 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Jesper Dangaard Brouer, David S. Miller

On 03/03/2014 03:40 PM, Florian Westphal wrote:
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>> I stumbled upon this very serious bug while hunting for another one,
>> it's a very subtle race condition between inet_frag_evictor,
>> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
>> the users of inet_frag_kill/inet_frag_put).
>> What happens is that after a fragment has been added to the hash chain but
>> before it's been added to the lru_list (inet_frag_lru_add), it may get
>> deleted (either by an expired timer if the system load is high or the
>> timer sufficiently low, or by the frag_queue function for different
>> reasons) before it's added to the lru_list
> 
> Sorry.  Not following here, see below.
> 
>> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
>> index bb075fc9a14f..322dcebfc588 100644
>> --- a/net/ipv4/inet_fragment.c
>> +++ b/net/ipv4/inet_fragment.c
>> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>>  
>>  	atomic_inc(&qp->refcnt);
>>  	hlist_add_head(&qp->list, &hb->chain);
>> +	inet_frag_lru_add(nf, qp);
>>  	spin_unlock(&hb->chain_lock);
>>  	read_unlock(&f->lock);
> 
> If I understand correctly you're saying that qp can be free'd on
> another/cpu timer right after dropping the locks.  But how is it
> possible?
> 
> ->refcnt is bumped above when arming the timer (before dropping chain
> lock), so even if the frag_expire timer fires instantly it should not
> free qp.
> 
> What am I missing?
> 
> Thanks,
> Florian
> 
When inet_frag_kill is called from the IPv4/6 frag_queue function it will
remove the timer's refcount, and the inet_frag_put that follows will then
drop the refcount to 0 and free the frag. All of this can happen before the
frag was ever added to the LRU list; then it gets added. This happens much
more easily on IPv6 because its frag_queue function drops overlapping
fragments. The point is that the timer's refcount gets removed one way or
another: either by the timer itself (there's an inet_frag_put at its end),
or, much more easily, by the frag_queue function.
I think I explained it badly before, I hope this makes it clearer :-)

Cheers,
Nik

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 14:40 ` Florian Westphal
  2014-03-03 14:43   ` Nikolay Aleksandrov
@ 2014-03-03 14:49   ` Nikolay Aleksandrov
  2014-03-03 15:31     ` Jesper Dangaard Brouer
  2014-03-03 17:17     ` Florian Westphal
  1 sibling, 2 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2014-03-03 14:49 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Jesper Dangaard Brouer, David S. Miller

On 03/03/2014 03:40 PM, Florian Westphal wrote:
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>> I stumbled upon this very serious bug while hunting for another one,
>> it's a very subtle race condition between inet_frag_evictor,
>> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
>> the users of inet_frag_kill/inet_frag_put).
>> What happens is that after a fragment has been added to the hash chain but
>> before it's been added to the lru_list (inet_frag_lru_add), it may get
>> deleted (either by an expired timer if the system load is high or the
>> timer sufficiently low, or by the frag_queue function for different
>> reasons) before it's added to the lru_list
> 
> Sorry.  Not following here, see below.
> 
>> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
>> index bb075fc9a14f..322dcebfc588 100644
>> --- a/net/ipv4/inet_fragment.c
>> +++ b/net/ipv4/inet_fragment.c
>> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>>  
>>  	atomic_inc(&qp->refcnt);
>>  	hlist_add_head(&qp->list, &hb->chain);
>> +	inet_frag_lru_add(nf, qp);
>>  	spin_unlock(&hb->chain_lock);
>>  	read_unlock(&f->lock);
> 
> If I understand correctly you're saying that qp can be free'd on
> another/cpu timer right after dropping the locks.  But how is it
> possible?
> 
> ->refcnt is bumped above when arming the timer (before dropping chain
> lock), so even if the frag_expire timer fires instantly it should not
> free qp.
> 
> What am I missing?
> 
> Thanks,
> Florian
> 
An important point is that inet_frag_kill removes both the timer's refcnt
and, via an unconditional atomic_dec, the original/guarding refcnt, so it
basically removes everything that stands in the way.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 14:49   ` Nikolay Aleksandrov
@ 2014-03-03 15:31     ` Jesper Dangaard Brouer
  2014-03-03 15:34       ` Nikolay Aleksandrov
  2014-03-03 17:17     ` Florian Westphal
  1 sibling, 1 reply; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2014-03-03 15:31 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: Florian Westphal, netdev, David S. Miller, brouer

On Mon, 03 Mar 2014 15:49:56 +0100
Nikolay Aleksandrov <nikolay@redhat.com> wrote:

> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> >> I stumbled upon this very serious bug while hunting for another one,
> >> it's a very subtle race condition between inet_frag_evictor,
> >> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> >> the users of inet_frag_kill/inet_frag_put).
> >> What happens is that after a fragment has been added to the hash chain but
> >> before it's been added to the lru_list (inet_frag_lru_add), it may get
> >> deleted (either by an expired timer if the system load is high or the
> >> timer sufficiently low, or by the frag_queue function for different
> >> reasons) before it's added to the lru_list
> > 
> > Sorry.  Not following here, see below.
> > 
> >> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> >> index bb075fc9a14f..322dcebfc588 100644
> >> --- a/net/ipv4/inet_fragment.c
> >> +++ b/net/ipv4/inet_fragment.c
> >> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
> >>  
> >>  	atomic_inc(&qp->refcnt);
> >>  	hlist_add_head(&qp->list, &hb->chain);
> >> +	inet_frag_lru_add(nf, qp);
> >>  	spin_unlock(&hb->chain_lock);
> >>  	read_unlock(&f->lock);
> > 
> > If I understand correctly you're saying that qp can be free'd on
> > another/cpu timer right after dropping the locks.  But how is it
> > possible?
> > 
> > ->refcnt is bumped above when arming the timer (before dropping chain
> > lock), so even if the frag_expire timer fires instantly it should not
> > free qp.
> > 
> > What am I missing?
> > 
> > Thanks,
> > Florian
> > 
> An important point is that inet_frag_kill removes both the timer's refcnt and
> has an unconditional atomic_dec to remove the original/guarding refcnt, so it
> basically removes everything that's in the way.
 
It sounds like we might have a refcnt problem...
Do we need an extra refcnt for maintaining elements on the LRU list?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 15:31     ` Jesper Dangaard Brouer
@ 2014-03-03 15:34       ` Nikolay Aleksandrov
  0 siblings, 0 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2014-03-03 15:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Florian Westphal, netdev, David S. Miller

On 03/03/2014 04:31 PM, Jesper Dangaard Brouer wrote:
> On Mon, 03 Mar 2014 15:49:56 +0100
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> 
>> On 03/03/2014 03:40 PM, Florian Westphal wrote:
>>> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>>> I stumbled upon this very serious bug while hunting for another one,
>>>> it's a very subtle race condition between inet_frag_evictor,
>>>> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
>>>> the users of inet_frag_kill/inet_frag_put).
>>>> What happens is that after a fragment has been added to the hash chain but
>>>> before it's been added to the lru_list (inet_frag_lru_add), it may get
>>>> deleted (either by an expired timer if the system load is high or the
>>>> timer sufficiently low, or by the frag_queue function for different
>>>> reasons) before it's added to the lru_list
>>>
>>> Sorry.  Not following here, see below.
>>>
>>>> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
>>>> index bb075fc9a14f..322dcebfc588 100644
>>>> --- a/net/ipv4/inet_fragment.c
>>>> +++ b/net/ipv4/inet_fragment.c
>>>> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>>>>  
>>>>  	atomic_inc(&qp->refcnt);
>>>>  	hlist_add_head(&qp->list, &hb->chain);
>>>> +	inet_frag_lru_add(nf, qp);
>>>>  	spin_unlock(&hb->chain_lock);
>>>>  	read_unlock(&f->lock);
>>>
>>> If I understand correctly you're saying that qp can be free'd on
>>> another/cpu timer right after dropping the locks.  But how is it
>>> possible?
>>>
>>> ->refcnt is bumped above when arming the timer (before dropping chain
>>> lock), so even if the frag_expire timer fires instantly it should not
>>> free qp.
>>>
>>> What am I missing?
>>>
>>> Thanks,
>>> Florian
>>>
>> An important point is that inet_frag_kill removes both the timer's refcnt and
>> has an unconditional atomic_dec to remove the original/guarding refcnt, so it
>> basically removes everything that's in the way.
>  
> It sounds like we might have a refcnt problem...
> Do we need an extra refcnt for maintaining elements on the LRU list?
> 
I don't think so. If you keep the lru_add separated from the chain addition
as before, you still have a race window in which frags are seen by the
fq_find functions but are not yet on the LRU list, unless you take the
extra refcnt before adding to the lru list while holding the chain lock (or
before that), which would alter the behaviour. And even then, if
inet_frag_kill gets called it should remove that refcnt as well, putting us
in the same position as now if we want to keep the current behaviour.

Nik

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 14:43   ` Nikolay Aleksandrov
@ 2014-03-03 17:13     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2014-03-03 17:13 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: Florian Westphal, netdev, David S. Miller, brouer


On Mon, 03 Mar 2014 15:43:00 +0100
Nikolay Aleksandrov <nikolay@redhat.com> wrote:

> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:

[...]
> >> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> >> index bb075fc9a14f..322dcebfc588 100644
> >> --- a/net/ipv4/inet_fragment.c
> >> +++ b/net/ipv4/inet_fragment.c
> >> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
> >>  
> >>  	atomic_inc(&qp->refcnt);
> >>  	hlist_add_head(&qp->list, &hb->chain);
> >> +	inet_frag_lru_add(nf, qp);
> >>  	spin_unlock(&hb->chain_lock);
> >>  	read_unlock(&f->lock);
> > 
[...]
> > 
> inet_frag_kill when called from the IPv4/6 frag_queue function will remove the
> timer refcount, then inet_frag_put afterwards will drop it to 0 and free it and
> all of this could happen before the frag was ever added to the LRU list, then it
> gets added. This happens much easier for IPv6 because of the dropping of
> overlapping fragments in its frag_queue function, the point is we need to have
> the timer's refcount removed in any way (it could be the timer itself - there's
> an inet_frag_put in the end, or much easier by the frag_queue function).
> I think I've explained it badly, I hope this makes it clearer :-)

I like this desc better.

After some IRC discussions with Nik and Florian, I acknowledge this is a
real race condition.

The real solution is to remove the LRU list system (which will also solve
a scalability problem), but short-term we need Nik's fix, which I guess
should also go to stable.

Thanks Nik!
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 14:49   ` Nikolay Aleksandrov
  2014-03-03 15:31     ` Jesper Dangaard Brouer
@ 2014-03-03 17:17     ` Florian Westphal
  1 sibling, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2014-03-03 17:17 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Florian Westphal, netdev, Jesper Dangaard Brouer, David S. Miller

Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> >> I stumbled upon this very serious bug while hunting for another one,
> >> it's a very subtle race condition between inet_frag_evictor,
> >> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> >> the users of inet_frag_kill/inet_frag_put).
> >> What happens is that after a fragment has been added to the hash chain but
> >> before it's been added to the lru_list (inet_frag_lru_add), it may get
> >> deleted (either by an expired timer if the system load is high or the
> >> timer sufficiently low, or by the frag_queue function for different
> >> reasons) before it's added to the lru_list
> > 
> > Sorry.  Not following here, see below.
> > 
> >> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> >> index bb075fc9a14f..322dcebfc588 100644
> >> --- a/net/ipv4/inet_fragment.c
> >> +++ b/net/ipv4/inet_fragment.c
> >> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
> >>  
> >>  	atomic_inc(&qp->refcnt);
> >>  	hlist_add_head(&qp->list, &hb->chain);
> >> +	inet_frag_lru_add(nf, qp);
> >>  	spin_unlock(&hb->chain_lock);
> >>  	read_unlock(&f->lock);
> > 
> > If I understand correctly you're saying that qp can be free'd on
> > another/cpu timer right after dropping the locks.  But how is it
> > possible?
> > 
> > ->refcnt is bumped above when arming the timer (before dropping chain
> > lock), so even if the frag_expire timer fires instantly it should not
> > free qp.
> > 
> > What am I missing?
> > 
> > Thanks,
> > Florian
> > 
> An important point is that inet_frag_kill removes both the timer's refcnt and
> has an unconditional atomic_dec to remove the original/guarding refcnt, so it
> basically removes everything that's in the way.

You're right.

Problem is that when we return from inet_frag_intern() we can end up
with a qp that is no longer in the hash (inet_frag_kill was invoked)
but has been added to the lru list _after_ inet_frag_kill supposedly
removed it.

The refcnt is not 0 (yet) by the time inet_frag_intern returns,
but it drops to 0 soon after, on the next _put event.

Your fix makes 'in hash table but not on lru list' impossible and
thus avoids the problem.

Thanks for explaining!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] net: fix for a race condition in the inet frag code
  2014-03-03 14:05 [PATCH] net: fix for a race condition in the inet frag code Nikolay Aleksandrov
  2014-03-03 14:40 ` Florian Westphal
@ 2014-03-03 21:34 ` David Miller
  2014-03-03 22:19 ` [PATCH v2] " Nikolay Aleksandrov
  2 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2014-03-03 21:34 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, brouer

From: Nikolay Aleksandrov <nikolay@redhat.com>
Date: Mon,  3 Mar 2014 15:05:20 +0100

> I stumbled upon this very serious bug while hunting for another one,
> it's a very subtle race condition between inet_frag_evictor,
> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> the users of inet_frag_kill/inet_frag_put).
> What happens is that after a fragment has been added to the hash chain but
> before it's been added to the lru_list (inet_frag_lru_add), it may get
> deleted (either by an expired timer if the system load is high or the
> timer sufficiently low, or by the frag_queue function for different
> reasons) before it's added to the lru_list, then after it gets added
> it's a matter of time for the evictor to get to a piece of memory which
> has been freed leading to a number of different bugs depending on what's
> left there. I've been able to trigger this on both IPv4 and IPv6 (which
> is normal as the frag code is the same), but it's been much more
> difficult to trigger on IPv4 due to the protocol differences about how
> fragments are treated. The setup I used to reproduce this is:
> 2 machines with 4 x 10G bonded in a RR bond, so the same flow can be
> seen on multiple cards at the same time. Then I used multiple instances
> of ping/ping6 to generate fragmented packets and flood the machines with
> them while running other processes to load the attacked machine.
> *It is very important to have the _same flow_ coming in on multiple CPUs
> concurrently. Usually the attacked machine would die in less than 30
> minutes, if configured properly to have many evictor calls and timeouts
> it could happen in 10 minutes or so.
> 
> The fix is simple, just move the lru_add under the hash chain locked
> region so when a removing function is called it'll have to wait for the
> fragment to be added to the lru_list, and then it'll remove it (it works
> because the hash chain removal is done before the lru_list one and
> there's no window between the two list adds when the frag can get
> dropped). With this fix applied I couldn't kill the same machine in 24
> hours with the same setup.
> 
> Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
> rwlock")
> 
> CC: Jesper Dangaard Brouer <brouer@redhat.com>
> CC: David S. Miller <davem@davemloft.net>
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>

Nik, please beef up the commit message a bit and resubmit.

Some of your replies explained the situation a bit better.  Don't
be afraid of making the commit message too long or too verbose :-)

Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] net: fix for a race condition in the inet frag code
  2014-03-03 14:05 [PATCH] net: fix for a race condition in the inet frag code Nikolay Aleksandrov
  2014-03-03 14:40 ` Florian Westphal
  2014-03-03 21:34 ` David Miller
@ 2014-03-03 22:19 ` Nikolay Aleksandrov
  2014-03-03 22:21   ` Florian Westphal
                     ` (2 more replies)
  2 siblings, 3 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2014-03-03 22:19 UTC (permalink / raw)
  To: netdev
  Cc: Nikolay Aleksandrov, Florian Westphal, Jesper Dangaard Brouer,
	David S. Miller

I stumbled upon this very serious bug while hunting for another one.
It's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
the users of inet_frag_kill/inet_frag_put).
What happens is that after a fragment has been added to the hash chain in
inet_frag_intern, but before it's been added to the lru_list
(inet_frag_lru_add), it may get deleted (either by an expired timer if the
system load is high or the timer sufficiently low, or by the frag_queue
function for various reasons). After it then gets added to the lru_list,
it's only a matter of time before the evictor reaches a piece of memory
that has already been freed, leading to a number of different bugs
depending on what's left there. I've been able to trigger this on both
IPv4 and IPv6 (which is expected, as the frag code is shared), but it's
much more difficult to trigger on IPv4 due to the protocol differences in
how fragments are treated. The setup I used to reproduce this is:
2 machines with 4 x 10G NICs bonded in a round-robin bond, so the same
flow can be seen on multiple cards at the same time. Then I used multiple
instances of ping/ping6 to generate fragmented packets and flood the
machines with them while running other processes to load the attacked
machine.
*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes; if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.
An important point is that any caller of inet_frag_kill (frag_queue or the
timer) will remove both the timer refcount and the original/guarding
refcount, thus removing everything that keeps the frag from being freed
at the next inet_frag_put.
All of this can happen before the frag was ever added to the LRU list;
then it gets added and the evictor uses a freed fragment.
An example for IPv6: a fragment is being added and has just been inserted
into the hash, the hash lock has been released, but inet_frag_lru_add has
not yet executed (or cannot yet obtain the lru lock). Meanwhile an
overlapping fragment for the same flow arrives on a different CPU, which
finds the first fragment in the hash; since it's overlapping, it drops it,
invoking inet_frag_kill and thus removing all guarding refcounts, and
afterwards frees it by invoking inet_frag_put, which removes the last
refcount added previously by inet_frag_find. Only then does
inet_frag_lru_add get executed by inet_frag_intern, and we have a freed
fragment in the lru_list.

The fix is simple: just move the lru_add under the hash-chain locked
region, so when a removing function is called it will have to wait for the
fragment to be added to the lru_list before removing it (this works
because the hash chain removal is done before the lru_list removal, and
there's no window between the two list adds in which the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.

Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of rwlock")

CC: Florian Westphal <fw@strlen.de>
CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
v2: Add more information on the flow leading to the race and an example.
I hope this is better :-)

 net/ipv4/inet_fragment.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index bb075fc9a14f..322dcebfc588 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 
 	atomic_inc(&qp->refcnt);
 	hlist_add_head(&qp->list, &hb->chain);
+	inet_frag_lru_add(nf, qp);
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
-	inet_frag_lru_add(nf, qp);
+
 	return qp;
 }
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] net: fix for a race condition in the inet frag code
  2014-03-03 22:19 ` [PATCH v2] " Nikolay Aleksandrov
@ 2014-03-03 22:21   ` Florian Westphal
  2014-03-04  7:28   ` Jesper Dangaard Brouer
  2014-03-06  1:34   ` David Miller
  2 siblings, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2014-03-03 22:21 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, Florian Westphal, Jesper Dangaard Brouer, David S. Miller

Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> v2: Add more information on the flow leading to the race and an example.
> I hope this is better :-)

It is, thanks Nik!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] net: fix for a race condition in the inet frag code
  2014-03-03 22:19 ` [PATCH v2] " Nikolay Aleksandrov
  2014-03-03 22:21   ` Florian Westphal
@ 2014-03-04  7:28   ` Jesper Dangaard Brouer
  2014-03-06  1:34   ` David Miller
  2 siblings, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2014-03-04  7:28 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, Florian Westphal, David S. Miller, brouer

On Mon,  3 Mar 2014 23:19:18 +0100
Nikolay Aleksandrov <nikolay@redhat.com> wrote:

> Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
> rwlock")
> 
> CC: Florian Westphal <fw@strlen.de>
> CC: Jesper Dangaard Brouer <brouer@redhat.com>
> CC: David S. Miller <davem@davemloft.net>
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
> ---
> v2: Add more information on the flow leading to the race and an example.
> I hope this is better :-)

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

Good work, thanks a lot for finding this race bug!
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] net: fix for a race condition in the inet frag code
  2014-03-03 22:19 ` [PATCH v2] " Nikolay Aleksandrov
  2014-03-03 22:21   ` Florian Westphal
  2014-03-04  7:28   ` Jesper Dangaard Brouer
@ 2014-03-06  1:34   ` David Miller
  2 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2014-03-06  1:34 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, fw, brouer

From: Nikolay Aleksandrov <nikolay@redhat.com>
Date: Mon,  3 Mar 2014 23:19:18 +0100

> The fix is simple, just move the lru_add under the hash chain locked
> region so when a removing function is called it'll have to wait for the
> fragment to be added to the lru_list, and then it'll remove it (it works
> because the hash chain removal is done before the lru_list one and
> there's no window between the two list adds when the frag can get
> dropped). With this fix applied I couldn't kill the same machine in 24
> hours with the same setup.
> 
> Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
> rwlock")

Applied and queued up for -stable, thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-03-06  1:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-03 14:05 [PATCH] net: fix for a race condition in the inet frag code Nikolay Aleksandrov
2014-03-03 14:40 ` Florian Westphal
2014-03-03 14:43   ` Nikolay Aleksandrov
2014-03-03 17:13     ` Jesper Dangaard Brouer
2014-03-03 14:49   ` Nikolay Aleksandrov
2014-03-03 15:31     ` Jesper Dangaard Brouer
2014-03-03 15:34       ` Nikolay Aleksandrov
2014-03-03 17:17     ` Florian Westphal
2014-03-03 21:34 ` David Miller
2014-03-03 22:19 ` [PATCH v2] " Nikolay Aleksandrov
2014-03-03 22:21   ` Florian Westphal
2014-03-04  7:28   ` Jesper Dangaard Brouer
2014-03-06  1:34   ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).