* SKB paged fragment lifecycle on receive
@ 2011-06-24 15:43 ` Ian Campbell
  0 siblings, 0 replies; 21+ messages in thread
From: Ian Campbell @ 2011-06-24 15:43 UTC (permalink / raw)
  To: netdev, xen-devel; +Cc: Jeremy Fitzhardinge, Rusty Russell

When I was preparing Xen's netback driver for upstream one of the things
I removed was the zero-copy guest transmit (i.e. netback receive)
support.

In this mode guest data pages ("foreign pages") were mapped into the
backend domain (using Xen grant-table functionality) and placed into the
skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the
right term). Once the page is finished with, netback unmaps it in order
to return it to the guest (we really want to avoid returning such pages
to the general allocation pool!). Unfortunately "page is finished with"
is an event which there is no way for the driver to see[0] and therefore
I replaced the grant-mapping with a grant-copy for upstreaming which has
performance and scalability implications (since the copy happens in, and
therefore is accounted to, the backend domain instead of the frontend
domain).

The root of the problem here is that the network stack manipulates the
paged frags using bare get/put_page and therefore has no visibility into
when a page's reference count drops to zero, so there is no way to
provide an interface for netback to know when it has to tear down the
grant map.
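
For illustration, the put which matters is buried in the generic release
path, roughly like this (paraphrased from memory of skb_release_data(),
so the details may differ slightly):

        /* paraphrased sketch of the frag release in skb_release_data() */
        if (skb_shinfo(skb)->nr_frags) {
                int i;

                for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
                        put_page(skb_shinfo(skb)->frags[i].page);
        }

That put_page() is the last the stack ever sees of the page; there is
nothing there for a driver like netback to hook.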

I think this has implications for users other than Xen as well. For
instance I have previously observed an issue where NFS can transmit
bogus data onto the wire due to ACKs which arrive late and cross over
with the queuing of a retransmit on the client side (see
http://marc.info/?l=linux-nfs&m=122424132729720&w=2 which mainly
discusses RPC protocol level retransmit but I subsequently saw similar
issues due to TCP retransmission too). The issue here is that an ACK
from the server which is delayed in the network (but not dropped) can
arrive after a retransmission has been queued. The arrival of this ACK
causes the NFS client to complete the write back to userspace but the
same page is still referenced from the retransmitted skb. Therefore if
userspace reuses the write buffer quickly enough then incorrect data can
go out in the retransmission. Ideally NFS (and I suspect any network
filesystem or block device, e.g. iSCSI, could suffer from this sort of
issue) would be able to wait to complete the write until the buffer was
actually completely finished with.

Someone also suggested that Infiniband might have an interest in
this sort of thing, although I must admit I don't know enough about IB
to imagine why (perhaps it's just the same as the NFS/iSCSI cases).

We've previously looked into solutions using the skb destructor callback
but that falls over if the skb is cloned since you also need to know
when the clone is destroyed. Jeremy Fitzhardinge and I subsequently
looked at the possibility of a no-clone skb flag (i.e. always forcing a
copy instead of a clone) but IIRC honouring it universally turned into a
very twisty maze with a number of nasty corner cases etc. It also seemed
that the proportion of SKBs which get cloned at least once appeared as
if it could be quite high which would presumably make the performance
impact unacceptable when using the flag. Another issue with using the
skb destructor is that functions such as __pskb_pull_tail will eat (and
free) pages from the start of the frag array such that by the time the
skb destructor is called they are no longer there.
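
For reference, the relevant part of __pskb_pull_tail() looks roughly
like this (paraphrased from memory, not a verbatim copy):

        /* frags wholly consumed by the pull are dropped with a bare
         * put_page() and shuffled out of the frag array, so a later
         * skb destructor never gets to see them */
        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
                if (skb_shinfo(skb)->frags[i].size <= eat) {
                        put_page(skb_shinfo(skb)->frags[i].page);
                        eat -= skb_shinfo(skb)->frags[i].size;
                } else {
                        skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
                        if (eat) {
                                skb_shinfo(skb)->frags[k].page_offset += eat;
                                skb_shinfo(skb)->frags[k].size -= eat;
                                eat = 0;
                        }
                        k++;
                }
        }
        skb_shinfo(skb)->nr_frags = k;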

AIUI Rusty Russell had previously looked into a per-page destructor in
the shinfo but found that it couldn't be made to work (I don't remember
why, or if I even knew at the time). Could that be an approach worth
reinvestigating?

I can't really think of any other solution which doesn't involve some
sort of driver callback at the time a page is free()d.

I expect that wrapping the uses of get/put_page in a network specific
wrapper (e.g. skb_{get,put}_frag(skb, nr)) would be a useful first step
in any solution. That's a pretty big task/patch in itself but could be
done. Might it be worthwhile for its own sake?
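
Something along these lines, purely as a sketch (the names are invented
here, not an existing API):

        /* hypothetical accessors -- one place to hook any future
         * reference tracking or callback */
        static inline void skb_get_frag(struct sk_buff *skb, int nr)
        {
                get_page(skb_shinfo(skb)->frags[nr].page);
        }

        static inline void skb_put_frag(struct sk_buff *skb, int nr)
        {
                put_page(skb_shinfo(skb)->frags[nr].page);
        }

Once every open-coded get/put_page on a frag goes through helpers like
these there is a single choke point to extend.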

Does anyone have any ideas or advice for other approaches I could try
(either on the driver or stack side)?

FWIW I proposed a session on the subject for LPC this year. The proposal
was for the virtualisation track although as I say I think the class of
problem reaches a bit wider than that. Whether the session will be a
discussion around ways of solving the issue or a presentation on the
solution remains to be seen ;-)

Ian.

[0] at least with a mainline kernel, in the older out-of-tree Xen stuff
we had a PageForeign page-flag and a destructor function in a spare
struct page field which was called from the mm free routines
(free_pages_prepare and free_hot_cold_page). I'm under no illusions
about the upstreamability of this approach...
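
For the curious it looked very roughly like the following, called from
free_pages_prepare() and free_hot_cold_page() (reconstructed from
memory, with the names and the choice of spare struct page field being
illustrative rather than exact):

        /* none of this exists in mainline */
        if (unlikely(PageForeign(page))) {
                /* hand the page back to its owner (e.g. netback)
                 * instead of to the page allocator */
                PageForeignDestructor(page)(page);
                return;
        }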


^ permalink raw reply	[flat|nested] 21+ messages in thread


* Re: SKB paged fragment lifecycle on receive
  2011-06-24 15:43 ` Ian Campbell
  (?)
@ 2011-06-24 17:29 ` Jeremy Fitzhardinge
  2011-06-24 17:56   ` Eric Dumazet
  2011-06-24 22:44   ` Ian Campbell
  -1 siblings, 2 replies; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-06-24 17:29 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, xen-devel, Rusty Russell

On 06/24/2011 08:43 AM, Ian Campbell wrote:
> We've previously looked into solutions using the skb destructor callback
> but that falls over if the skb is cloned since you also need to know
> when the clone is destroyed. Jeremy Fitzhardinge and I subsequently
> looked at the possibility of a no-clone skb flag (i.e. always forcing a
> copy instead of a clone) but IIRC honouring it universally turned into a
> very twisty maze with a number of nasty corner cases etc. It also seemed
> that the proportion of SKBs which get cloned at least once appeared as
> if it could be quite high which would presumably make the performance
> impact unacceptable when using the flag. Another issue with using the
> skb destructor is that functions such as __pskb_pull_tail will eat (and
> free) pages from the start of the frag array such that by the time the
> skb destructor is called they are no longer there.
>
> AIUI Rusty Russell had previously looked into a per-page destructor in
> the shinfo but found that it couldn't be made to work (I don't remember
> why, or if I even knew at the time). Could that be an approach worth
> reinvestigating?
>
> I can't really think of any other solution which doesn't involve some
> sort of driver callback at the time a page is free()d.

One simple approach would be to make sure that we retain a page
reference on any granted pages so that the network stack's put_page
calls will never result in them being released back to the kernel.  We
can also install an skb destructor.  If it sees a page being released
with a refcount of 1, then we know it's our own reference and can free
the page immediately.  If the refcount is > 1 then we can add it to a
queue of pending pages, which can be periodically polled to free pages
whose other references have been dropped.
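
Roughly, and with hypothetical helper names (this is just a sketch, not
tested code):

        static void netback_skb_destructor(struct sk_buff *skb)
        {
                int i;

                for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
                        struct page *page = skb_shinfo(skb)->frags[i].page;

                        if (page_count(page) == 1)
                                netback_release_foreign(page); /* only our ref left */
                        else
                                netback_queue_pending(page);   /* poll for it later */
                }
        }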

However, the question is how large will this queue get?  If it remains
small then this scheme could be entirely practical.  But if almost every
page ends up having transient stray references, it could become very
awkward.

So it comes down to "how useful is an skb destructor callback as a
heuristic for page free"?

That said, I think an event-based rather than polling based mechanism
would be much more preferable.

> I expect that wrapping the uses of get/put_page in a network specific
> wrapper (e.g. skb_{get,frag}_frag(skb, nr) would be a useful first step
> in any solution. That's a pretty big task/patch in itself but could be
> done. Might it be worthwhile in for its own sake?

Is there some way to do it so that you'd get compiler warnings/errors in
missed cases?  I guess wrapping "struct page" in some other type would
go some way to helping.

> Does anyone have any ideas or advice for other approaches I could try
> (either on the driver or stack side)?
>
> FWIW I proposed a session on the subject for LPC this year. The proposal
> was for the virtualisation track although as I say I think the class of
> problem reaches a bit wider than that. Whether the session will be a
> discussion around ways of solving the issue or a presentation on the
> solution remains to be seen ;-)
>
> Ian.
>
> [0] at least with a mainline kernel, in the older out-of-tree Xen stuff
> we had a PageForeign page-flag and a destructor function in a spare
> struct page field which was called from the mm free routines
> (free_pages_prepare and free_hot_cold_page). I'm under no illusions
> about the upstreamability of this approach...

When I last asked AKPM about this - a long time ago - the problem was
that we'd simply run out of page flags (at least on 32-bit x86), so it
simply wasn't implementable.  But since then the page flags have been
rearranged and I think there's less pressure on them - but they're still
a valuable resource, so the justification would need to be strong (ie,
multiple convincing users).

    J

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 17:29 ` Jeremy Fitzhardinge
@ 2011-06-24 17:56   ` Eric Dumazet
  2011-06-24 18:21     ` Jeremy Fitzhardinge
  2011-06-27 14:42     ` Ian Campbell
  2011-06-24 22:44   ` Ian Campbell
  1 sibling, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2011-06-24 17:56 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Ian Campbell, netdev, xen-devel, Rusty Russell

On Friday 24 June 2011 at 10:29 -0700, Jeremy Fitzhardinge wrote:
> On 06/24/2011 08:43 AM, Ian Campbell wrote:
> > We've previously looked into solutions using the skb destructor callback
> > but that falls over if the skb is cloned since you also need to know
> > when the clone is destroyed. Jeremy Fitzhardinge and I subsequently
> > looked at the possibility of a no-clone skb flag (i.e. always forcing a
> > copy instead of a clone) but IIRC honouring it universally turned into a
> > very twisty maze with a number of nasty corner cases etc. It also seemed
> > that the proportion of SKBs which get cloned at least once appeared as
> > if it could be quite high which would presumably make the performance
> > impact unacceptable when using the flag. Another issue with using the
> > skb destructor is that functions such as __pskb_pull_tail will eat (and
> > free) pages from the start of the frag array such that by the time the
> > skb destructor is called they are no longer there.
> >
> > AIUI Rusty Russell had previously looked into a per-page destructor in
> > the shinfo but found that it couldn't be made to work (I don't remember
> > why, or if I even knew at the time). Could that be an approach worth
> > reinvestigating?
> >
> > I can't really think of any other solution which doesn't involve some
> > sort of driver callback at the time a page is free()d.
> 

This reminds me of the packet mmap (tx path) games we play with pages.

net/packet/af_packet.c : tpacket_destruct_skb(), poking
TP_STATUS_AVAILABLE back to userspace to tell it that it can reuse the
space...
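
For reference, that destructor is roughly the following (paraphrased
from memory, see the real code for the details):

        static void tpacket_destruct_skb(struct sk_buff *skb)
        {
                struct packet_sock *po = pkt_sk(skb->sk);

                if (likely(po->tx_ring.pg_vec)) {
                        void *ph = skb_shinfo(skb)->destructor_arg;

                        atomic_dec(&po->tx_ring.pending);
                        /* tell userspace it may reuse this tx ring slot */
                        __packet_set_status(po, ph, TP_STATUS_AVAILABLE);
                }

                sock_wfree(skb);
        }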

> One simple approach would be to simply make sure that we retain a page
> reference on any granted pages so that the network stack's put pages
> will never result in them being released back to the kernel.  We can
> also install an skb destructor.  If it sees a page being released with a
> refcount of 1, then we know its our own reference and can free the page
> immediately.  If the refcount is > 1 then we can add it to a queue of
> pending pages, which can be periodically polled to free pages whose
> other references have been dropped.
> 
> However, the question is how large will this queue get?  If it remains
> small then this scheme could be entirely practical.  But if almost every
> page ends up having transient stray references, it could become very
> awkward.
> 
> So it comes down to "how useful is an skb destructor callback as a
> heuristic for page free"?
> 

Dangerous, I would say. You could have a page from skb1 transferred to
another skb, skb2, and skb1's destructor called way before the page is
released.

The TCP stack could do that in tcp_collapse() [ it currently doesn't
play with pages ]




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 17:56   ` Eric Dumazet
@ 2011-06-24 18:21     ` Jeremy Fitzhardinge
  2011-06-24 19:46       ` David Miller
  2011-06-27 14:42     ` Ian Campbell
  1 sibling, 1 reply; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-06-24 18:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Rusty Russell, xen-devel, Ian Campbell

On 06/24/2011 10:56 AM, Eric Dumazet wrote:
> On Friday 24 June 2011 at 10:29 -0700, Jeremy Fitzhardinge wrote:
>> On 06/24/2011 08:43 AM, Ian Campbell wrote:
>>> We've previously looked into solutions using the skb destructor callback
>>> but that falls over if the skb is cloned since you also need to know
>>> when the clone is destroyed. Jeremy Fitzhardinge and I subsequently
>>> looked at the possibility of a no-clone skb flag (i.e. always forcing a
>>> copy instead of a clone) but IIRC honouring it universally turned into a
>>> very twisty maze with a number of nasty corner cases etc. It also seemed
>>> that the proportion of SKBs which get cloned at least once appeared as
>>> if it could be quite high which would presumably make the performance
>>> impact unacceptable when using the flag. Another issue with using the
>>> skb destructor is that functions such as __pskb_pull_tail will eat (and
>>> free) pages from the start of the frag array such that by the time the
>>> skb destructor is called they are no longer there.
>>>
>>> AIUI Rusty Russell had previously looked into a per-page destructor in
>>> the shinfo but found that it couldn't be made to work (I don't remember
>>> why, or if I even knew at the time). Could that be an approach worth
>>> reinvestigating?
>>>
>>> I can't really think of any other solution which doesn't involve some
>>> sort of driver callback at the time a page is free()d.
> This reminds me the packet mmap (tx path) games we play with pages.
>
> net/packet/af_packet.c : tpacket_destruct_skb(), poking
> TP_STATUS_AVAILABLE back to user to tell him he can reuse space...

Yes.  It's similar in the sense that it's a tx from a page which isn't
being handed over entirely to the network stack, but has some other
longer-term lifetime.

>> One simple approach would be to simply make sure that we retain a page
>> reference on any granted pages so that the network stack's put pages
>> will never result in them being released back to the kernel.  We can
>> also install an skb destructor.  If it sees a page being released with a
>> refcount of 1, then we know its our own reference and can free the page
>> immediately.  If the refcount is > 1 then we can add it to a queue of
>> pending pages, which can be periodically polled to free pages whose
>> other references have been dropped.
>>
>> However, the question is how large will this queue get?  If it remains
>> small then this scheme could be entirely practical.  But if almost every
>> page ends up having transient stray references, it could become very
>> awkward.
>>
>> So it comes down to "how useful is an skb destructor callback as a
>> heuristic for page free"?
>>
> Dangerous I would say. You could have a skb1 page transferred to another
> skb2, and call skb1 destructor way before page being released.

Under what circumstances would that happen?

> TCP stack could do that in tcp_collapse() [ it currently doesnt play
> with pages ]

Do you mean "dangerous" in the sense that many pages could end up being
tied up in the pending-release queue?  We'd always check the page
refcount, so it should never release pages prematurely.

Thanks,
    J

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 18:21     ` Jeremy Fitzhardinge
@ 2011-06-24 19:46       ` David Miller
  2011-06-24 20:11         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2011-06-24 19:46 UTC (permalink / raw)
  To: jeremy; +Cc: eric.dumazet, Ian.Campbell, netdev, xen-devel, rusty

From: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Fri, 24 Jun 2011 11:21:15 -0700

>> Dangerous I would say. You could have a skb1 page transferred to another
>> skb2, and call skb1 destructor way before page being released.
> 
> Under what circumstances would that happen?

Pages get transferred between different SKBs all the time.

For example, GRO makes extensive use of this technique.
See net/core/skbuff.c:skb_gro_receive().

It is just one example.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 19:46       ` David Miller
@ 2011-06-24 20:11         ` Jeremy Fitzhardinge
  2011-06-24 20:27           ` David Miller
  2011-06-25 11:58           ` Ian Campbell
  0 siblings, 2 replies; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-06-24 20:11 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, Ian.Campbell, netdev, xen-devel, rusty

On 06/24/2011 12:46 PM, David Miller wrote:
> Pages get transferred between different SKBs all the time.
>
> For example, GRO makes extensive use of this technique.
> See net/core/skbuff.c:skb_gro_receive().
>
> It is just one example.

I see, and the new skb doesn't get a destructor copied from the
original, so there'd be no second callback.

Are the pages still attached to the first skb, or are they transferred
completely?

Thanks,
    J



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 20:11         ` Jeremy Fitzhardinge
@ 2011-06-24 20:27           ` David Miller
  2011-06-25 11:58           ` Ian Campbell
  1 sibling, 0 replies; 21+ messages in thread
From: David Miller @ 2011-06-24 20:27 UTC (permalink / raw)
  To: jeremy; +Cc: eric.dumazet, Ian.Campbell, netdev, xen-devel, rusty

From: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Fri, 24 Jun 2011 13:11:59 -0700

> Are the pages still attached to the first skb, or are they transferred
> completely?

In this case they are transferred completely; the refcount isn't
modified at all.

There are cases where partial transfers occur, and in such cases
the refcount is modified.  Just swim around in net/core/skbuff.c and
you'll find them :-)



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 17:29 ` Jeremy Fitzhardinge
  2011-06-24 17:56   ` Eric Dumazet
@ 2011-06-24 22:44   ` Ian Campbell
  2011-06-24 22:48     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2011-06-24 22:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: netdev, xen-devel, Rusty Russell

On Fri, 2011-06-24 at 18:29 +0100, Jeremy Fitzhardinge wrote:
> On 06/24/2011 08:43 AM, Ian Campbell wrote:
> > We've previously looked into solutions using the skb destructor callback
> > but that falls over if the skb is cloned since you also need to know
> > when the clone is destroyed. Jeremy Fitzhardinge and I subsequently
> > looked at the possibility of a no-clone skb flag (i.e. always forcing a
> > copy instead of a clone) but IIRC honouring it universally turned into a
> > very twisty maze with a number of nasty corner cases etc. It also seemed
> > that the proportion of SKBs which get cloned at least once appeared as
> > if it could be quite high which would presumably make the performance
> > impact unacceptable when using the flag. Another issue with using the
> > skb destructor is that functions such as __pskb_pull_tail will eat (and
> > free) pages from the start of the frag array such that by the time the
> > skb destructor is called they are no longer there.
> >
> > AIUI Rusty Russell had previously looked into a per-page destructor in
> > the shinfo but found that it couldn't be made to work (I don't remember
> > why, or if I even knew at the time). Could that be an approach worth
> > reinvestigating?
> >
> > I can't really think of any other solution which doesn't involve some
> > sort of driver callback at the time a page is free()d.
> 
> One simple approach would be to simply make sure that we retain a page
> reference on any granted pages so that the network stack's put pages
> will never result in them being released back to the kernel.  We can
> also install an skb destructor.  If it sees a page being released with a
> refcount of 1, then we know its our own reference and can free the page
> immediately.  If the refcount is > 1 then we can add it to a queue of
> pending pages, which can be periodically polled to free pages whose
> other references have been dropped.

One problem with this is that some functions (__pskb_pull_tail) drop the
ref count and then remove the page from the skb's fraglist. So by the
time the destructor is called you have lost the page and cannot do the
refcount checking.

I suppose we could keep a queue of _all_ pages we ever put in an SKB
and poll it. We could still check for pages with count==1 in the
destructor -- apart from the other issues, such as the destructor not
being copied over a clone, which I reckon would cause us to fall back
to polling the queue more often than not.

> That said, I think an event-based rather than polling based mechanism
> would be much more preferable.

Absolutely.

> > I expect that wrapping the uses of get/put_page in a network specific
> > wrapper (e.g. skb_{get,frag}_frag(skb, nr) would be a useful first step
> > in any solution. That's a pretty big task/patch in itself but could be
> > done. Might it be worthwhile in for its own sake?
> 
> Is there some way to do it so that you'd get compiler warnings/errors in
> missed cases?  I guess wrap "struct page" in some other type would go
> some way to helping.

I was thinking it could be done by changing the field name (e.g. even
just to _frag), adding the wrapper, fixing up everything grep could
find, then running an allBLAHconfig build, fixing the compile errors
and repeating.

Once the transition is complete we would have the option of putting the
name back -- since it would only mean changing the wrapper. Although I
don't know if we would necessarily want that since otherwise new
open-coded users will likely creep in.

> > Does anyone have any ideas or advice for other approaches I could try
> > (either on the driver or stack side)?
> >
> > FWIW I proposed a session on the subject for LPC this year. The proposal
> > was for the virtualisation track although as I say I think the class of
> > problem reaches a bit wider than that. Whether the session will be a
> > discussion around ways of solving the issue or a presentation on the
> > solution remains to be seen ;-)
> >
> > Ian.
> >
> > [0] at least with a mainline kernel, in the older out-of-tree Xen stuff
> > we had a PageForeign page-flag and a destructor function in a spare
> > struct page field which was called from the mm free routines
> > (free_pages_prepare and free_hot_cold_page). I'm under no illusions
> > about the upstreamability of this approach...
> 
> When I last asked AKPM about this - a long time ago - the problem was
> that we'd simply run out of page flags (at least on 32-bit x86), so it
> simply wasn't implementable.  But since then the page flags have been
> rearranged and I think there's less pressure on them - but they're still
> a valuable resource, so the justification would need to be strong (ie,
> multiple convincing users).
> 
>     J



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 22:44   ` Ian Campbell
@ 2011-06-24 22:48     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-06-24 22:48 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, Rusty Russell, xen-devel

On 06/24/2011 03:44 PM, Ian Campbell wrote:
> One problem with this is that some functions (__pskb_pull_tail) drop the
> ref count and then remove the page from the skb's fraglist. So by the
> time the destructor is called you have lost the page and cannot do the
> refcount checking.
>
> I suppose we could keep a queue of _all_ pages we ever put in an SKB
> which we poll.

Right, that seems like the only way to make sure we don't lose anything.

>  We could still check for pages with count==1 in the
> destructor. Apart from the other issues with the destructor not being
> copied over clone etc which would cause us to fall-back to polling the
> queue more often than not I reckon.

Yeah, sounds like it.

>> That said, I think an event-based rather than polling based mechanism
>> would be much more preferable.
> Absolutely.

From what Eric and David were saying, it seems we don't really have any
other workable options.

    J

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 20:11         ` Jeremy Fitzhardinge
  2011-06-24 20:27           ` David Miller
@ 2011-06-25 11:58           ` Ian Campbell
  2011-06-27 20:51             ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2011-06-25 11:58 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: David Miller, eric.dumazet, netdev, xen-devel, rusty

On Fri, 2011-06-24 at 13:11 -0700, Jeremy Fitzhardinge wrote:
> On 06/24/2011 12:46 PM, David Miller wrote:
> > Pages get transferred between different SKBs all the time.
> >
> > For example, GRO makes extensive use of this technique.
> > See net/core/skbuff.c:skb_gro_receive().
> >
> > It is just one example.
> 
> I see, and the new skb doesn't get a destructor copied from the
> original, so there'd be no second callback.

What about if we were to have a per-shinfo destructor (called once for
each page as its refcount goes 1->0, from whichever skb ends up with the
last ref) as well as the skb destructors? This already handles the
cloning case, but when pages are moved between shinfos would it make
sense for that destructor to be propagated between skbs and/or to
require it to be the same? Since in the case of something like
skb_gro_receive the skbs (and hence the frag array pages) are all from
the same 'owner' (even if the skb is actually created by the stack on
their behalf) I suspect this could work.

But I bet this assumption isn't valid in all cases.

In which case I end up wondering about a destructor per page in the frag
array. At which point we might as well consider it as a part of the core
mm stuff rather than something net specific?
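
To make that concrete, something along these lines (purely a sketch,
names invented):

        struct skb_frag_destructor {
                atomic_t ref;           /* one per referencing shinfo/frag */
                void (*destroy)(struct skb_frag_destructor *d);
        };

        /* added to struct skb_shared_info, or (more invasively) one
         * such pointer per entry in the frag array */
        struct skb_frag_destructor *destructor;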

Ian.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 15:43 ` Ian Campbell
  (?)
  (?)
@ 2011-06-26 10:25 ` Michael S. Tsirkin
  2011-06-27  9:41   ` Ian Campbell
  -1 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2011-06-26 10:25 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, xen-devel, Jeremy Fitzhardinge, Rusty Russell, mashirle

On Fri, Jun 24, 2011 at 04:43:22PM +0100, Ian Campbell wrote:
> In this mode guest data pages ("foreign pages") were mapped into the
> backend domain (using Xen grant-table functionality) and placed into the
> skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the
> right term). Once the page is finished with netback unmaps it in order
> to return it to the guest (we really want to avoid returning such pages
> to the general allocation pool!).

Are the pages writeable by the source guest while netback processes
them?  If yes, firewalling becomes unreliable as the packet can be
modified after it's checked, right?
Also, for guest to guest communication, do you wait for
the destination to stop looking at the packet in order
to return it to the source? If yes, can source guest
networking be disrupted by a slow destination?


> Jeremy Fitzhardinge and I subsequently
> looked at the possibility of a no-clone skb flag (i.e. always forcing a
> copy instead of a clone)

I think this is the approach that the patchset
'macvtap/vhost TX zero-copy support' takes.

> but IIRC honouring it universally turned into a
> very twisty maze with a number of nasty corner cases etc.

Any examples? Are they covered by the patchset above?

> FWIW I proposed a session on the subject for LPC this year.

We also plan to discuss this at KVM Forum 2011
(co-located with LinuxCon 2011):
http://www.linux-kvm.org/page/KVM_Forum_2011

-- 
MST

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-26 10:25 ` Michael S. Tsirkin
@ 2011-06-27  9:41   ` Ian Campbell
  2011-06-27 10:21     ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2011-06-27  9:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jeremy Fitzhardinge, xen-devel, netdev, mashirle, Russell, Rusty

On Sun, 2011-06-26 at 11:25 +0100, Michael S. Tsirkin wrote:
> On Fri, Jun 24, 2011 at 04:43:22PM +0100, Ian Campbell wrote:
> > In this mode guest data pages ("foreign pages") were mapped into the
> > backend domain (using Xen grant-table functionality) and placed into the
> > skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the
> > right term). Once the page is finished with netback unmaps it in order
> > to return it to the guest (we really want to avoid returning such pages
> > to the general allocation pool!).
> 
> Are the pages writeable by the source guest while netback processes
> them?  If yes, firewalling becomes unreliable as the packet can be
> modified after it's checked, right?

We only map the paged frags; the linear area is always copied (enough
to cover a maximally sized TCP/IP header, including options) for this
reason.

> Also, for guest to guest communication, do you wait for
> the destination to stop looking at the packet in order
> to return it to the source? If yes, can source guest
> networking be disrupted by a slow destination?

There is a timeout which ultimately does a copy into dom0 memory and
frees up the domain grant for return to the sending guest.

> > Jeremy Fitzhardinge and I subsequently
> > looked at the possibility of a no-clone skb flag (i.e. always forcing a
> > copy instead of a clone)
> 
> I think this is the approach that the patchset
> 'macvtap/vhost TX zero-copy support' takes.

That's TX from the guest's PoV, the same as I am looking at here,
correct?

I should definitely check this work out, thanks for the pointer. Is V7
(http://marc.info/?l=linux-kernel&m=130661128431312&w=2) the most recent
posting?

I suppose one difference with this is that it deals with data from
"dom0" userspace buffers rather than (what looks like) kernel memory,
although I don't know if that matters yet. Also it hangs off of struct
sock which netback doesn't have. Anyway I'll check it out.

> > but IIRC honouring it universally turned into a
> > very twisty maze with a number of nasty corner cases etc.
> 
> Any examples? Are they covered by the patchset above?

It was quite a while ago so I don't remember many of the specifics.
Jeremy might remember better, but, for example, any broadcast traffic
hitting a bridge (a very interesting case for Xen) seems like a likely
case. pcap was another one which I do remember, but that's obviously
less critical.

I presume with the TX zero-copy support the "copying due to attempted
clone" rate is low?

> > FWIW I proposed a session on the subject for LPC this year.
> We also plan to discuss this on kvm forum 2011
> (colocated with linuxcon 2011).
> http://www.linux-kvm.org/page/KVM_Forum_2011

I had already considered coming to LinuxCon for other reasons but
unfortunately I have family commitments around then :-(

Ian.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-27  9:41   ` Ian Campbell
@ 2011-06-27 10:21     ` Michael S. Tsirkin
  2011-06-27 10:54       ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2011-06-27 10:21 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, Jeremy Fitzhardinge, xen-devel, mashirle, Rusty Russell

On Mon, Jun 27, 2011 at 10:41:35AM +0100, Ian Campbell wrote:
> On Sun, 2011-06-26 at 11:25 +0100, Michael S. Tsirkin wrote:
> > On Fri, Jun 24, 2011 at 04:43:22PM +0100, Ian Campbell wrote:
> > > In this mode guest data pages ("foreign pages") were mapped into the
> > > backend domain (using Xen grant-table functionality) and placed into the
> > > skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the
> > > right term). Once the page is finished with netback unmaps it in order
> > > to return it to the guest (we really want to avoid returning such pages
> > > to the general allocation pool!).
> > 
> > Are the pages writeable by the source guest while netback processes
> > them?  If yes, firewalling becomes unreliable as the packet can be
> > modified after it's checked, right?
> 
> We only map the paged frags, the linear area is always copied (enough to
> cover maximally sized TCP/IP, including options), for this reason.

Hmm. That'll cover the most common scenarios
(such as port filtering) but not deep inspection.
Not sure how important that is.

> > Also, for guest to guest communication, do you wait for
> > the destination to stop looking at the packet in order
> > to return it to the source? If yes, can source guest
> > networking be disrupted by a slow destination?
> 
> There is a timeout which ultimately does a copy into dom0 memory and
> frees up the domain grant for return to the sending guest.

Interesting. How long's the timeout?

> > > Jeremy Fitzhardinge and I subsequently
> > > looked at the possibility of a no-clone skb flag (i.e. always forcing a
> > > copy instead of a clone)
> > 
> > I think this is the approach that the patchset
> > 'macvtap/vhost TX zero-copy support' takes.
> 
> That's TX from the guests PoV, the same as I am looking at here,
> correct?

Right.

> I should definitely check this work out, thanks for the pointer. Is V7
> (http://marc.info/?l=linux-kernel&m=130661128431312&w=2) the most recent
> posting?

I think so.

> I suppose one difference with this is that it deals with data from
> "dom0" userspace buffers rather than (what looks like) kernel memory,
> although I don't know if that matters yet. Also it hangs off of struct
> sock which netback doesn't have. Anyway I'll check it out.

I think the most important detail is the copy on clone approach.
We can make it controlled by an skb flag if necessary.

> > > but IIRC honouring it universally turned into a
> > > very twisty maze with a number of nasty corner cases etc.
> > 
> > Any examples? Are they covered by the patchset above?
> 
> It was quite a while ago so I don't remember many of the specifics.
> Jeremy might remember better but for example any broadcast traffic
> hitting a bridge (a very interesting case for Xen), seems like a likely
> case? pcap was another one which I do remember, but that's obviously
> less critical.

Last I looked I thought these clone the skb, so if a copy happens on
clone things will work correctly?

> I presume with the TX zero-copy support the "copying due to attempted
> clone" rate is low?

Yes. My understanding is that this version targets a non-bridged setup
(guest connected to a macvlan on a physical dev) as the first step.

> > > FWIW I proposed a session on the subject for LPC this year.
> > We also plan to discuss this on kvm forum 2011
> > (colocated with linuxcon 2011).
> > http://www.linux-kvm.org/page/KVM_Forum_2011
> 
> I had already considered coming to LinuxCon for other reasons but
> unfortunately I have family commitments around then :-(
> 
> Ian.

And I'm not coming to LPC this year :(

-- 
MST

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-27 10:21     ` Michael S. Tsirkin
@ 2011-06-27 10:54       ` Ian Campbell
  2011-06-27 11:19         ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2011-06-27 10:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jeremy Fitzhardinge, xen-devel, netdev, mashirle, Russell, Rusty

On Mon, 2011-06-27 at 11:21 +0100, Michael S. Tsirkin wrote:
> On Mon, Jun 27, 2011 at 10:41:35AM +0100, Ian Campbell wrote:
> > On Sun, 2011-06-26 at 11:25 +0100, Michael S. Tsirkin wrote:
> > > On Fri, Jun 24, 2011 at 04:43:22PM +0100, Ian Campbell wrote:
> > > > In this mode guest data pages ("foreign pages") were mapped into the
> > > > backend domain (using Xen grant-table functionality) and placed into the
> > > > skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the
> > > > right term). Once the page is finished with netback unmaps it in order
> > > > to return it to the guest (we really want to avoid returning such pages
> > > > to the general allocation pool!).
> > > 
> > > Are the pages writeable by the source guest while netback processes
> > > them?  If yes, firewalling becomes unreliable as the packet can be
> > > modified after it's checked, right?
> > 
> > We only map the paged frags, the linear area is always copied (enough to
> > cover maximally sized TCP/IP, including options), for this reason.
> 
> Hmm. That'll cover the most common scenarios
> (such as port filtering) but not deep inspection.

Right.

> Not sure how important that is.
> 
> > > Also, for guest to guest communication, do you wait for
> > > the destination to stop looking at the packet in order
> > > to return it to the source? If yes, can source guest
> > > networking be disrupted by a slow destination?
> > 
> > There is a timeout which ultimately does a copy into dom0 memory and
> > frees up the domain grant for return to the sending guest.
> 
> Interesting. How long's the timeout?

1 second IIRC.

> > I suppose one difference with this is that it deals with data from
> > "dom0" userspace buffers rather than (what looks like) kernel memory,
> > although I don't know if that matters yet. Also it hangs off of struct
> > sock which netback doesn't have. Anyway I'll check it out.
> 
> I think the most important detail is the copy on clone approach.
> We can make it controlled by an skb flag if necessary.
> 
> > > > but IIRC honouring it universally turned into a
> > > > very twisty maze with a number of nasty corner cases etc.
> > > 
> > > Any examples? Are they covered by the patchset above?
> > 
> > It was quite a while ago so I don't remember many of the specifics.
> > Jeremy might remember better but for example any broadcast traffic
> > hitting a bridge (a very interesting case for Xen), seems like a likely
> > case? pcap was another one which I do remember, but that's obviously
> > less critical.
> 
> Last I looked I thought these clone the skb, so if a copy happens on
> clone things will work correctly?

Things should be correct, but won't necessarily perform well.

In particular if the clones (which become copies with this flag) are
frequent enough then there is no advantage to doing mapping instead of
just copying upfront; in fact it probably hurts overall.

Taking a quick look at the callers of skb_clone I also see skb_segment
in there. Since Xen tries to pass around large skbs (using LRO/GSO over
the PV interface) in order to amortise costs it is quite common for
things to undergo GSO as they hit the physical device. I'm not sure if
these commonly hit the specific code path which causes a clone though.

> > I presume with the TX zero-copy support the "copying due to attempted
> > clone" rate is low?
> 
> Yes. My understanding is that this version targets a non-bridged setup
> (guest connected to a macvlan on a physical dev) as the first step.

OK.

> > > > FWIW I proposed a session on the subject for LPC this year.
> > > We also plan to discuss this on kvm forum 2011
> > > (colocated with linuxcon 2011).
> > > http://www.linux-kvm.org/page/KVM_Forum_2011
> > 
> > I had already considered coming to LinuxCon for other reasons but
> > unfortunately I have family commitments around then :-(

> And I'm not coming to LPC this year :(

That's a shame.

Ian.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-27 10:54       ` Ian Campbell
@ 2011-06-27 11:19         ` Michael S. Tsirkin
  0 siblings, 0 replies; 21+ messages in thread
From: Michael S. Tsirkin @ 2011-06-27 11:19 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, Jeremy Fitzhardinge, xen-devel, mashirle, Rusty Russell

On Mon, Jun 27, 2011 at 11:54:02AM +0100, Ian Campbell wrote:
> On Mon, 2011-06-27 at 11:21 +0100, Michael S. Tsirkin wrote:
> > On Mon, Jun 27, 2011 at 10:41:35AM +0100, Ian Campbell wrote:
> > > On Sun, 2011-06-26 at 11:25 +0100, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 24, 2011 at 04:43:22PM +0100, Ian Campbell wrote:
> > > > > In this mode guest data pages ("foreign pages") were mapped into the
> > > > > backend domain (using Xen grant-table functionality) and placed into the
> > > > > skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the
> > > > > right term). Once the page is finished with netback unmaps it in order
> > > > > to return it to the guest (we really want to avoid returning such pages
> > > > > to the general allocation pool!).
> > > > 
> > > > Are the pages writeable by the source guest while netback processes
> > > > them?  If yes, firewalling becomes unreliable as the packet can be
> > > > modified after it's checked, right?
> > > 
> > > We only map the paged frags, the linear area is always copied (enough to
> > > cover maximally sized TCP/IP, including options), for this reason.
> > 
> > Hmm. That'll cover the most common scenarios
> > (such as port filtering) but not deep inspection.
> 
> Right.
> 
> > Not sure how important that is.
> > 
> > > > Also, for guest to guest communication, do you wait for
> > > > the destination to stop looking at the packet in order
> > > > to return it to the source? If yes, can source guest
> > > > networking be disrupted by a slow destination?
> > > 
> > > There is a timeout which ultimately does a copy into dom0 memory and
> > > frees up the domain grant for return to the sending guest.
> > 
> > Interesting. How long's the timeout?
> 
> 1 second IIRC.

I think that's unlikely to prevent networking disruption, only
complete loss of networking.

> > > I suppose one difference with this is that it deals with data from
> > > "dom0" userspace buffers rather than (what looks like) kernel memory,
> > > although I don't know if that matters yet. Also it hangs off of struct
> > > sock which netback doesn't have. Anyway I'll check it out.
> > 
> > I think the most important detail is the copy on clone approach.
> > We can make it controlled by an skb flag if necessary.
> > 
> > > > > but IIRC honouring it universally turned into a
> > > > > very twisty maze with a number of nasty corner cases etc.
> > > > 
> > > > Any examples? Are they covered by the patchset above?
> > > 
> > > It was quite a while ago so I don't remember many of the specifics.
> > > Jeremy might remember better but for example any broadcast traffic
> > > hitting a bridge (a very interesting case for Xen), seems like a likely
> > > case? pcap was another one which I do remember, but that's obviously
> > > less critical.
> > 
> > Last I looked I thought these clone the skb, so if a copy happens on
> > clone things will work correctly?
> 
> Things should be correct, but won't necessarily perform well.
> 
> In particular if the clones (which become copies with this flag) are
> frequent enough then there is no advantage to doing mapping instead of
> just copying upfront, in fact it probably hurts overall.

True. Further, the CPU used up by the copy isn't accounted for in the
appropriate cgroup.

> Taking a quick look at the callers of skb_clone I also see skb_segment
> in there. Since Xen tries to pass around large skbs (using LRO/GSO over
> the PV interface) in order to amortise costs it is quite common for
> things to undergo GSO as they hit the physical device. I'm not sure if
> these commonly hit the specific code path which causes a clone though.

Probably not, I think this patchset was tested with GSO as well.

> > > I presume with the TX zero-copy support the "copying due to attempted
> > > clone" rate is low?
> > 
> > Yes. My understanding is that this version targets a non-bridged setup
> > (guest connected to a macvlan on a physical dev) as the first step.
> 
> OK.
> 
> > > > > FWIW I proposed a session on the subject for LPC this year.
> > > > We also plan to discuss this on kvm forum 2011
> > > > (colocated with linuxcon 2011).
> > > > http://www.linux-kvm.org/page/KVM_Forum_2011
> > > 
> > > I had already considered coming to LinuxCon for other reasons but
> > > unfortunately I have family commitments around then :-(
> 
> > And I'm not coming to LPC this year :(
> 
> That's a shame.
> 
> Ian.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-24 17:56   ` Eric Dumazet
  2011-06-24 18:21     ` Jeremy Fitzhardinge
@ 2011-06-27 14:42     ` Ian Campbell
  2011-06-27 22:49       ` David Miller
  1 sibling, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2011-06-27 14:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jeremy Fitzhardinge, netdev, xen-devel, Rusty Russell

On Fri, 2011-06-24 at 18:56 +0100, Eric Dumazet wrote:
> On Friday 24 June 2011 at 10:29 -0700, Jeremy Fitzhardinge wrote:
> > On 06/24/2011 08:43 AM, Ian Campbell wrote:
> > > We've previously looked into solutions using the skb destructor callback
> > > but that falls over if the skb is cloned since you also need to know
> > > when the clone is destroyed. Jeremy Fitzhardinge and I subsequently
> > > looked at the possibility of a no-clone skb flag (i.e. always forcing a
> > > copy instead of a clone) but IIRC honouring it universally turned into a
> > > very twisty maze with a number of nasty corner cases etc. It also seemed
> > > that the proportion of SKBs which get cloned at least once appeared as
> > > if it could be quite high which would presumably make the performance
> > > impact unacceptable when using the flag. Another issue with using the
> > > skb destructor is that functions such as __pskb_pull_tail will eat (and
> > > free) pages from the start of the frag array such that by the time the
> > > skb destructor is called they are no longer there.
> > >
> > > AIUI Rusty Russell had previously looked into a per-page destructor in
> > > the shinfo but found that it couldn't be made to work (I don't remember
> > > why, or if I even knew at the time). Could that be an approach worth
> > > reinvestigating?
> > >
> > > I can't really think of any other solution which doesn't involve some
> > > sort of driver callback at the time a page is free()d.
> > 
> 
> This reminds me the packet mmap (tx path) games we play with pages.
> 
> net/packet/af_packet.c : tpacket_destruct_skb(), poking
> TP_STATUS_AVAILABLE back to user to tell him he can reuse space...

This is OK because af_packet involves no kernel side protocol and hence
there can be no retransmits etc? Otherwise you would suffer from the
same sorts of issues as I described with NFS at [0]?

However it seems like this might still have a problem if your SKBs are
ever cloned. What happens in this case, e.g. if a user of AF_PACKET sends
a broadcast via a device associated with a bridge[1] (where it would be
flooded)?

Wouldn't you get into the situation where the destructor of the initial
skb is called before one or more of the clones going to a different
destination were sent? So you would send TP_STATUS_AVAILABLE to
userspace before the stack was really finished with the page and run the
risk of userspace reusing the buffer, leading to incorrect bytes going
to some destinations?

It looks to me that anything which does any zero-copy type thing to the
network stack will have problems with one or both of protocol retransmit
or SKB clone. There's simply no mechanism to know when the stack is
really finished with a page.

Ian.

[0] http://marc.info/?l=linux-nfs&m=122424132729720&w=2
[1] Since DHCP clients typically use AF_PACKET, and you typically put
the IP address on the bridge itself in these configurations, this won't
be that unusual.

Ian.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-25 11:58           ` Ian Campbell
@ 2011-06-27 20:51             ` Jeremy Fitzhardinge
  2011-06-28 10:25               ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-06-27 20:51 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, rusty, xen-devel, David Miller, eric.dumazet

On 06/25/2011 12:58 PM, Ian Campbell wrote:
> On Fri, 2011-06-24 at 13:11 -0700, Jeremy Fitzhardinge wrote:
>> On 06/24/2011 12:46 PM, David Miller wrote:
>>> Pages get transferred between different SKBs all the time.
>>>
>>> For example, GRO makes extensive use of this technique.
>>> See net/core/skbuff.c:skb_gro_receive().
>>>
>>> It is just one example.
>> I see, and the new skb doesn't get a destructor copied from the
>> original, so there'd be no second callback.
> What about if we were to have a per-shinfo destructor (called once for
> each page as its refcount goes 1->0, from whichever skb ends up with the
> last ref) as well as the skb-destructors.

We never want the refcount for granted pages to go from 1 -> 0.  The
safest thing is to always elevate the refcount so that nothing else can
ever drop the last ref.

If we can trust the network stack to always do the last release (and not
hand it off to something else to do it), then we could have a destructor
which gets called before the last ref drop (or leaves the ref drop to
the destructor to do), and do everything required that way.  But it
seems pretty fragile.  At the very least it would need a thorough code
audit to make sure that everything handles page lifetimes in the
expected way - but then I'd still worry about out-of-tree patches
breaking something in subtle ways.

>  This already handles the
> cloning case but when pages are moved between shinfo then would it make
> sense for that to be propagated between skb's under these circumstances
> and/or require them to be the same? Since in the case of something like
> skb_gro_receive the skbs (and hence the frag array pages) are all from
> the same 'owner' (even if the skb is actually created by the stack on
> their behalf) I suspect this could work?
>
> But I bet this assumption isn't valid in all cases.

Hm.

> In which case I end up wondering about a destructor per page in the frag
> array. At which point we might as well consider it as a part of the core
> mm stuff rather than something net specific?

Doing it generically still needs some kind of marker that the page has a
special-case destructor (and the destructor pointer itself).

    J

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-27 14:42     ` Ian Campbell
@ 2011-06-27 22:49       ` David Miller
  2011-06-28 10:24         ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2011-06-27 22:49 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: netdev, jeremy, xen-devel, eric.dumazet, rusty

From: Ian Campbell <Ian.Campbell@eu.citrix.com>
Date: Mon, 27 Jun 2011 15:42:04 +0100

> However it seems like this might still have a problem if your SKBs are
> ever cloned. What happens in this case, e.g if a user of AF_PACKET sends
> a broadcast via a device associated with a bridge[1] (where it would be
> flooded)?

You don't need a bridge to get a clone on transmit; the packet
scheduler can do clones.  Just grep for skb_clone in the packet
action handlers, net/sched/act_*.c

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-27 22:49       ` David Miller
@ 2011-06-28 10:24         ` Ian Campbell
  0 siblings, 0 replies; 21+ messages in thread
From: Ian Campbell @ 2011-06-28 10:24 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jeremy, xen-devel, eric.dumazet, rusty

On Mon, 2011-06-27 at 23:49 +0100, David Miller wrote:
> From: Ian Campbell <Ian.Campbell@eu.citrix.com>
> Date: Mon, 27 Jun 2011 15:42:04 +0100
> 
> > However it seems like this might still have a problem if your SKBs are
> > ever cloned. What happens in this case, e.g if a user of AF_PACKET sends
> > a broadcast via a device associated with a bridge[1] (where it would be
> > flooded)?
> 
> You don't need a bridge to get a clone on transmit; the packet
> scheduler can do clones.  Just grep for skb_clone in the packet
> action handlers, net/sched/act_*.c

Are you sure? I only see skb_cloned() and skb_clone_writable() under
there (3.0-rc4) and not any actual skb_clone()s.

The only actual clone I see under there is in net/sched/sch_netem.c.

However it sounds like it is expected that a clone can happen on pretty
much any skb, which makes the frag lifecycle issue seem like one which
could affect anything that sends a page to the network without
relinquishing complete control of it (common in any kind of zero-copy
scenario).

Ian.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: SKB paged fragment lifecycle on receive
  2011-06-27 20:51             ` Jeremy Fitzhardinge
@ 2011-06-28 10:25               ` Ian Campbell
  0 siblings, 0 replies; 21+ messages in thread
From: Ian Campbell @ 2011-06-28 10:25 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: netdev, rusty, xen-devel, David Miller, eric.dumazet

On Mon, 2011-06-27 at 21:51 +0100, Jeremy Fitzhardinge wrote:
> On 06/25/2011 12:58 PM, Ian Campbell wrote:
> > On Fri, 2011-06-24 at 13:11 -0700, Jeremy Fitzhardinge wrote:
> >> On 06/24/2011 12:46 PM, David Miller wrote:
> >>> Pages get transferred between different SKBs all the time.
> >>>
> >>> For example, GRO makes extensive use of this technique.
> >>> See net/core/skbuff.c:skb_gro_receive().
> >>>
> >>> It is just one example.
> >> I see, and the new skb doesn't get a destructor copied from the
> >> original, so there'd be no second callback.
> > What about if we were to have a per-shinfo destructor (called once for
> > each page as its refcount goes 1->0, from whichever skb ends up with the
> > last ref) as well as the skb-destructors.
> 
> We never want the refcount for granted pages to go from 1 -> 0.  The
> safest thing is to always elevate the refcount so that nothing else can
> ever drop the last ref.

I guess I meant called just after the refcount goes 2->1 or something.
_But_... thinking about it some more I don't think this scheme works in
the general case because the entity injecting into the network stack may
not be the only reference count holder.

Consider the NFS case -- in that case there is already a reference
because the page belongs to userspace. You certainly don't want to wait
for the process to exit before considering the page done with.

You may also have multiple reference counts due to multiple threads
doing I/O on (overlapping or not) portions of the same page etc.

So now we're into the realms of a per-frag reference count and
destructor in the shinfo, which reintroduces the whole question of what
happens if a page is copied/moved to another shinfo.

Could we add a per-frag pointer to the shinfo, pointing to:
	struct {
		atomic_t ref;
		void (*destructor)(struct sk_buff *skb, int fragnr /* more args TBD */);
		void *data; /* user data, maybe */
	};
This data structure is owned by whoever injected the page into the stack
and the pointer to it is propagated everywhere that page goes, including
across skb splits and pages moving or copied between skbs etc.

The split between ref counting on this and on struct page needs a bit of
careful thought, but perhaps it works like this: iff this struct exists,
the stack holds a single struct page refcount and uses this new refcount
for everything else, dropping the struct page refcount when the frag
refcount hits 0. Otherwise it simply uses the struct page refcount as
normal.
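
To make the accounting concrete, a sketch (nothing below exists in the
tree; the struct mirrors the one above, and for brevity the callback
just gets the destructor structure back rather than the skb and frag
index):

	struct frag_destructor {
		atomic_t ref;
		void (*destructor)(struct frag_destructor *d);
		void *data;		/* user data, maybe */
	};

	/* A frag with the extra pointer bolted on (hypothetical --
	 * skb_frag_t has no such field today): */
	struct frag {
		struct page *page;
		struct frag_destructor *destructor; /* NULL => plain refcounting */
	};

	static inline void frag_get(struct frag *frag)
	{
		if (frag->destructor)
			atomic_inc(&frag->destructor->ref);
		else
			get_page(frag->page);
	}

	static inline void frag_put(struct frag *frag)
	{
		if (!frag->destructor) {
			put_page(frag->page);
			return;
		}
		if (atomic_dec_and_test(&frag->destructor->ref)) {
			/* drop the single struct page ref the stack held */
			put_page(frag->page);
			frag->destructor->destructor(frag->destructor);
		}
	}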

> If we can trust the network stack to always do the last release (and not
> hand it off to something else to do it), then we could have a destructor
> which gets called before the last ref drop (or leaves the ref drop to
> the destructor to do), and do everything required that way.  But it
> seems pretty fragile.  At the very least it would need a thorough code
> audit to make sure that everything handles page lifetimes in the
> expected way - but then I'd still worry about out-of-tree patches
> breaking something in subtle ways.
> 
> >  This already handles the
> > cloning case but when pages are moved between shinfo then would it make
> > sense for that to be propagated between skb's under these circumstances
> > and/or require them to be the same? Since in the case of something like
> > skb_gro_receive the skbs (and hence the frag array pages) are all from
> > the same 'owner' (even if the skb is actually created by the stack on
> > their behalf) I suspect this could work?
> >
> > But I bet this assumption isn't valid in all cases.
> 
> Hm.

E.g. if you have some sort of tunnel protocol which is encapsulating
multiple skb streams into a single one, might you get an skb with pages
from multiple sources?

> > In which case I end up wondering about a destructor per page in the frag
> > array. At which point we might as well consider it as a part of the core
> > mm stuff rather than something net specific?
> 
> Doing it generically still needs some kind of marker that the page has a
> special-case destructor (and the destructor pointer itself).

True. You only really need the destructor pointer (which can be NULL),
but yes, it would add stuff to struct page which may not be useful
elsewhere.

Ian.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2011-06-24 15:43 SKB paged fragment lifecycle on receive Ian Campbell
2011-06-24 15:43 ` Ian Campbell
2011-06-24 17:29 ` Jeremy Fitzhardinge
2011-06-24 17:56   ` Eric Dumazet
2011-06-24 18:21     ` Jeremy Fitzhardinge
2011-06-24 19:46       ` David Miller
2011-06-24 20:11         ` Jeremy Fitzhardinge
2011-06-24 20:27           ` David Miller
2011-06-25 11:58           ` Ian Campbell
2011-06-27 20:51             ` Jeremy Fitzhardinge
2011-06-28 10:25               ` Ian Campbell
2011-06-27 14:42     ` Ian Campbell
2011-06-27 22:49       ` David Miller
2011-06-28 10:24         ` Ian Campbell
2011-06-24 22:44   ` Ian Campbell
2011-06-24 22:48     ` Jeremy Fitzhardinge
2011-06-26 10:25 ` Michael S. Tsirkin
2011-06-27  9:41   ` Ian Campbell
2011-06-27 10:21     ` Michael S. Tsirkin
2011-06-27 10:54       ` Ian Campbell
2011-06-27 11:19         ` Michael S. Tsirkin
