linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
@ 2016-07-07  6:25 Roland Dreier
  2016-07-07 18:13 ` Alexander Duyck
  0 siblings, 1 reply; 8+ messages in thread
From: Roland Dreier @ 2016-07-07  6:25 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-rdma
  Cc: Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Resurrecting this old thread, because the patch that ultimately went
upstream (commit 9207f9d45b0a / net: preserve IP control block during
GSO segmentation) causes a huge IPoIB performance regression (to the
point of being unusable):
https://bugzilla.kernel.org/show_bug.cgi?id=111921

I don't think anyone has explained what goes wrong or why IPoIB works
the way it does.  The underlying difference that IPoIB has from other
drivers is that there are two levels of address resolution.  First,
normal ARP / ND resolves an IP address to a "hardware" address.  The
difference is that in IPoIB, the hardware address is an IB GID (plus a
QPN, but we can ignore that).  To actually send data to that GID, the
IPoIB driver has to do a second lookup - it needs to ask the IB subnet
manager for a path record that tells it how to reach that GID.

In particular this means that "destination address" (as the IP / ARP
layer understands it) actually isn't in the packet anywhere - there's
nothing like an ethernet header as there is for "normal" network
drivers.  Instead, the driver stashes the address in skb->cb during
hard_header_ops->create() and then looks at it in the xmit routine -
this was designed way back around when commit a0417fa3a18a / net: Make
qdisc_skb_cb upper size bound explicit. was merged.  The expectation
was that the part of the cb after sizeof (struct qdisc_skb_cb) would
be preserved.

The problem with commit 9207f9d45b0a is that GSO operations now access
cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of
where IPoIB stashes its hwaddr.

It seems that the intent of the commit is to preserve the IP control
block - struct inet_skb_parm (and presumably struct inet6_skb_parm) -
even when using SKB_GSO_CB().  Seems like both inet_skb_parm and
inet6_skb_parm are 20 bytes.  IPoIB uses the part of cb after 28
bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and
set SKB_SGO_CB_OFFSET to 20, then everything would work.  The struct
is

struct skb_gso_cb {
        int     mac_offset;
        int     encap_level;
        __u16   csum_start;
};

is it feasible to make encap_level a __u16 (which would make the
overall struct exactly 8 bytes)?  If I understand this correctly, 64K
nested encapsulations seems like quite a bit for a packet...

Or, earlier in this thread, having the GSO in ip_output and other gso
paths save and restore the IP/IP6 control block was suggested as an
alternate approach.  I don't know if there are performance
implications to that.

What is the best way to keep the crash fix but not kill IPoIB performance?

Thanks!
 - R.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-07-07  6:25 Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation Roland Dreier
@ 2016-07-07 18:13 ` Alexander Duyck
  2016-07-07 22:01   ` Roland Dreier
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Duyck @ 2016-07-07 18:13 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Konstantin Khlebnikov, linux-rdma, Florian Westphal,
	Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Wed, Jul 6, 2016 at 11:25 PM, Roland Dreier <roland@purestorage.com> wrote:
> On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> Or just shift GSO CB and add couple checks like
>> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));
>
> Resurrecting this old thread, because the patch that ultimately went
> upstream (commit 9207f9d45b0a / net: preserve IP control block during
> GSO segmentation) causes a huge IPoIB performance regression (to the
> point of being unusable):
> https://bugzilla.kernel.org/show_bug.cgi?id=111921
>
> I don't think anyone has explained what goes wrong or why IPoIB works
> the way it does.  The underlying difference that IPoIB has from other
> drivers is that there are two levels of address resolution.  First,
> normal ARP / ND resolves an IP address to a "hardware" address.  The
> difference is that in IPoIB, the hardware address is an IB GID (plus a
> QPN, but we can ignore that).  To actually send data to that GID, the
> IPoIB driver has to do a second lookup - it needs to ask the IB subnet
> manager for a path record that tells it how to reach that GID.
>
> In particular this means that "destination address" (as the IP / ARP
> layer understands it) actually isn't in the packet anywhere - there's
> nothing like an ethernet header as there is for "normal" network
> drivers.  Instead, the driver stashes the address in skb->cb during
> hard_header_ops->create() and then looks at it in the xmit routine -
> this was designed way back around when commit a0417fa3a18a / net: Make
> qdisc_skb_cb upper size bound explicit. was merged.  The expectation
> was that the part of the cb after sizeof (struct qdisc_skb_cb) would
> be preserved.
>
> The problem with commit 9207f9d45b0a is that GSO operations now access
> cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of
> where IPoIB stashes its hwaddr.

Really I don't know if the problem is GSO so much as the desire to
maintain the control buffer between layers.  Really this kind of
violates how we have documented the control buffer in skbuff.h.

> It seems that the intent of the commit is to preserve the IP control
> block - struct inet_skb_parm (and presumably struct inet6_skb_parm) -
> even when using SKB_GSO_CB().  Seems like both inet_skb_parm and
> inet6_skb_parm are 20 bytes.  IPoIB uses the part of cb after 28
> bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and
> set SKB_SGO_CB_OFFSET to 20, then everything would work.  The struct
> is
>
> struct skb_gso_cb {
>         int     mac_offset;
>         int     encap_level;
>         __u16   csum_start;
> };

This is based on an out-dated version of this struct.  The 4.7 RC
kernel has a few more fields that were added to support local checksum
offload for encapsulated frames.

> is it feasible to make encap_level a __u16 (which would make the
> overall struct exactly 8 bytes)?  If I understand this correctly, 64K
> nested encapsulations seems like quite a bit for a packet...

The smallest we could probably get this structure at this point would
be around 10 bytes assuming we could make the offsets and encap level
only be a 16 bit field.  The added size is due to a wsum that was
added to keep a running checksum so we could keep csum_offset and
csum_start fields while generating a running checksum for the frame in
GSO.

> Or, earlier in this thread, having the GSO in ip_output and other gso
> paths save and restore the IP/IP6 control block was suggested as an
> alternate approach.  I don't know if there are performance
> implications to that.

Copying a block of data back and forth will always come at a cost.  In
addition, I don't see why we need to be moving the IP/IP6 control
block around.  If the IPoIB hwaddr is the only concern why couldn't it
be saved/restored instead?

> What is the best way to keep the crash fix but not kill IPoIB performance?

It seems like what would probably need to happen is to move where the
IPoIB address is stored.  I'm not sure the control buffer is really
the best place for it since the cb gets overwritten at various levels,
and storing 20 bytes makes it hard to avoid bumping up against the
size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
generated around the same time we generate the L2 header for the
frame, I wonder if you couldn't get away with using a bit of extra skb
headroom to store it and then use a offset from the MAC header to
access it.  An added bonus would be that with a few tricks with
SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
that you copy the hwaddr when you copy the header for each fragment
instead of having to go and copy the hwaddr out of the cb and clone it
for each frame.

Anyway I don't have that strong an understanding on IPoIB, but thought
I would add my $.02 since I noticed you were referencing an outdated
skb_gso_cb structure.

- Alex

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-07-07 18:13 ` Alexander Duyck
@ 2016-07-07 22:01   ` Roland Dreier
  2016-07-07 22:57     ` Alexander Duyck
  2016-07-07 23:14     ` Jason Gunthorpe
  0 siblings, 2 replies; 8+ messages in thread
From: Roland Dreier @ 2016-07-07 22:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Konstantin Khlebnikov, linux-rdma, Florian Westphal,
	Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

>> struct skb_gso_cb {
>>         int     mac_offset;
>>         int     encap_level;
>>         __u16   csum_start;
>> };

> This is based on an out-dated version of this struct.  The 4.7 RC
> kernel has a few more fields that were added to support local checksum
> offload for encapsulated frames.

Thanks for pointing that out.  I hit the perf regression on 4.4.y
(stable) and looked at the struct there.  I see that latest upstream
has changed, and I agree that this struct really can't shrink below 10
bytes.

Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes,
we're 2 bytes over the 48 that are available in cb[].  So this is
harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET
unfortunately.

>> What is the best way to keep the crash fix but not kill IPoIB performance?
>
> It seems like what would probably need to happen is to move where the
> IPoIB address is stored.  I'm not sure the control buffer is really
> the best place for it since the cb gets overwritten at various levels,
> and storing 20 bytes makes it hard to avoid bumping up against the
> size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
> generated around the same time we generate the L2 header for the
> frame, I wonder if you couldn't get away with using a bit of extra skb
> headroom to store it and then use a offset from the MAC header to
> access it.  An added bonus would be that with a few tricks with
> SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
> that you copy the hwaddr when you copy the header for each fragment
> instead of having to go and copy the hwaddr out of the cb and clone it
> for each frame.

Can we assume there are 20 bytes of skb headroom available?  What if
we're forwarding an skb received on an Ethernet device?

The reason we moved to the cb storage is that in the past, trying to
hide some data in the actual skb buffer that we don't actually send
led to some awkward-at-best code.  (As I recall GRO was difficult to
handle before commit 936d7de3d736 "IPoIB: Stop lying about
hard_header_len and use skb->cb to stash LL addresses")  But maybe
there's a third way to handle this other than the old way and the
skb->cb way.

 - R.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-07-07 22:01   ` Roland Dreier
@ 2016-07-07 22:57     ` Alexander Duyck
  2016-07-07 23:14     ` Jason Gunthorpe
  1 sibling, 0 replies; 8+ messages in thread
From: Alexander Duyck @ 2016-07-07 22:57 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Konstantin Khlebnikov, linux-rdma, Florian Westphal,
	Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Thu, Jul 7, 2016 at 3:01 PM, Roland Dreier <roland@purestorage.com> wrote:
>>> struct skb_gso_cb {
>>>         int     mac_offset;
>>>         int     encap_level;
>>>         __u16   csum_start;
>>> };
>
>> This is based on an out-dated version of this struct.  The 4.7 RC
>> kernel has a few more fields that were added to support local checksum
>> offload for encapsulated frames.
>
> Thanks for pointing that out.  I hit the perf regression on 4.4.y
> (stable) and looked at the struct there.  I see that latest upstream
> has changed, and I agree that this struct really can't shrink below 10
> bytes.
>
> Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes,
> we're 2 bytes over the 48 that are available in cb[].  So this is
> harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET
> unfortunately.
>
>>> What is the best way to keep the crash fix but not kill IPoIB performance?
>>
>> It seems like what would probably need to happen is to move where the
>> IPoIB address is stored.  I'm not sure the control buffer is really
>> the best place for it since the cb gets overwritten at various levels,
>> and storing 20 bytes makes it hard to avoid bumping up against the
>> size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
>> generated around the same time we generate the L2 header for the
>> frame, I wonder if you couldn't get away with using a bit of extra skb
>> headroom to store it and then use a offset from the MAC header to
>> access it.  An added bonus would be that with a few tricks with
>> SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
>> that you copy the hwaddr when you copy the header for each fragment
>> instead of having to go and copy the hwaddr out of the cb and clone it
>> for each frame.
>
> Can we assume there are 20 bytes of skb headroom available?  What if
> we're forwarding an skb received on an Ethernet device?

The space should be there since a standard Ethernet device should be
reserving about 64B for headroom for Rx traffic assuming it is using
something like napi_alloc_skb.  I'm not sure how much you would need
for Infiniband headers and such, but I know the 20B for the address
would at least be there.

> The reason we moved to the cb storage is that in the past, trying to
> hide some data in the actual skb buffer that we don't actually send
> led to some awkward-at-best code.  (As I recall GRO was difficult to
> handle before commit 936d7de3d736 "IPoIB: Stop lying about
> hard_header_len and use skb->cb to stash LL addresses")  But maybe
> there's a third way to handle this other than the old way and the
> skb->cb way.

Well the description for that patch seems to indicate the problem was
the pseudo header length being included in the hard_header_len.  It
seems like other drivers include the length of such headers in
needed_headroom, although usually those types of headers don't get
added until after the devices ndo_start_xmit is called so I am not
sure if there would be any issues with trying to use needed_headroom
to indicate space needed for a pseudo header added in the header
creation call.

- Alex

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-07-07 22:01   ` Roland Dreier
  2016-07-07 22:57     ` Alexander Duyck
@ 2016-07-07 23:14     ` Jason Gunthorpe
  2016-07-08 14:18       ` Roland Dreier
  1 sibling, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2016-07-07 23:14 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Alexander Duyck, Konstantin Khlebnikov, linux-rdma,
	Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Thu, Jul 07, 2016 at 03:01:40PM -0700, Roland Dreier wrote:

> The reason we moved to the cb storage is that in the past, trying to
> hide some data in the actual skb buffer that we don't actually send

We have neighbour_priv, and ndo_neigh_construct/destruct now ..

A first blush that would seem to be enough to let ipoib store the AH
and other path information in the neigh and avoid the cb? At least the
example in clip sure looks like what ipoib needs to do.

Jason

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-07-07 23:14     ` Jason Gunthorpe
@ 2016-07-08 14:18       ` Roland Dreier
  2016-07-08 16:51         ` Jason Gunthorpe
  0 siblings, 1 reply; 8+ messages in thread
From: Roland Dreier @ 2016-07-08 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Duyck, Konstantin Khlebnikov, linux-rdma,
	Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> We have neighbour_priv, and ndo_neigh_construct/destruct now ..
>
> A first blush that would seem to be enough to let ipoib store the AH
> and other path information in the neigh and avoid the cb? At least the
> example in clip sure looks like what ipoib needs to do.

Do you think those new facilities let us go back to using the neigh
and still avoid the issues that led to commit b63b70d87741 ("IPoIB:
Use a private hash table for path lookup in xmit path")?

 - R.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-07-08 14:18       ` Roland Dreier
@ 2016-07-08 16:51         ` Jason Gunthorpe
  2016-07-08 21:17           ` Roland Dreier
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2016-07-08 16:51 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Alexander Duyck, Konstantin Khlebnikov, linux-rdma,
	Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Fri, Jul 08, 2016 at 07:18:11AM -0700, Roland Dreier wrote:
> On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > We have neighbour_priv, and ndo_neigh_construct/destruct now ..
> >
> > A first blush that would seem to be enough to let ipoib store the AH
> > and other path information in the neigh and avoid the cb? At least the
> > example in clip sure looks like what ipoib needs to do.
> 
> Do you think those new facilities let us go back to using the neigh
> and still avoid the issues that led to commit b63b70d87741 ("IPoIB:
> Use a private hash table for path lookup in xmit path")?

Well, the priv stuff were brought up in the discussion around
b63b70d87741 but never fully analyzed. Maybe it could have been used
to solve that problem, who knows.. I guess it doesn't help this exact
issue because we don't have a dst at hard header time anyhow.

But, DaveM suggested how to handle our current problem in the above thread:

http://marc.info/?l=linux-rdma&m=132813323907877&w=2

Which is the same route CLIP took:

331         struct dst_entry *dst = skb_dst(skb);
347         rt = (struct rtable *) dst;
348         if (rt->rt_gateway)
349                 daddr = &rt->rt_gateway;
350         else
351                 daddr = &ip_hdr(skb)->daddr;
352         n = dst_neigh_lookup(dst, daddr);

(DaveM said it should be &ip/ipv6_hdr(skb)->daddr, not the rtable cast)

Last time this was brought up you were concerned about ARP, ARP
sets skb_dst after calling dev_hard_header:

310         skb = arp_create(type, ptype, dest_ip, dev, src_ip,
311                          dest_hw, src_hw, target_hw);
312         if (!skb)
313                 return;
314
315         skb_dst_set(skb, dst_clone(dst));

However, there is at least one fringe case (arp_send) where the dst is
left NULL. Presumably there are other fringe cases too..

So, it appears, the dst and neigh can be used for all performances cases.

For the non performance dst == null case, can we just burn cycles and
stuff the daddr in front of the packet at hardheader time, even if we
have to copy?

Jason

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-07-08 16:51         ` Jason Gunthorpe
@ 2016-07-08 21:17           ` Roland Dreier
  0 siblings, 0 replies; 8+ messages in thread
From: Roland Dreier @ 2016-07-08 21:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Duyck, Konstantin Khlebnikov, linux-rdma,
	Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Fri, Jul 8, 2016 at 9:51 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> So, it appears, the dst and neigh can be used for all performances cases.
>
> For the non performance dst == null case, can we just burn cycles and
> stuff the daddr in front of the packet at hardheader time, even if we
> have to copy?

OK, sounds interesting.

Unfortunately the scope of this work has gotten to the point where I
can't take it on right now.  My system is running 4.4.y for now
(before struct skb_gso_cb grew) so I think shrinking struct skb_gso_cb
to 8 bytes plus changing SKB_SGO_CB_OFFSET to 20 will work for now.
Hope someone is able to come up with a real fix before I need to
upgrade to 4.10.y...

 - R.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-07-08 21:18 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-07  6:25 Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation Roland Dreier
2016-07-07 18:13 ` Alexander Duyck
2016-07-07 22:01   ` Roland Dreier
2016-07-07 22:57     ` Alexander Duyck
2016-07-07 23:14     ` Jason Gunthorpe
2016-07-08 14:18       ` Roland Dreier
2016-07-08 16:51         ` Jason Gunthorpe
2016-07-08 21:17           ` Roland Dreier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).