* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
@ 2013-01-09  5:15 Paul Pearce
  2013-01-09  6:06 ` Ani Sinha
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Pearce @ 2013-01-09  5:15 UTC (permalink / raw)
  To: netdev; +Cc: dborkman, edumazet, Ani Sinha, jpirko

On Tue, 2013-01-08 at 19:51 +0100, Daniel Borkmann wrote:
> VLAN packets that are locally injected through taps will lose their
> skb->vlan_tci value when they pass dev_hard_start_xmit and get looped
> back to a packet sniffer via dev_queue_xmit_nit. Besides others, this
> meta data is used in Linux socket filtering for VLANs. Tested with a
> VLAN ancillary ops filter.
>
> Patch is based on a previous version by Jiri Pirko.

I think there may be issues with the patch beyond Eric's comments. It
seems to trash packet contents.

I applied this patch to Fedora flavored kernel 3.6.11-1.fc16.x86_64.
vlan tagged packets injected via libpcap's pcap_inject() came out
mangled at the packet filters.

The following injected packet:

01:01:01:01:01:01 > 02:02:02:02:02:02, ethertype 802.1Q (0x8100),
length 64: vlan 99, p 0, ethertype ARP, Request who-has 192.168.0.1
tell 192.168.0.1, length 46
0x0000:  0202 0202 0202 0101 0101 0101 8100 0063
0x0010:  0806 0001 0800 0604 0001 0025 6438 8afc
0x0020:  c0a8 0001 0000 0000 0000 c0a8 0001 0000
0x0030:  0000 0000 0000 0000 0000 0000 0000 0000

Arrived as this:

01:01:81:00:00:63 > 02:02:01:01:01:01, ethertype 802.1Q (0x8100),
length 64: vlan 514, p 0, ethertype ARP, Request who-has 192.168.0.1
tell 192.168.0.1, length 46
0x0000:  0202 0101 0101 0101 8100 0063 8100 0202
0x0010:  0806 0001 0800 0604 0001 0025 6438 8afc
0x0020:  c0a8 0001 0000 0000 0000 c0a8 0001 0000
0x0030:  0000 0000 0000 0000 0000 0000 0000 0000

It also might be worth noting the modified libpcap is able to identify
this packet with the filter "vlan" or "vlan 514". Prior to this kernel
patch such a packet could not be identified with any vlan filter.
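As a cross-check of the two dumps above, here is a minimal Python sketch that decodes the first 16 bytes of each frame. The helper names (parse_vlan, fmt_mac) are made up for illustration; the hex strings are copied from the dumps.

```python
import struct

def fmt_mac(mac):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in mac)

def parse_vlan(frame_hex):
    """Return (dst, src, tpid, vlan_id) for an 802.1Q-tagged Ethernet frame."""
    raw = bytes.fromhex(frame_hex.replace(" ", ""))
    # Bytes 12-13 hold the TPID, bytes 14-15 the TCI (VLAN ID in the low 12 bits).
    tpid, tci = struct.unpack("!HH", raw[12:16])
    return fmt_mac(raw[0:6]), fmt_mac(raw[6:12]), tpid, tci & 0x0FFF

injected = "0202 0202 0202 0101 0101 0101 8100 0063"
arrived = "0202 0101 0101 0101 8100 0063 8100 0202"

print(parse_vlan(injected))  # ('02:02:02:02:02:02', '01:01:01:01:01:01', 33024, 99)
print(parse_vlan(arrived))   # ('02:02:01:01:01:01', '01:01:81:00:00:63', 33024, 514)
```

In the arrived frame the original tag bytes (8100 0063) sit at offsets 8-11, inside what is parsed as the source MAC, and the TCI slot holds 0x0202 = 514, which matches the "vlan 514" reported by the filter.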

If this isn't a problem with the patch, perhaps I'm missing a
necessary post-3.6.11 patch?

Thoughts?


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09  5:15 [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value Paul Pearce
@ 2013-01-09  6:06 ` Ani Sinha
  2013-01-09  6:27   ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Ani Sinha @ 2013-01-09  6:06 UTC (permalink / raw)
  To: Paul Pearce; +Cc: netdev, dborkman, edumazet, Jiri Pirko

On Tue, Jan 8, 2013 at 9:15 PM, Paul Pearce <pearce@cs.berkeley.edu> wrote:
> On Tue, 2013-01-08 at 19:51 +0100, Daniel Borkmann wrote:
>> VLAN packets that are locally injected through taps will lose their
>> skb->vlan_tci value when they pass dev_hard_start_xmit and get looped
>> back to a packet sniffer via dev_queue_xmit_nit. Besides others, this
>> meta data is used in Linux socket filtering for VLANs. Tested with a
>> VLAN ancillary ops filter.
>>
>> Patch is based on a previous version by Jiri Pirko.
>
> I think there may be issues with the patch beyond Eric's comments. It
> seems to trash packet contents.
>
> I applied this patch to Fedora flavored kernel 3.6.11-1.fc16.x86_64.
> vlan tagged packets injected via libpcap's pcap_inject() came out
> mangled at the packet filters.
>

The proposed patch tries to fix the issue that arose after the
following commit :

commit b40863c667c16b7a73d4f034a8eab67029b5b15a
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Sep 18 20:44:49 2012 +0000

    net: more accurate network taps in transmit path


I do not believe 3.6.11 kernel has this change. 3.6.11 should not need
the patch.

ani


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09  6:06 ` Ani Sinha
@ 2013-01-09  6:27   ` Eric Dumazet
  2013-01-09  6:34     ` Ani Sinha
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2013-01-09  6:27 UTC (permalink / raw)
  To: Ani Sinha; +Cc: Paul Pearce, netdev, dborkman, edumazet, Jiri Pirko

On Tue, 2013-01-08 at 22:06 -0800, Ani Sinha wrote:

> The proposed patch tries to fix the issue that arose after the
> following commit :
> 
> commit b40863c667c16b7a73d4f034a8eab67029b5b15a
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Tue Sep 18 20:44:49 2012 +0000
> 
>     net: more accurate network taps in transmit path
> 
> 
> I do not believe 3.6.11 kernel has this change. 3.6.11 should not need
> the patch.

That's irrelevant. This only shows that user land was depending on a
prior undocumented behavior.

It seems a libpcap issue to me. Kernel side provides all needed bits.

When I want "tcpdump src port 2030", filter is :

(000) ldh      [12]
(001) jeq      #0x86dd          jt 2	jf 8
(002) ldb      [20]
(003) jeq      #0x84            jt 6	jf 4
(004) jeq      #0x6             jt 6	jf 5
(005) jeq      #0x11            jt 6	jf 19
(006) ldh      [54]
(007) jeq      #0x7ee           jt 18	jf 19
(008) jeq      #0x800           jt 9	jf 19
(009) ldb      [23]
(010) jeq      #0x84            jt 13	jf 11
(011) jeq      #0x6             jt 13	jf 12
(012) jeq      #0x11            jt 13	jf 19
(013) ldh      [20]
(014) jset     #0x1fff          jt 19	jf 15
(015) ldxb     4*([14]&0xf)
(016) ldh      [x + 14]
(017) jeq      #0x7ee           jt 18	jf 19
(018) ret      #96
(019) ret      #0

See how it handles both IPv4 and IPv6, and various protocols
automatically ?

If I only wanted "udp and src port 2030" it would give :

(000) ldh      [12]
(001) jeq      #0x86dd          jt 2	jf 6
(002) ldb      [20]
(003) jeq      #0x11            jt 4	jf 15
(004) ldh      [54]
(005) jeq      #0x7ee           jt 14	jf 15
(006) jeq      #0x800           jt 7	jf 15
(007) ldb      [23]
(008) jeq      #0x11            jt 9	jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15	jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 14]
(013) jeq      #0x7ee           jt 14	jf 15
(014) ret      #96
(015) ret      #0



So when I want "tcpdump vlan 100" it generates :

(000) ldh      [12]
(001) jeq      #0x8100          jt 2	jf 6
(002) ldh      [14]
(003) and      #0xfff
(004) jeq      #0x64            jt 5	jf 6
(005) ret      #96
(006) ret      #0

What's wrong with instructing libpcap to extend the filter so that it
gets the correct result whether the vlan id is in skb->vlan_tci (vlan
accel on) or in the packet itself (vlan accel off)?

This way, you could choose whether you want to get only accelerated vlan,
or non accelerated vlan, or both. And you need no kernel hacking.
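The idea of a filter that covers both cases can be modelled in a few lines of Python. This is only a sketch of the matching logic, not libpcap's code generator and not BPF: the function names are invented here, and meta_tci stands in for a tag that the kernel keeps outside the frame.

```python
VLAN_TPID = 0x8100

def vlan_id_inline(frame):
    """VLAN ID from an 802.1Q tag carried in the frame itself, else None."""
    if len(frame) >= 16 and int.from_bytes(frame[12:14], "big") == VLAN_TPID:
        return int.from_bytes(frame[14:16], "big") & 0x0FFF
    return None

def match_vlan(frame, want, meta_tci=None, accel=True, inline=True):
    """Match VLAN `want` in metadata (accel) and/or in the frame (inline).

    meta_tci models a tag held outside the frame; None means no such tag.
    """
    if accel and meta_tci is not None and (meta_tci & 0x0FFF) == want:
        return True
    return inline and vlan_id_inline(frame) == want

macs = bytes.fromhex("020202020202010101010101")
tagged = macs + bytes.fromhex("81000064") + bytes.fromhex("0806")
untagged = macs + bytes.fromhex("0806")

print(match_vlan(tagged, 100))                              # True
print(match_vlan(untagged, 100, meta_tci=100))              # True
print(match_vlan(untagged, 100, meta_tci=100, accel=False)) # False
```

The accel and inline switches correspond to the choice described above: accelerated only, non accelerated only, or both.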


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09  6:27   ` Eric Dumazet
@ 2013-01-09  6:34     ` Ani Sinha
  2013-01-09 19:27       ` Ani Sinha
  0 siblings, 1 reply; 18+ messages in thread
From: Ani Sinha @ 2013-01-09  6:34 UTC (permalink / raw)
  To: Eric Dumazet, tcpdump-workers
  Cc: Paul Pearce, netdev, dborkman, edumazet, Jiri Pirko

+tcpdump-workers


On Tue, Jan 8, 2013 at 10:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-01-08 at 22:06 -0800, Ani Sinha wrote:
>
>> The proposed patch tries to fix the issue that arose after the
>> following commit :
>>
>> commit b40863c667c16b7a73d4f034a8eab67029b5b15a
>> Author: Eric Dumazet <edumazet@google.com>
>> Date:   Tue Sep 18 20:44:49 2012 +0000
>>
>>     net: more accurate network taps in transmit path
>>
>>
>> I do not believe 3.6.11 kernel has this change. 3.6.11 should not need
>> the patch.
>
> That's irrelevant. This only shows that user land was depending on a
> prior undocumented behavior.
>
> It seems a libpcap issue to me. Kernel side provides all needed bits.
>
> When I want "tcpdump src port 2030", filter is :
>
> (000) ldh      [12]
> (001) jeq      #0x86dd          jt 2    jf 8
> (002) ldb      [20]
> (003) jeq      #0x84            jt 6    jf 4
> (004) jeq      #0x6             jt 6    jf 5
> (005) jeq      #0x11            jt 6    jf 19
> (006) ldh      [54]
> (007) jeq      #0x7ee           jt 18   jf 19
> (008) jeq      #0x800           jt 9    jf 19
> (009) ldb      [23]
> (010) jeq      #0x84            jt 13   jf 11
> (011) jeq      #0x6             jt 13   jf 12
> (012) jeq      #0x11            jt 13   jf 19
> (013) ldh      [20]
> (014) jset     #0x1fff          jt 19   jf 15
> (015) ldxb     4*([14]&0xf)
> (016) ldh      [x + 14]
> (017) jeq      #0x7ee           jt 18   jf 19
> (018) ret      #96
> (019) ret      #0
>
> See how it handles both IPv4 and IPv6, and various protocols
> automatically ?
>
> If I only wanted "udp and src port 2030" it would give :
>
> (000) ldh      [12]
> (001) jeq      #0x86dd          jt 2    jf 6
> (002) ldb      [20]
> (003) jeq      #0x11            jt 4    jf 15
> (004) ldh      [54]
> (005) jeq      #0x7ee           jt 14   jf 15
> (006) jeq      #0x800           jt 7    jf 15
> (007) ldb      [23]
> (008) jeq      #0x11            jt 9    jf 15
> (009) ldh      [20]
> (010) jset     #0x1fff          jt 15   jf 11
> (011) ldxb     4*([14]&0xf)
> (012) ldh      [x + 14]
> (013) jeq      #0x7ee           jt 14   jf 15
> (014) ret      #96
> (015) ret      #0
>
>
>
> So when I want "tcpdump vlan 100" it generates :
>
> (000) ldh      [12]
> (001) jeq      #0x8100          jt 2    jf 6
> (002) ldh      [14]
> (003) and      #0xfff
> (004) jeq      #0x64            jt 5    jf 6
> (005) ret      #96
> (006) ret      #0
>
> What's wrong with instructing libpcap to extend the filter so that it
> gets the correct result whether the vlan id is in skb->vlan_tci (vlan
> accel on) or in the packet itself (vlan accel off)?
>
> This way, you could choose whether you want to get only accelerated vlan,
> or non accelerated vlan, or both. And you need no kernel hacking.
>
>
>


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09  6:34     ` Ani Sinha
@ 2013-01-09 19:27       ` Ani Sinha
  2013-01-09 19:51         ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Ani Sinha @ 2013-01-09 19:27 UTC (permalink / raw)
  To: Eric Dumazet, tcpdump-workers; +Cc: netdev, dborkman, Jiri Pirko, edumazet

>> That's irrelevant. This only shows that user land was depending on a
>> prior undocumented behavior.

Why do you say that? The following patch from Pirko ensured that on
both RX and TX, regardless of whether the driver/hw supports vlan
acceleration, the outermost vlan tag is always extracted from the
packet and put in the skb aux data:

commit bcc6d47903612c3861201cc3a866fb604f26b8b2
Author: Jiri Pirko <jpirko@redhat.com>
Date:   Thu Apr 7 19:48:33 2011 +0000

    net: vlan: make non-hw-accel rx path similar to hw-accel

Now this meant that the filter code should always look into the aux
data for vlan tagging, not in the packet, regardless of hw
acceleration availability. Your patch
b40863c667c16b7a73d4f034a8eab67029b5b15a broke these symmetric
semantics - now on TX on the network tap, we do not have the vlan tags
in the skb aux data.

In my opinion, for a given kernel, the filter code should look for vlan
tags either at the packet offset or in the packet aux data, but not
both. Otherwise the filter code becomes incredibly complex, since an
inline vlan tag in the packet changes the offsets of all headers coming
afterwards, and I don't even know if filter code can be correctly
generated in this case. The tcpdump-workers folks are CC'd here and they
clearly have more experience with libpcap filter code than I do. Hence
I leave it up to them to provide input here.
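To make the offset problem concrete: each inline 802.1Q tag adds 4 bytes in front of the EtherType, so every fixed offset in a filter (for example the IPv4 protocol byte at [23]) moves by 4 per tag. A small Python sketch, with a hypothetical helper l3_offset that only recognises TPID 0x8100:

```python
VLAN_HLEN = 4  # TPID (2 bytes) + TCI (2 bytes)

def l3_offset(frame):
    """Byte offset of the network-layer header, skipping inline 802.1Q tags."""
    off = 12  # the EtherType (or first TPID) follows the two MAC addresses
    while int.from_bytes(frame[off:off + 2], "big") == 0x8100:
        off += VLAN_HLEN
    return off + 2  # step over the final EtherType

macs = bytes.fromhex("020202020202010101010101")
plain = macs + bytes.fromhex("0800") + bytes(20)
one_tag = macs + bytes.fromhex("81000064" "0800") + bytes(20)
two_tags = macs + bytes.fromhex("81000064" "81000065" "0800") + bytes(20)

print(l3_offset(plain), l3_offset(one_tag), l3_offset(two_tags))  # 14 18 22
```

A classic BPF program uses absolute loads at fixed offsets, so supporting both layouts means emitting one branch per possible offset, which is the complexity being discussed here.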

>> What's wrong with instructing libpcap to extend the filter so that it
>> gets the correct result whether the vlan id is in skb->vlan_tci (vlan
>> accel on) or in the packet itself (vlan accel off)?
>>
>> This way, you could choose whether you want to get only accelerated vlan,
>> or non accelerated vlan, or both. And you need no kernel hacking.

This is wrong. Accelerated or not, the kernel code was organized to
have the tags in the packet aux data. So I think this is how user land
should be coded as well.

ani
_______________________________________________
tcpdump-workers mailing list
tcpdump-workers@lists.tcpdump.org
https://lists.sandelman.ca/mailman/listinfo/tcpdump-workers


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09 19:27       ` Ani Sinha
@ 2013-01-09 19:51         ` Eric Dumazet
  2013-01-09 20:01           ` Ani Sinha
  2013-01-11  1:47           ` [tcpdump-workers] " Michael Richardson
  0 siblings, 2 replies; 18+ messages in thread
From: Eric Dumazet @ 2013-01-09 19:51 UTC (permalink / raw)
  To: Ani Sinha
  Cc: tcpdump-workers, Paul Pearce, netdev, dborkman, edumazet, Jiri Pirko

On Wed, 2013-01-09 at 11:27 -0800, Ani Sinha wrote:

> This is wrong. Accelerated or not, the kernel code was organized to
> have the tags in the packet aux data. So I think this is how user land
> should be coded as well.

You have your opinion, that's good.

My opinion as a kernel developer is that the network tap is here to have
a copy of the exact frame given to the _device_.

Because in the end, users will complain to netdev, giving us tcpdump
traces. And if these traces have nothing to do with what is given to the
device, they are almost useless.

If you want other taps, and catch frames before/after various netfilter
hooks, segmentations, vlan accel, tunnels, or before GRO layer, that's a
totally different request.

A packet can be modified by a lot of layers in the kernel.

And yes, BPF filters can be incredibly complex, but it appears kernel is
not a piece of cake.


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09 19:51         ` Eric Dumazet
@ 2013-01-09 20:01           ` Ani Sinha
  2013-01-09 20:06             ` Ani Sinha
  2013-01-11  1:47           ` [tcpdump-workers] " Michael Richardson
  1 sibling, 1 reply; 18+ messages in thread
From: Ani Sinha @ 2013-01-09 20:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: tcpdump-workers, Paul Pearce, netdev, dborkman, edumazet, Jiri Pirko

On Wed, Jan 9, 2013 at 11:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2013-01-09 at 11:27 -0800, Ani Sinha wrote:
>
>> This is wrong. Accelerated or not, the kernel code was organized to
>> have the tags in the packet aux data. So I think this is how user land
>> should be coded as well.
>
> You have your opinion, that's good.
>
> My opinion as a kernel developer is that the network tap is here to have
> a copy of the exact frame given to the _device_.
>

It is fine by me if that is how you see it. In that case, the
behaviour can be made symmetric on both TX and RX. Tap processing in
__netif_receive_skb() can be done before vlan_untag() so that taps see
the exact frame received from the _device_ as you put it.

ani


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09 20:01           ` Ani Sinha
@ 2013-01-09 20:06             ` Ani Sinha
  0 siblings, 0 replies; 18+ messages in thread
From: Ani Sinha @ 2013-01-09 20:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: tcpdump-workers, Paul Pearce, netdev, dborkman, edumazet, Jiri Pirko

On Wed, Jan 9, 2013 at 12:01 PM, Ani Sinha <ani@aristanetworks.com> wrote:
> On Wed, Jan 9, 2013 at 11:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2013-01-09 at 11:27 -0800, Ani Sinha wrote:
>>
>>> This is wrong. Accelerated or not, the kernel code was organized to
>>> have the tags in the packet aux data. So I think this is how user land
>>> should be coded as well.
>>
>> You have your opinion, that's good.
>>
>> My opinion as a kernel developer is that the network tap is here to have
>> a copy of the exact frame given to the _device_.
>>
>
> It is fine by me if that is how you see it. In that case, the
> behaviour can be made symmetric on both TX and RX. Tap processing in
> __netif_receive_skb() can be done before vlan_untag() so that taps see
> the exact frame received from the _device_ as you put it.

Although for accelerated vlan tags, the tag will be in the meta data
anyway. All I am asking is: let's have the same behaviour on both TX
and RX. If the tag is in the packet, let's have it that way in both
directions in what the tap sees.


* Re: [tcpdump-workers] [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-09 19:51         ` Eric Dumazet
  2013-01-09 20:01           ` Ani Sinha
@ 2013-01-11  1:47           ` Michael Richardson
  2013-01-11  2:37             ` Paul Pearce
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Richardson @ 2013-01-11  1:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ani Sinha, Jiri Pirko, netdev, edumazet, tcpdump-workers, dborkman



>>>>> "Eric" == Eric Dumazet <eric.dumazet@gmail.com> writes:
    Eric> On Wed, 2013-01-09 at 11:27 -0800, Ani Sinha wrote:

    >> This is wrong. Accelerated or not, the kernel code was organized to
    >> have the tags in the packet aux data. So I think this is how user land
    >> should be coded as well.

    Eric> You have your opinion, that's good.

    Eric> My opinion as a kernel developer is that the network tap is here to have
    Eric> a copy of the exact frame given to the _device_.

Good: as someone who spends lots of time with tcpdump doing both network
and protocol diagnostics, it's really important to see exactly what is there.
If that means turning off some hardware offload in order to get the
intact 1p header, then that may be fine for many situations.
(At 10G, on a live router... well...)

The problem is that now we need to know, on a per device basis (based
upon the current configuration) if the VLAN tag was removed by the
hardware or not.  It's not enough to try with vlan tag and not.

    Eric> If you want other taps, and catch frames before/after various netfilter
    Eric> hooks, segmentations, vlan accel, tunnels, or before GRO layer, that's a
    Eric> totally different request.

Yes!!!! We need all of these tap points too... 

-- 
]               Never tell me the odds!                 | ipv6 mesh networks [ 
]   Michael Richardson, Sandelman Software Works        | network architect  [ 
]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [ 
	





* Re: [tcpdump-workers] [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-11  1:47           ` [tcpdump-workers] " Michael Richardson
@ 2013-01-11  2:37             ` Paul Pearce
  2013-01-11  8:46               ` Daniel Borkmann
  2013-02-15  8:17               ` Eric W. Biederman
  0 siblings, 2 replies; 18+ messages in thread
From: Paul Pearce @ 2013-01-11  2:37 UTC (permalink / raw)
  To: Michael Richardson
  Cc: Eric Dumazet, Ani Sinha, Jiri Pirko, netdev, edumazet,
	tcpdump-workers, dborkman

>> My opinion as a kernel developer is that the network tap is here to have
>> a copy of the exact frame given to the _device_.

> Good: as someone who spends lots of time with tcpdump doing both network
> and protocol diagnostics, it's really important to see exactly there.
> If that means turning off some hardware offload in order to get the
> intact 1p header, then that may be fine for many situations.
> (At 10G, on a live router... well...)

I agree as well.

But I think Ani's point was that for RX packets, as of commit
bcc6d47903612c3861201cc3a866fb604f26b8b2, the filters are not
getting exactly what's "on the wire." Independent of hardware
acceleration the vlan headers are being stripped off and skb->vlan_tci
is being set. That was the origin of this whole mess.

The msg from that commit reads in part:
> Vlan untagging happens early in __netif_receive_skb so the rest of
> code (ptype_all handlers, rx_handlers) see the skb like it was
> untagged by hw.

His confusion (which I share) is why it's acceptable to have this
behavior of removing headers and setting skb->vlan_tci (regardless of
hardware acceleration) on the RX path but not also set skb->vlan_tci
on the TX path.

Independent of the proposed userspace or PACKET_AUXDATA solutions,
clarification on the RX skb->vlan_tci behavior would be appreciated.

My knowledge of this code is quite limited so it's entirely possible
I'm off base here. If so please tell me.
-Paul


* Re: [tcpdump-workers] [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-11  2:37             ` Paul Pearce
@ 2013-01-11  8:46               ` Daniel Borkmann
  2013-02-15  8:17               ` Eric W. Biederman
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2013-01-11  8:46 UTC (permalink / raw)
  To: Paul Pearce
  Cc: Michael Richardson, Eric Dumazet, Ani Sinha, Jiri Pirko, netdev,
	edumazet, tcpdump-workers

On 01/11/2013 03:37 AM, Paul Pearce wrote:
>>> My opinion as a kernel developer is that the network tap is here to have
>>> a copy of the exact frame given to the _device_.

Agreed.

>> Good: as someone who spends lots of time with tcpdump doing both network
>> and protocol diagnostics, it's really important to see exactly there.
>> If that means turning off some hardware offload in order to get the
>> intact 1p header, then that may be fine for many situations.
>> (At 10G, on a live router... well...)
>
> I agree as well.
>
> But I think Ani's point was that for RX packets, as of commit
> bcc6d47903612c3861201cc3a866fb604f26b8b2, the filters are not
> getting exactly what's "on the wire." Independent of hardware
> acceleration the vlan headers are being stripped off and skb->vlan_tci
> is being set. That was the origin of this whole mess.
>
> The msg from that commit reads in part:
>> Vlan untagging happens early in __netif_receive_skb so the rest of
>> code (ptype_all handlers, rx_handlers) see the skb like it was
>> untagged by hw.
>
> His confusion (which I share) is why it's acceptable to have this
> behavior of removing headers and setting skb->vlan_tci (regardless of
> hardware acceleration) on the RX path but not also set skb->vlan_tci
> on the TX path.
>
> Independent of the proposed userspace or PACKET_AUXDATA solutions,
> clarification on the RX skb->vlan_tci behavior would be appreciated.
>
> My knowledge of this code is quite limited so it's entirely possible
> I'm off base here. If so please tell me.

While we're on the topic: the following is slightly unrelated to this
particular problem, but related to capturing VLANs and ``what's seen on
the wire'', since that was mentioned.

For different NICs/drivers you might get different default behaviours, and
mostly it's the case (I assume, correct me if I'm wrong) that libpcap has
to ``un-untag'' the VLAN headers in user space, doing a memmove(3) for each
stripped VLAN packet in order to ``fix'' this.

Because of this hack, I recently got a report from a user who saw a
QinQ header in Wireshark, although there should have been just one
VLAN encapsulation (AR8131 driver with ethtool -K eth0 rxvlan off), as he
saw with netsniff-ng (no memmove(3) done there). (I didn't further follow
or verify this report though.)
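The user-space reinsertion described above amounts to opening a 4-byte gap after the two MAC addresses and writing TPID and TCI into it. A minimal Python sketch follows; reinsert_vlan_tag is a made-up name, and this is a model of the idea rather than libpcap's actual code.

```python
import struct

def reinsert_vlan_tag(frame, tci, tpid=0x8100):
    """Rebuild an inline 802.1Q tag from a TCI that was delivered out of band."""
    # Bytes 0-11 are the destination and source MACs; the tag goes right after.
    return frame[:12] + struct.pack("!HH", tpid, tci) + frame[12:]

macs = bytes.fromhex("020202020202010101010101")
stripped = macs + bytes.fromhex("0806")
tagged = reinsert_vlan_tag(stripped, 99)
print(tagged[12:16].hex())  # 81000063

# If the frame still carries its tag and the same TCI is also reported
# out of band, reinserting blindly produces two stacked tags:
double = reinsert_vlan_tag(tagged, 99)
print(double[12:20].hex())  # 8100006381000063
```

The second case is one way a spurious double tag could show up in a dissector. Whether that is what happened in the report above is not established in this thread.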


* Re: [tcpdump-workers] [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-11  2:37             ` Paul Pearce
  2013-01-11  8:46               ` Daniel Borkmann
@ 2013-02-15  8:17               ` Eric W. Biederman
  1 sibling, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2013-02-15  8:17 UTC (permalink / raw)
  To: Paul Pearce
  Cc: Michael Richardson, Eric Dumazet, Ani Sinha, Jiri Pirko, netdev,
	edumazet, tcpdump-workers, dborkman

Paul Pearce <pearce@cs.berkeley.edu> writes:

>>> My opinion as a kernel developer is that the network tap is here to have
>>> a copy of the exact frame given to the _device_.
>
>> Good: as someone who spends lots of time with tcpdump doing both network
>> and protocol diagnostics, it's really important to see exactly there.
>> If that means turning off some hardware offload in order to get the
>> intact 1p header, then that may be fine for many situations.
>> (At 10G, on a live router... well...)
>
> I agree as well.
>
> But I think Ani's point was that for RX packets, as of commit
> bcc6d47903612c3861201cc3a866fb604f26b8b2, the filters are not
> getting exactly what's "on the wire." Independent of hardware
> acceleration the vlan headers are being stripped off and skb->vlan_tci
> is being set. That was the origin of this whole mess.

The mess goes back much farther than that.  That commit just flushed
a lot of the mess out into the open, and made it apparent the kernel
had insufficient facilities for dealing with packets whose vlan
tags had been stripped and that libpcap had not been handling stripped
vlan tags.

> The msg from that commit reads in part:
>> Vlan untagging happens early in __netif_receive_skb so the rest of
>> code (ptype_all handlers, rx_handlers) see the skb like it was
>> untagged by hw.
>
> His confusion (which I share) is why it's acceptable to have this
> behavior of removing headers and setting skb->vlan_tci (regardless of
> hardware acceleration) on the RX path but not also set skb->vlan_tci
> on the TX path.

On all paths the kernel will now set a flag VLAN_TAG_PRESENT if the
vlan_tci is stripped off and used.  So there is no pressing need for a
kernel change.  recvmsg and BPF filters have all of the information they
need to figure out what is going on.  So at this point this is a libpcap
problem not a kernel problem.

On the RX path always stripping the header allowed the vlan processing
code to be simplified and some bugs to be fixed.

Reading through the code a bit more, it looks like stripping the
vlan headers on TX when the network device does not support vlan header
acceleration is a performance loss.  There are other cases besides
AF_PACKET, in particular vlan_dev_hard_header, that will insert the vlan
header on a packet before the packet is transmitted.

> Independent of the proposed userspace or PACKET_AUXDATA solutions,
> clarification on the RX skb->vlan_tci behavior would be appreciated.

There are two variables now available in AUXDATA and in the BPF filters
for packets.  VLAN_TAG_PRESENT and VLAN_TAG.

Packets that have their vlan tags stripped have VLAN_TAG_PRESENT set
and the tag is available in VLAN_TAG.
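As a rough illustration of how a filter could use these two variables, the Python sketch below assembles a classic BPF program (as packed struct sock_filter records) that accepts a packet only when a stripped tag is present and its VLAN ID matches. Assumptions to check against the installed linux/filter.h: the ancillary names SKF_AD_VLAN_TAG and SKF_AD_VLAN_TAG_PRESENT and their values (44 and 48) are quoted from memory, and the helper names ins and vlan_filter are invented here. The sketch only builds the bytes; it does not attach them to a socket.

```python
import struct

# Ancillary-data offsets (assumed values, see the note above).
SKF_AD_OFF = -0x1000
SKF_AD_VLAN_TAG = 44
SKF_AD_VLAN_TAG_PRESENT = 48

# Classic BPF opcodes.
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_AND_K = 0x54     # BPF_ALU | BPF_AND | BPF_K
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

def ins(code, jt, jf, k):
    """Pack one struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }."""
    return struct.pack("HBBI", code, jt, jf, k & 0xFFFFFFFF)

def vlan_filter(vlan_id, snaplen=96):
    """Accept only packets whose stripped VLAN tag has the given ID."""
    prog = [
        ins(BPF_LD_W_ABS, 0, 0, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT),
        ins(BPF_JEQ_K, 4, 0, 0),        # not present: jump to "ret #0"
        ins(BPF_LD_W_ABS, 0, 0, SKF_AD_OFF + SKF_AD_VLAN_TAG),
        ins(BPF_AND_K, 0, 0, 0x0FFF),   # keep the 12-bit VLAN ID
        ins(BPF_JEQ_K, 0, 1, vlan_id),  # mismatch: skip to "ret #0"
        ins(BPF_RET_K, 0, 0, snaplen),  # accept
        ins(BPF_RET_K, 0, 0, 0),        # drop
    ]
    return b"".join(prog)

prog = vlan_filter(100)
print(len(prog) // 8, "instructions")  # 7 instructions
```

A filter that also has to match tags left inline would need an additional branch that loads the halfword at [12] and [14], as in the "tcpdump vlan 100" listing earlier in the thread.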

> My knowledge of this code is quite limited so it's entirely possible
> I'm off base here. If so please tell me.

Eric


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-08 20:22     ` Jiri Pirko
@ 2013-01-08 20:42       ` Eric Dumazet
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2013-01-08 20:42 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Daniel Borkmann, David Miller, netdev, Ani Sinha, Jiri Pirko

On Tue, 2013-01-08 at 21:22 +0100, Jiri Pirko wrote:

> 
> The issue is that, for example, in af_packet the function packet_rcv()
> expects vlan_tci to be filled out as that is on RX path ensured by
> __netif_receive_skb(). However on TX path, dev_queue_xmit_nit() is
> called with vlan_tci cleaned in case the device does not have TX vlan
> accel enabled. This patch is trying to fix this difference.

I perfectly understood that, and I repeat :

I want to see the difference.

A filter cannot expect skb->vlan_tci is set for all packets.

If a device doesn't have TX vlan accel, vlan_tci is 0.

packet sniffing is already slow, we don't want to force an extra copy
with vlan_untag() killer.

If you want to fix/extend af_packet, do this in af_packet.


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-08 20:04   ` Eric Dumazet
@ 2013-01-08 20:22     ` Jiri Pirko
  2013-01-08 20:42       ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Jiri Pirko @ 2013-01-08 20:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Daniel Borkmann, David Miller, netdev, Ani Sinha, Jiri Pirko

Tue, Jan 08, 2013 at 09:04:52PM CET, eric.dumazet@gmail.com wrote:
>On Tue, 2013-01-08 at 19:51 +0100, Daniel Borkmann wrote:
>> VLAN packets that are locally injected through taps will lose their
>> skb->vlan_tci value when they pass dev_hard_start_xmit and get looped
>> back to a packet sniffer via dev_queue_xmit_nit. Besides others, this
>> meta data is used in Linux socket filtering for VLANs. Tested with a
>> VLAN ancillary ops filter.
>> 
>> Patch is based on a previous version by Jiri Pirko.
>> 
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
>> Cc: Ani Sinha <ani@aristanetworks.com>
>> Cc: Jiri Pirko <jpirko@redhat.com>
>> Reported-by: Paul Pearce <pearce@cs.berkeley.edu>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>> ---
>>  net/core/dev.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>> 
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 515473e..723dcd0 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -1775,6 +1775,19 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
>>  	struct packet_type *ptype;
>>  	struct sk_buff *skb2 = NULL;
>>  	struct packet_type *pt_prev = NULL;
>> +	struct ethhdr *ehdr;
>> +
>> +	/* Network taps could make use of skb->vlan_tci, which got wiped
>> +	 * out. Hence, we need to reset it correctly.
>> +	 */
>> +	skb_reset_mac_header(skb);
>> +	ehdr = eth_hdr(skb);
>> +
>> +	if (ehdr->h_proto == __constant_htons(ETH_P_8021Q)) {
>> +		skb2 = vlan_untag(skb);
>> +		if (likely(skb2))
>> +			skb = skb2;
>> +	}
>>  
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>
>This patch is wrong (it adds a leak), and not needed.
>
>If a packet has no vlan_tci, it's for a good reason.
>
>We want the sniffer to see the packet content as is.


The issue is that, for example, in af_packet the function packet_rcv()
expects vlan_tci to be filled out as that is on RX path ensured by
__netif_receive_skb(). However on TX path, dev_queue_xmit_nit() is
called with vlan_tci cleaned in case the device does not have TX vlan
accel enabled. This patch is trying to fix this difference.



* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-08 18:51 ` [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value Daniel Borkmann
  2013-01-08 19:54   ` Ani Sinha
  2013-01-08 20:04   ` Eric Dumazet
@ 2013-01-08 20:14   ` Jiri Pirko
  2 siblings, 0 replies; 18+ messages in thread
From: Jiri Pirko @ 2013-01-08 20:14 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David Miller, netdev, Eric Dumazet, Ani Sinha, Jiri Pirko

Tue, Jan 08, 2013 at 07:51:32PM CET, dborkman@redhat.com wrote:
>VLAN packets that are locally injected through taps will lose their
>skb->vlan_tci value when they pass dev_hard_start_xmit and get looped
>back to a packet sniffer via dev_queue_xmit_nit. Besides others, this
>meta data is used in Linux socket filtering for VLANs. Tested with a
>VLAN ancillary ops filter.
>
>Patch is based on a previous version by Jiri Pirko.
>
>Cc: Eric Dumazet <eric.dumazet@gmail.com>
>Cc: Ani Sinha <ani@aristanetworks.com>
>Cc: Jiri Pirko <jpirko@redhat.com>
>Reported-by: Paul Pearce <pearce@cs.berkeley.edu>
>Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>---
> net/core/dev.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 515473e..723dcd0 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -1775,6 +1775,19 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
> 	struct packet_type *ptype;
> 	struct sk_buff *skb2 = NULL;
> 	struct packet_type *pt_prev = NULL;
>+	struct ethhdr *ehdr;
>+
>+	/* Network taps could make use of skb->vlan_tci, which got wiped
>+	 * out. Hence, we need to reset it correctly.
>+	 */
>+	skb_reset_mac_header(skb);
>+	ehdr = eth_hdr(skb);
>+
>+	if (ehdr->h_proto == __constant_htons(ETH_P_8021Q)) {
>+		skb2 = vlan_untag(skb);
>+		if (likely(skb2))
>+			skb = skb2;
>+	}

	Hmm, nitpick: I think it would be better to do:
		skb = vlan_untag(skb);
		if (unlikely(!skb))
			return;
	
	I believe it is better to deliver skbs in a consistent way and
	not to deliver at all in case of -ENOMEM.


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-08 18:51 ` [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value Daniel Borkmann
  2013-01-08 19:54   ` Ani Sinha
@ 2013-01-08 20:04   ` Eric Dumazet
  2013-01-08 20:22     ` Jiri Pirko
  2013-01-08 20:14   ` Jiri Pirko
  2 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2013-01-08 20:04 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David Miller, netdev, Ani Sinha, Jiri Pirko

On Tue, 2013-01-08 at 19:51 +0100, Daniel Borkmann wrote:
> VLAN packets that are locally injected through taps will lose their
> skb->vlan_tci value when they pass dev_hard_start_xmit and get looped
> back to a packet sniffer via dev_queue_xmit_nit. Besides others, this
> meta data is used in Linux socket filtering for VLANs. Tested with a
> VLAN ancillary ops filter.
> 
> Patch is based on a previous version by Jiri Pirko.
> 
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Ani Sinha <ani@aristanetworks.com>
> Cc: Jiri Pirko <jpirko@redhat.com>
> Reported-by: Paul Pearce <pearce@cs.berkeley.edu>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
>  net/core/dev.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 515473e..723dcd0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1775,6 +1775,19 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
>  	struct packet_type *ptype;
>  	struct sk_buff *skb2 = NULL;
>  	struct packet_type *pt_prev = NULL;
> +	struct ethhdr *ehdr;
> +
> +	/* Network taps could make use of skb->vlan_tci, which got wiped
> +	 * out. Hence, we need to reset it correctly.
> +	 */
> +	skb_reset_mac_header(skb);
> +	ehdr = eth_hdr(skb);
> +
> +	if (ehdr->h_proto == __constant_htons(ETH_P_8021Q)) {
> +		skb2 = vlan_untag(skb);
> +		if (likely(skb2))
> +			skb = skb2;
> +	}
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(ptype, &ptype_all, list) {

This patch is wrong (it adds a leak), and not needed.

If a packet has no vlan_tci, it's for a good reason.

We want the sniffer to see the packet content as is.


* Re: [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-08 18:51 ` [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value Daniel Borkmann
@ 2013-01-08 19:54   ` Ani Sinha
  2013-01-08 20:04   ` Eric Dumazet
  2013-01-08 20:14   ` Jiri Pirko
  2 siblings, 0 replies; 18+ messages in thread
From: Ani Sinha @ 2013-01-08 19:54 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David Miller, netdev, Eric Dumazet, Jiri Pirko

Agreed with the fix. This fixes the issue introduced by this change :l


On Tue, Jan 8, 2013 at 10:51 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> VLAN packets that are locally injected through taps will lose their
> skb->vlan_tci value when they pass dev_hard_start_xmit and get looped
> back to a packet sniffer via dev_queue_xmit_nit. Besides others, this
> meta data is used in Linux socket filtering for VLANs. Tested with a
> VLAN ancillary ops filter.
>
> Patch is based on a previous version by Jiri Pirko.
>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Ani Sinha <ani@aristanetworks.com>
> Cc: Jiri Pirko <jpirko@redhat.com>
> Reported-by: Paul Pearce <pearce@cs.berkeley.edu>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Acked-by: Ani Sinha <ani@aristanetworks.com>


* [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value
  2013-01-08 18:51 [PATCH net 0/2] net: dev_queue_xmit_nit fixes Daniel Borkmann
@ 2013-01-08 18:51 ` Daniel Borkmann
  2013-01-08 19:54   ` Ani Sinha
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Daniel Borkmann @ 2013-01-08 18:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Daniel Borkmann, Eric Dumazet, Ani Sinha, Jiri Pirko

VLAN packets that are locally injected through taps will lose their
skb->vlan_tci value when they pass dev_hard_start_xmit and get looped
back to a packet sniffer via dev_queue_xmit_nit. Among other things, this
metadata is used in Linux socket filtering for VLANs. Tested with a
VLAN ancillary ops filter.

Patch is based on a previous version by Jiri Pirko.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ani Sinha <ani@aristanetworks.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Reported-by: Paul Pearce <pearce@cs.berkeley.edu>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/core/dev.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 515473e..723dcd0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1775,6 +1775,19 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 	struct packet_type *ptype;
 	struct sk_buff *skb2 = NULL;
 	struct packet_type *pt_prev = NULL;
+	struct ethhdr *ehdr;
+
+	/* Network taps could make use of skb->vlan_tci, which got wiped
+	 * out. Hence, we need to reset it correctly.
+	 */
+	skb_reset_mac_header(skb);
+	ehdr = eth_hdr(skb);
+
+	if (ehdr->h_proto == __constant_htons(ETH_P_8021Q)) {
+		skb2 = vlan_untag(skb);
+		if (likely(skb2))
+			skb = skb2;
+	}
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-- 
1.7.11.7

