netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Bad checksum on bridge with IP options
@ 2014-05-11 14:41 David Newall
  2014-05-11 19:42 ` Lukas Tribus
  2014-05-12 13:23 ` David Newall
  0 siblings, 2 replies; 46+ messages in thread
From: David Newall @ 2014-05-11 14:41 UTC (permalink / raw)
  To: Netdev

I've been chasing a ping problem with record-route set, and it looks 
like a bug.  The problem also occurs with timestamp option set. 
Everything works find when using just the nic, or bonded nics, but 
breaks when I use a bridge.  This is 100% repeatable.

This fault has been observed on amd64 architecture running Ubuntu 13.10 
with various Canonical supplied kernels, and running Ubuntu 14.04 with 
Canonical supplied kernel 3.13.0-24-generic.

To demonstrate:

----8<---- INITIAL STATE OF INTERFACES ----8<----
root@konrad:~# ifconfig
lo        Link encap:Local Loopback
           inet addr:127.0.0.1  Mask:255.0.0.0
           inet6 addr: ::1/128 Scope:Host
           UP LOOPBACK RUNNING  MTU:65536  Metric:1
           RX packets:1464473 errors:0 dropped:0 overruns:0 frame:0
           TX packets:1464473 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:954752075 (954.7 MB)  TX bytes:954752075 (954.7 MB)

----8<----8<----  BRING UP eth0  ----8<----8<-----
root@konrad:~# ifconfig eth0 192.168.0.9

----8<----8<----8<-- IT WORKS -8<----8<----8<-----
root@konrad:~# ping -nR 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(124) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=3.21 ms
RR:     192.168.0.9
         192.168.0.1
         192.168.0.9

64 bytes from 192.168.0.1: icmp_seq=2 ttl=64 time=0.396 ms      (same route)
^C
--- 192.168.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.396/1.804/3.212/1.408 ms

----8<----8<---- BRING UP BRIDGE ----8<----8<-----
root@konrad:~# ifconfig eth0 0.0.0.0
root@konrad:~# brctl addbr br0
root@konrad:~# brctl addif br0 eth0
root@konrad:~# ifconfig br0 192.168.0.9

----8<----8<----8<-- BROKEN ---8<----8<----8<-----
root@konrad:~# ping -nR 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(124) bytes of data.
^C
--- 192.168.0.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1006ms

root@konrad:~# ping -nTtsonly 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(124) bytes of data.
^C
--- 192.168.0.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1006ms

----8<----8<----8<---- --- ----8<----8<----8<----


Capturing ICMP packets from the "any" interface with tcpdump provides a 
clue.  ICMP replies are being changed when forwarded from the physical 
NIC to the bridge interface.  When RR option is set, an extra address is 
appended to the recorded route (0.0.0.0).  When TS option is set, the 
last set time stamp is overwritten, probably with the preceding 
timestamp, and the (option) pointer is incremented by 4.

The following decoded ICMP reply packets reveal the changes

----8<----8<---- RECEIVED ON eth0 ---8<----8<----
Frame 3: 140 bytes on wire (1120 bits), 140 bytes captured (1120 bits)
     Encapsulation type: Linux cooked-mode capture (25)
     Arrival Time: May 11, 2014 23:06:25.953831000 CST
     [Time shift for this packet: 0.000000000 seconds]
     Epoch Time: 1399815385.953831000 seconds
     [Time delta from previous captured frame: 0.000436000 seconds]
     [Time delta from previous displayed frame: 0.000436000 seconds]
     [Time since reference or first frame: 0.000452000 seconds]
     Frame Number: 3
     Frame Length: 140 bytes (1120 bits)
     Capture Length: 140 bytes (1120 bits)
     [Frame is marked: False]
     [Frame is ignored: False]
     [Protocols in frame: sll:ip:icmp:data]
Linux cooked capture
     Packet type: Unicast to us (0)
     Link-layer address type: 1
     Link-layer address length: 6
     Source: c4:04:15:b4:84:84 (c4:04:15:b4:84:84)
     Protocol: IP (0x0800)
Internet Protocol Version 4, Src: 192.168.0.1 (192.168.0.1), Dst: 192.168.0.9 (192.168.0.9)
     Version: 4
     Header length: 60 bytes
     Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
         0000 00.. = Differentiated Services Codepoint: Default (0x00)
         .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
     Total Length: 124
     Identification: 0xfc40 (64576)
     Flags: 0x02 (Don't Fragment)
         0... .... = Reserved bit: Not set
         .1.. .... = Don't fragment: Set
         ..0. .... = More fragments: Not set
     Fragment offset: 0
     Time to live: 64
     Protocol: ICMP (1)
     Header checksum: 0x443d [correct]
         [Good: True]
         [Bad: False]
     Source: 192.168.0.1 (192.168.0.1)
     Destination: 192.168.0.9 (192.168.0.9)
     [Source GeoIP: Unknown]
     [Destination GeoIP: Unknown]
     Options: (40 bytes), Record Route, End of Options List (EOL)
         Record Route (39 bytes)
             Type: 7
                 0... .... = Copy on fragmentation: No
                 .00. .... = Class: Control (0)
                 ...0 0111 = Number: Record route (7)
             Length: 39
             Pointer: 12
             Recorded Route: 192.168.0.9 (192.168.0.9)
             Recorded Route: 192.168.0.1 (192.168.0.1)
             Empty Route: 0.0.0.0 <- (next)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
         End of Options List (EOL)
             Type: 0
                 0... .... = Copy on fragmentation: No
                 .00. .... = Class: Control (0)
                 ...0 0000 = Number: End of Option List (EOL) (0)
Internet Control Message Protocol
     Type: 0 (Echo (ping) reply)
     Code: 0
     Checksum: 0x66b0 [correct]
     Identifier (BE): 31519 (0x7b1f)
     Identifier (LE): 8059 (0x1f7b)
     Sequence number (BE): 1 (0x0001)
     Sequence number (LE): 256 (0x0100)
     [Request frame: 2]
     [Response time: 0.436 ms]
     Timestamp from icmp data: May 11, 2014 23:06:25.000000000 CST
     [Timestamp from icmp data (relative): 0.953831000 seconds]
     Data (48 bytes)

0000  08 8c 0e 00 00 00 00 00 10 11 12 13 14 15 16 17   ................
0010  18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27   ........ !"#$%&'
0020  28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37   ()*+,-./01234567
         Data: 088c0e0000000000101112131415161718191a1b1c1d1e1f...
         [Length: 48]

----8<----8<----  SENT TO BRIDGE ----8<----8<----
Frame 4: 140 bytes on wire (1120 bits), 140 bytes captured (1120 bits)
     Encapsulation type: Linux cooked-mode capture (25)
     Arrival Time: May 11, 2014 23:06:25.953831000 CST
     [Time shift for this packet: 0.000000000 seconds]
     Epoch Time: 1399815385.953831000 seconds
     [Time delta from previous captured frame: 0.000000000 seconds]
     [Time delta from previous displayed frame: 0.000000000 seconds]
     [Time since reference or first frame: 0.000452000 seconds]
     Frame Number: 4
     Frame Length: 140 bytes (1120 bits)
     Capture Length: 140 bytes (1120 bits)
     [Frame is marked: False]
     [Frame is ignored: False]
     [Protocols in frame: sll:ip:icmp:data]
Linux cooked capture
     Packet type: Unicast to us (0)
     Link-layer address type: 1
     Link-layer address length: 6
     Source: c4:04:15:b4:84:84 (c4:04:15:b4:84:84)
     Protocol: IP (0x0800)
Internet Protocol Version 4, Src: 192.168.0.1 (192.168.0.1), Dst: 192.168.0.9 (192.168.0.9)
     Version: 4
     Header length: 60 bytes
     Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
         0000 00.. = Differentiated Services Codepoint: Default (0x00)
         .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
     Total Length: 124
     Identification: 0xfc40 (64576)
     Flags: 0x02 (Don't Fragment)
         0... .... = Reserved bit: Not set
         .1.. .... = Don't fragment: Set
         ..0. .... = More fragments: Not set
     Fragment offset: 0
     Time to live: 64
     Protocol: ICMP (1)
     Header checksum: 0x443d [incorrect, should be 0x403d (may be caused by "IP checksum offload"?)]
         [Good: False]
         [Bad: True]
             [Expert Info (Error/Checksum): Bad checksum]
                 [Message: Bad checksum]
                 [Severity level: Error]
                 [Group: Checksum]
     Source: 192.168.0.1 (192.168.0.1)
     Destination: 192.168.0.9 (192.168.0.9)
     [Source GeoIP: Unknown]
     [Destination GeoIP: Unknown]
     Options: (40 bytes), Record Route, End of Options List (EOL)
         Record Route (39 bytes)
             Type: 7
                 0... .... = Copy on fragmentation: No
                 .00. .... = Class: Control (0)
                 ...0 0111 = Number: Record route (7)
             Length: 39
             Pointer: 16  ******************************************** CHANGED
             Recorded Route: 192.168.0.9 (192.168.0.9)
             Recorded Route: 192.168.0.1 (192.168.0.1)
             Recorded Route: 0.0.0.0 (0.0.0.0) *********************** CHANGED
             Empty Route: 0.0.0.0 <- (next)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
             Empty Route: 0.0.0.0 (0.0.0.0)
         End of Options List (EOL)
             Type: 0
                 0... .... = Copy on fragmentation: No
                 .00. .... = Class: Control (0)
                 ...0 0000 = Number: End of Option List (EOL) (0)
Internet Control Message Protocol
     Type: 0 (Echo (ping) reply)
     Code: 0
     Checksum: 0x66b0 [correct]
     Identifier (BE): 31519 (0x7b1f)
     Identifier (LE): 8059 (0x1f7b)
     Sequence number (BE): 1 (0x0001)
     Sequence number (LE): 256 (0x0100)
     Timestamp from icmp data: May 11, 2014 23:06:25.000000000 CST
     [Timestamp from icmp data (relative): 0.953831000 seconds]
     Data (48 bytes)

0000  08 8c 0e 00 00 00 00 00 10 11 12 13 14 15 16 17   ................
0010  18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27   ........ !"#$%&'
0020  28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37   ()*+,-./01234567
         Data: 088c0e0000000000101112131415161718191a1b1c1d1e1f...
         [Length: 48]

----8<----8<---- RECEIVED ON eth0 ---8<----8<----
Frame 11: 140 bytes on wire (1120 bits), 140 bytes captured (1120 bits)
     Encapsulation type: Linux cooked-mode capture (25)
     Arrival Time: May 11, 2014 23:06:35.889471000 CST
     [Time shift for this packet: 0.000000000 seconds]
     Epoch Time: 1399815395.889471000 seconds
     [Time delta from previous captured frame: 0.000428000 seconds]
     [Time delta from previous displayed frame: 0.000428000 seconds]
     [Time since reference or first frame: 9.936092000 seconds]
     Frame Number: 11
     Frame Length: 140 bytes (1120 bits)
     Capture Length: 140 bytes (1120 bits)
     [Frame is marked: False]
     [Frame is ignored: False]
     [Protocols in frame: sll:ip:icmp:data]
Linux cooked capture
     Packet type: Unicast to us (0)
     Link-layer address type: 1
     Link-layer address length: 6
     Source: c4:04:15:b4:84:84 (c4:04:15:b4:84:84)
     Protocol: IP (0x0800)
Internet Protocol Version 4, Src: 192.168.0.1 (192.168.0.1), Dst: 192.168.0.9 (192.168.0.9)
     Version: 4
     Header length: 60 bytes
     Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
         0000 00.. = Differentiated Services Codepoint: Default (0x00)
         .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
     Total Length: 124
     Identification: 0xfc50 (64592)
     Flags: 0x02 (Don't Fragment)
         0... .... = Reserved bit: Not set
         .1.. .... = Don't fragment: Set
         ..0. .... = More fragments: Not set
     Fragment offset: 0
     Time to live: 64
     Protocol: ICMP (1)
     Header checksum: 0xdc28 [correct]
         [Good: True]
         [Bad: False]
     Source: 192.168.0.1 (192.168.0.1)
     Destination: 192.168.0.9 (192.168.0.9)
     [Source GeoIP: Unknown]
     [Destination GeoIP: Unknown]
     Options: (40 bytes), Time Stamp
         Time Stamp (40 bytes)
             Type: 68
                 0... .... = Copy on fragmentation: No
                 .10. .... = Class: Debugging and measurement (2)
                 ...0 0100 = Number: Time stamp (4)
             Length: 40
             Pointer: 9
             Overflow: 0
             Flag: Time stamps only
             Time stamp = 48995889
             Time stamp = 518240
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
Internet Control Message Protocol
     Type: 0 (Echo (ping) reply)
     Code: 0
     Checksum: 0xacaa [correct]
     Identifier (BE): 31520 (0x7b20)
     Identifier (LE): 8315 (0x207b)
     Sequence number (BE): 1 (0x0001)
     Sequence number (LE): 256 (0x0100)
     [Request frame: 10]
     [Response time: 0.428 ms]
     Timestamp from icmp data: May 11, 2014 23:06:35.000000000 CST
     [Timestamp from icmp data (relative): 0.889471000 seconds]
     Data (48 bytes)

0000  b9 90 0d 00 00 00 00 00 10 11 12 13 14 15 16 17   ................
0010  18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27   ........ !"#$%&'
0020  28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37   ()*+,-./01234567
         Data: b9900d0000000000101112131415161718191a1b1c1d1e1f...
         [Length: 48]

----8<----8<----  SENT TO BRIDGE ----8<----8<----
Frame 12: 140 bytes on wire (1120 bits), 140 bytes captured (1120 bits)
     Encapsulation type: Linux cooked-mode capture (25)
     Arrival Time: May 11, 2014 23:06:35.889471000 CST
     [Time shift for this packet: 0.000000000 seconds]
     Epoch Time: 1399815395.889471000 seconds
     [Time delta from previous captured frame: 0.000000000 seconds]
     [Time delta from previous displayed frame: 0.000000000 seconds]
     [Time since reference or first frame: 9.936092000 seconds]
     Frame Number: 12
     Frame Length: 140 bytes (1120 bits)
     Capture Length: 140 bytes (1120 bits)
     [Frame is marked: False]
     [Frame is ignored: False]
     [Protocols in frame: sll:ip:icmp:data]
Linux cooked capture
     Packet type: Unicast to us (0)
     Link-layer address type: 1
     Link-layer address length: 6
     Source: c4:04:15:b4:84:84 (c4:04:15:b4:84:84)
     Protocol: IP (0x0800)
Internet Protocol Version 4, Src: 192.168.0.1 (192.168.0.1), Dst: 192.168.0.9 (192.168.0.9)
     Version: 4
     Header length: 60 bytes
     Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
         0000 00.. = Differentiated Services Codepoint: Default (0x00)
         .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
     Total Length: 124
     Identification: 0xfc50 (64592)
     Flags: 0x02 (Don't Fragment)
         0... .... = Reserved bit: Not set
         .1.. .... = Don't fragment: Set
         ..0. .... = More fragments: Not set
     Fragment offset: 0
     Time to live: 64
     Protocol: ICMP (1)
     Header checksum: 0xdc28 [incorrect, should be 0x1f74 (may be caused by "IP checksum offload"?)]
         [Good: False]
         [Bad: True]
             [Expert Info (Error/Checksum): Bad checksum]
                 [Message: Bad checksum]
                 [Severity level: Error]
                 [Group: Checksum]
     Source: 192.168.0.1 (192.168.0.1)
     Destination: 192.168.0.9 (192.168.0.9)
     [Source GeoIP: Unknown]
     [Destination GeoIP: Unknown]
     Options: (40 bytes), Time Stamp
         Time Stamp (40 bytes)
             Type: 68
                 0... .... = Copy on fragmentation: No
                 .10. .... = Class: Debugging and measurement (2)
                 ...0 0100 = Number: Time stamp (4)
             Length: 40
             Pointer: 13 ********************************************* CHANGED
             Overflow: 0
             Flag: Time stamps only
             Time stamp = 48995889
             Time stamp = 48995889 *********************************** CHANGED
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
             Time stamp = 0
Internet Control Message Protocol
     Type: 0 (Echo (ping) reply)
     Code: 0
     Checksum: 0xacaa [correct]
     Identifier (BE): 31520 (0x7b20)
     Identifier (LE): 8315 (0x207b)
     Sequence number (BE): 1 (0x0001)
     Sequence number (LE): 256 (0x0100)
     Timestamp from icmp data: May 11, 2014 23:06:35.000000000 CST
     [Timestamp from icmp data (relative): 0.889471000 seconds]
     Data (48 bytes)

0000  b9 90 0d 00 00 00 00 00 10 11 12 13 14 15 16 17   ................
0010  18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27   ........ !"#$%&'
0020  28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37   ()*+,-./01234567
         Data: b9900d0000000000101112131415161718191a1b1c1d1e1f...
         [Length: 48]

----8<----8<----8<---- --- ----8<----8<----8<----

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: Bad checksum on bridge with IP options
  2014-05-11 14:41 Bad checksum on bridge with IP options David Newall
@ 2014-05-11 19:42 ` Lukas Tribus
  2014-05-12  8:14   ` David Newall
  2014-05-12 13:23 ` David Newall
  1 sibling, 1 reply; 46+ messages in thread
From: Lukas Tribus @ 2014-05-11 19:42 UTC (permalink / raw)
  To: David Newall, Netdev

Hi,


> I've been chasing a ping problem with record-route set, and it looks
> like a bug. The problem also occurs with timestamp option set.
> Everything works find when using just the nic, or bonded nics, but
> breaks when I use a bridge. This is 100% repeatable.
>
> This fault has been observed on amd64 architecture running Ubuntu 13.10
> with various Canonical supplied kernels, and running Ubuntu 14.04 with
> Canonical supplied kernel 3.13.0-24-generic.

I think you will have to reproduce this on a vanilla kernel.org kernel
like 3.14.3 and 3.15-rc5 [1] ...


Lukas


[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/

 		 	   		  

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-11 19:42 ` Lukas Tribus
@ 2014-05-12  8:14   ` David Newall
  2014-05-12 10:15     ` Lukas Tribus
  0 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-12  8:14 UTC (permalink / raw)
  To: Lukas Tribus, Netdev

On 12/05/14 05:12, Lukas Tribus wrote:
> I think you will have to reproduce this on a vanilla kernel.org kernel
> like 3.14.3 and 3.15-rc5 [1] ...

Didn't know about those packages.  And to think of how much time I've 
wasted by downloading and compiling for myself...

Of course the bug does exist in a vanilla kernel, too, loaded from 
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-rc8-trusty/linux-image-3.14.0-031400rc8-generic_3.14.0-031400rc8.201403242335_amd64.deb.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: Bad checksum on bridge with IP options
  2014-05-12  8:14   ` David Newall
@ 2014-05-12 10:15     ` Lukas Tribus
  2014-05-12 10:25       ` David Newall
  0 siblings, 1 reply; 46+ messages in thread
From: Lukas Tribus @ 2014-05-12 10:15 UTC (permalink / raw)
  To: David Newall, Netdev

Hi,


> Didn't know about those packages. And to think of how much time I've
> wasted by downloading and compiling for myself...
>
> Of course the bug does exist in a vanilla kernel, too, loaded from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-rc8-trusty/[...]

3.15-rc5 is the kernel you should test.


Lukas

 		 	   		  

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-12 10:15     ` Lukas Tribus
@ 2014-05-12 10:25       ` David Newall
  2014-05-12 10:31         ` Lukas Tribus
  0 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-12 10:25 UTC (permalink / raw)
  To: Lukas Tribus, Netdev

On 12/05/14 19:45, Lukas Tribus wrote:
> 3.15-rc5 is the kernel you should test.

3.14-rc8 was the latest build with 14.04 configuration.  It's a vanilla 
kernel and I don't suppose it's so old that anyone would be unsure about 
a change which might have fixed the problem since then.

This is a bug.  Let's not stand on ceremony.  Let's fix it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: Bad checksum on bridge with IP options
  2014-05-12 10:25       ` David Newall
@ 2014-05-12 10:31         ` Lukas Tribus
  2014-05-12 10:48           ` David Newall
  0 siblings, 1 reply; 46+ messages in thread
From: Lukas Tribus @ 2014-05-12 10:31 UTC (permalink / raw)
  To: David Newall, Netdev

> 3.14-rc8 was the latest build with 14.04 configuration.

So what?



> I don't suppose it's so old that anyone would be unsure about
> a change which might have fixed the problem since then.

I disagree.



> This is a bug. Let's not stand on ceremony. Let's fix it.

So you are not willing to test latest kernels, but expect the
developers here to fix the bug?



Lukas 		 	   		  

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-12 10:31         ` Lukas Tribus
@ 2014-05-12 10:48           ` David Newall
  0 siblings, 0 replies; 46+ messages in thread
From: David Newall @ 2014-05-12 10:48 UTC (permalink / raw)
  To: Lukas Tribus, Netdev

On 12/05/14 20:01, Lukas Tribus wrote:
>> >I don't suppose it's so old that anyone would be unsure about
>> >a change which might have fixed the problem since then.
> I disagree.


Oh, excellent.  You must know about a very recent patch that would fix 
the problem.  Please, point me in the right direction.

>> >This is a bug. Let's not stand on ceremony. Let's fix it.
> So you are not willing to test latest kernels, but expect the
> developers here to fix the bug?

Yes, I do expect developers to look at the bug even though I haven't 
checked the absolutely, positively, hottest-off-the-press, latest 
kernel.  That's what developers of serious software do: they treat 
serious bug reports seriously and with respect.  They don't engage in 
bloody-mindedness.  They don't grasp for excuses to ignore problems.  
Not unless they don't care about what they develop, and I happen to know 
that Linux developers care very passionately about the software.

Developers aren't going to take my word that the problem exists, but 
will try to reproduce it for themselves.  The purpose of trying vanilla 
kernels is to pre-qualify bugs, so that they don't waste their time 
chasing phantoms.  I've done that.  I've pre-qualified it; there's a bug.

Do you have an actual reason to think that this bug, which clearly has 
been undiagnosed for quite some time, has been fixed.  Because that 
would justify telling me to try a newer kernel.  If you don't, you're 
just being officious.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-11 14:41 Bad checksum on bridge with IP options David Newall
  2014-05-11 19:42 ` Lukas Tribus
@ 2014-05-12 13:23 ` David Newall
  2014-05-12 13:51   ` Florian Westphal
  2014-05-12 18:54   ` Lukas Tribus
  1 sibling, 2 replies; 46+ messages in thread
From: David Newall @ 2014-05-12 13:23 UTC (permalink / raw)
  To: Netdev

I've got a patch which fixes the faulty checksums, and now ping works 
with RR or TS set.  I'm not quite sure, though, that it fixes the right 
thing.  I wonder if the problem is less that the checksum becomes wrong, 
and more that the route and timestamps ought not to be changed by the 
bridge interface.

Anyway, for discussion, here's the patch:

--- br_netfilter.c.orig	2014-05-12 22:10:59.809988125 +0930
+++ br_netfilter.c	2014-05-12 22:27:46.769299379 +0930
@@ -261,8 +261,10 @@
  static int br_parse_ip_options(struct sk_buff *skb)
  {
  	struct ip_options *opt;
-	const struct iphdr *iph;
+	struct iphdr *iph;
  	struct net_device *dev = skb->dev;
+	__sum16 oldsum;
+	int err;
  	u32 len;
  
  	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
@@ -298,8 +300,15 @@
  	if (iph->ihl == 5)
  		return 0;
  
+	oldsum = iph->check;
  	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
-	if (ip_options_compile(dev_net(dev), opt, skb))
+	err = ip_options_compile(dev_net(dev), opt, skb);
+	ip_send_check(iph);
+	if (iph->check != oldsum)
+		LIMIT_NETDEBUG(KERN_ERR
+			pr_fmt("br_parse_ip_options: bad sum %x replaced by %x\n"),
+			                       oldsum, iph->check);
+	if (err)
  		goto inhdr_error;
  
  	/* Check correct handling of SRR option */

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-12 13:23 ` David Newall
@ 2014-05-12 13:51   ` Florian Westphal
  2014-05-12 14:19     ` David Newall
  2014-05-12 18:54   ` Lukas Tribus
  1 sibling, 1 reply; 46+ messages in thread
From: Florian Westphal @ 2014-05-12 13:51 UTC (permalink / raw)
  To: David Newall; +Cc: Netdev

David Newall <davidn@davidnewall.com> wrote:
> I've got a patch which fixes the faulty checksums, and now ping
> works with RR or TS set.  I'm not quite sure, though, that it fixes
> the right thing.  I wonder if the problem is less that the checksum
> becomes wrong, and more that the route and timestamps ought not to
> be changed by the bridge interface.

Agree, bridge should not alter ip options.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-12 13:51   ` Florian Westphal
@ 2014-05-12 14:19     ` David Newall
  0 siblings, 0 replies; 46+ messages in thread
From: David Newall @ 2014-05-12 14:19 UTC (permalink / raw)
  To: Florian Westphal, Lennert Buytenhek, Bart De Schuymer; +Cc: Netdev

On 12/05/14 23:21, Florian Westphal wrote:
> Agree, bridge should not alter ip options.

It would be easy to remove the call to ip_options_compile instead of 
recalculating checksum after it, but I suspect there may be good reasons 
why this, too, would be wrong.  The source file is br_netfilter.c, 
suggesting that a change in options is needed in some situations.

In the situation that caught my attention, it obviously does it wrong 
(probably didn't add 0.0.0.0 to the route record, probably just 
incremented the pointer; and seriously damaged the timestamps as well as 
an incremented pointer without actually adding a value.)

I'm in a quandary.

Is it possible that bridge has exceeded it's mandate?  I can't find it 
now, but I saw a comment that it just copies packets unchanged. I think 
it's use now goes further than that would allow.

I welcome words of advice.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: Bad checksum on bridge with IP options
  2014-05-12 13:23 ` David Newall
  2014-05-12 13:51   ` Florian Westphal
@ 2014-05-12 18:54   ` Lukas Tribus
  2014-05-12 23:46     ` David Newall
  1 sibling, 1 reply; 46+ messages in thread
From: Lukas Tribus @ 2014-05-12 18:54 UTC (permalink / raw)
  To: David Newall, Eric Dumazet, Bandan Das; +Cc: Netdev, fw

> I've got a patch which fixes the faulty checksums, and now ping works
> with RR or TS set. I'm not quite sure, though, that it fixes the right
> thing. I wonder if the problem is less that the checksum becomes wrong,
> and more that the route and timestamps ought not to be changed by the
> bridge interface.

Looks like your testcase:
- works in 2.6.36 and older
- crashes starting with 2.6.37 (-rc1), likely due to Bandan's commit 462fb2af9788a82 (bridge : Sanitize skb before it enters the IP stack) [1]
- crash fix is in 2.6.38.4, likely due to Eric's commit f8e9881c2aef1e9 (bridge: reset IPCB in br_parse_ip_options) [2]
- doesn't work post-crashfix



[1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=462fb2af9788a82a534f8184abfde31574e1cfa0
[2] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f8e9881c2aef1e982e5abc25c046820cd0b7cf64
 		 	   		  

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-12 18:54   ` Lukas Tribus
@ 2014-05-12 23:46     ` David Newall
  2014-05-14 13:08       ` David Newall
  0 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-12 23:46 UTC (permalink / raw)
  To: Lukas Tribus, Eric Dumazet, Bandan Das; +Cc: Netdev, fw

On 13/05/14 04:24, Lukas Tribus wrote:
> Looks like your testcase:
> - works in 2.6.36 and older
> - crashes starting with 2.6.37 (-rc1), likely due to Bandan's commit 462fb2af9788a82 (bridge : Sanitize skb before it enters the IP stack) [1]
> - crash fix is in 2.6.38.4, likely due to Eric's commit f8e9881c2aef1e9 (bridge: reset IPCB in br_parse_ip_options) [2]
> - doesn't work post-crashfix
>
>
>
> [1]http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=462fb2af9788a82a534f8184abfde31574e1cfa0
> [2]http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f8e9881c2aef1e982e5abc25c046820cd0b7cf64
>   		 	   		
Thanks, Lukas, for researching those changes.  It explains why 
ip_options_compile() is being called.

Discussion for [1] starts at https://lkml.org/lkml/2010/8/30/391. 
Briefly, "if we recieve (sic) a packet greater in size than the 
bridge device MTU, we call ip_fragment which in turn will lead to 
icmp_send calling ip_options_echo if the DF flag is set."

Eric Dumazet said (in that discussion), "once again, the IP stack -> 
bridge -> IP stack flow bites us."  Such enduring insight. He also said, 
"we can correct every bug we find ... or just make bridge not touch IPCB."

Assuming now is not the time to stop bridge from touching IPCB, 
recalculating the checksum would seem appropriate, but insufficient as 
at least the RR and TS options aren't being set correctly; perhaps 
others.  I think calling ip_forward_options will fix that, and, 
conveniently, will also recalculate the checksum if the options changed.

I'm thinking that the following changes might do the trick (but haven't 
yet tested them; the complete kernel needs to be recompiled, and my 
machine is still grinding away):

--- br_netfilter.c.orig 2014-05-12 22:10:59.809988125 +0930
+++ br_netfilter.c      2014-05-13 08:08:48.330396347 +0930
@@ -312,6 +312,9 @@
                         goto drop;
         }

+       if (unlikely(opt->is_changed && opt->optlen))
+               ip_forward_options(skb);
+
         return 0;

  inhdr_error:
--- ../ipv4/ip_options.c.orig   2014-05-13 05:40:10.408914495 +0930
+++ ../ipv4/ip_options.c        2014-05-13 08:29:01.482130038 +0930
@@ -601,6 +601,7 @@
                 ip_send_check(ip_hdr(skb));
         }
  }
+EXPORT_SYMBOL(ip_forward_options);

  int ip_options_rcv_srr(struct sk_buff *skb)
  {

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Bad checksum on bridge with IP options
  2014-05-12 23:46     ` David Newall
@ 2014-05-14 13:08       ` David Newall
  2014-05-16 14:33         ` Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack) David Newall
  0 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-14 13:08 UTC (permalink / raw)
  To: Lukas Tribus, Eric Dumazet, Netdev; +Cc: fw

My thought of calling ip_forward_options to set RR and TS addresses (not 
to mention checksum) is not working out, because skb_rtable returns 
NULL.  I think I'm supposed to call skb_set_dst, but how
I get dst is eluding me.

Leaving broken values in the options, just recalculating the checksum, 
is starting to look very attractive right now, but I'd rather not.

How do I proceed?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-14 13:08       ` David Newall
@ 2014-05-16 14:33         ` David Newall
  2014-05-16 15:19           ` Eric Dumazet
                             ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: David Newall @ 2014-05-16 14:33 UTC (permalink / raw)
  To: Lukas Tribus, Eric Dumazet, Netdev; +Cc: fw

We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge 
: Sanitize skb before it enters the IP stack) because it corrupts IP 
packets with RR or TS options set, only partially updating the IP header 
and leaving an incorrect checksum.

The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:

The bridge layer can overwrite the IPCB using the
BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
if we recieve a packet greater in size than the bridge
device MTU, we call ip_fragment which in turn will lead to
icmp_send calling ip_options_echo if the DF flag is set.
ip_options_echo will then incorrectly try to parse the IPCB as
IP options resulting in a buffer overflow.
This change refills the CB area back with IP
options before ip_fragment calls icmp_send. If we fail parsing,
we zero out the IPCB area to guarantee that the stack does
not get corrupted.

A bridge should not fragment packets; an *ethernet* bridge should not 
need to.  Fragmenting packets is the job of higher level protocol.

--- br_netfilter.c	2014-01-20 13:10:07.000000000 +1030
+++ br_netfilter.c.prop	2014-05-16 23:07:57.975386905 +0930
@@ -253,73 +253,6 @@
  		skb->protocol = htons(ETH_P_PPP_SES);
  }
  
-/* When handing a packet over to the IP layer
- * check whether we have a skb that is in the
- * expected format
- */
-
-static int br_parse_ip_options(struct sk_buff *skb)
-{
-	struct ip_options *opt;
-	const struct iphdr *iph;
-	struct net_device *dev = skb->dev;
-	u32 len;
-
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	opt = &(IPCB(skb)->opt);
-
-	/* Basic sanity checks */
-	if (iph->ihl < 5 || iph->version != 4)
-		goto inhdr_error;
-
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
-
-	len = ntohs(iph->tot_len);
-	if (skb->len < len) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
-		goto drop;
-	} else if (len < (iph->ihl*4))
-		goto inhdr_error;
-
-	if (pskb_trim_rcsum(skb, len)) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
-		goto drop;
-	}
-
-	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-	if (iph->ihl == 5)
-		return 0;
-
-	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
-	if (ip_options_compile(dev_net(dev), opt, skb))
-		goto inhdr_error;
-
-	/* Check correct handling of SRR option */
-	if (unlikely(opt->srr)) {
-		struct in_device *in_dev = __in_dev_get_rcu(dev);
-		if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
-			goto drop;
-
-		if (ip_options_rcv_srr(skb))
-			goto drop;
-	}
-
-	return 0;
-
-inhdr_error:
-	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
-drop:
-	return -1;
-}
-
  /* Fill in the header for fragmented IP packets handled by
   * the IPv4 connection tracking code.
   */
@@ -679,6 +612,7 @@
  {
  	struct net_bridge_port *p;
  	struct net_bridge *br;
+	const struct iphdr *iph;
  	__u32 len = nf_bridge_encap_header_len(skb);
  
  	if (unlikely(!pskb_may_pull(skb, len)))
@@ -704,9 +638,29 @@
  		return NF_ACCEPT;
  
  	nf_bridge_pull_encap_header_rcsum(skb);
+
+	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+		goto inhdr_error;
  
-	if (br_parse_ip_options(skb))
-		return NF_DROP;
+	iph = ip_hdr(skb);
+	if (iph->ihl < 5 || iph->version != 4)
+		goto inhdr_error;
+
+	if (!pskb_may_pull(skb, 4 * iph->ihl))
+		goto inhdr_error;
+
+	iph = ip_hdr(skb);
+	if (ip_fast_csum((__u8 *) iph, iph->ihl) != 0)
+		goto inhdr_error;
+
+	len = ntohs(iph->tot_len);
+	if (skb->len < len || len < 4 * iph->ihl)
+		goto inhdr_error;
+
+	pskb_trim_rcsum(skb, len);
+
+	/* BUG: Should really parse the IP options here. */
+	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
  
  	nf_bridge_put(skb->nf_bridge);
  	if (!nf_bridge_alloc(skb))
@@ -720,6 +674,10 @@
  		br_nf_pre_routing_finish);
  
  	return NF_STOLEN;
+
+inhdr_error:
+//      IP_INC_STATS_BH(IpInHdrErrors);
+	return NF_DROP;
  }
  
  
@@ -806,9 +764,6 @@
  		nf_bridge->mask |= BRNF_PKT_TYPE;
  	}
  
-	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
-		return NF_DROP;
-
  	/* The physdev module checks on this */
  	nf_bridge->mask |= BRNF_BRIDGED;
  	nf_bridge->physoutdev = skb->dev;
@@ -862,19 +817,14 @@
  #if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV4)
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
  {
-	int ret;
-
  	if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
  	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
  	    !skb_is_gso(skb)) {
-		if (br_parse_ip_options(skb))
-			/* Drop invalid packet */
-			return NF_DROP;
-		ret = ip_fragment(skb, br_dev_queue_push_xmit);
+		/* BUG: Should really parse the IP options here. */
+		memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+		return ip_fragment(skb, br_dev_queue_push_xmit);
  	} else
-		ret = br_dev_queue_push_xmit(skb);
-
-	return ret;
+		return br_dev_queue_push_xmit(skb);
  }
  #else
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-16 14:33         ` Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack) David Newall
@ 2014-05-16 15:19           ` Eric Dumazet
  2014-05-16 15:23             ` David Newall
  2014-05-16 15:24             ` David Newall
  2014-05-19 12:58           ` David Newall
  2014-05-22 20:06           ` Bandan Das
  2 siblings, 2 replies; 46+ messages in thread
From: Eric Dumazet @ 2014-05-16 15:19 UTC (permalink / raw)
  To: David Newall; +Cc: Lukas Tribus, Netdev, fw

On Sat, 2014-05-17 at 00:03 +0930, David Newall wrote:
> We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge 
> : Sanitize skb before it enters the IP stack) because it corrupts IP 
> packets with RR or TS options set, only partially updating the IP header 
> and leaving an incorrect checksum.
> 
> The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:
> 
> The bridge layer can overwrite the IPCB using the
> BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
> if we recieve a packet greater in size than the bridge
> device MTU, we call ip_fragment which in turn will lead to
> icmp_send calling ip_options_echo if the DF flag is set.
> ip_options_echo will then incorrectly try to parse the IPCB as
> IP options resulting in a buffer overflow.
> This change refills the CB area back with IP
> options before ip_fragment calls icmp_send. If we fail parsing,
> we zero out the IPCB area to guarantee that the stack does
> not get corrupted.
> 
> A bridge should not fragment packets; an *ethernet* bridge should not 
> need to.  Fragmenting packets is the job of higher level protocol.
> 
> --- br_netfilter.c	2014-01-20 13:10:07.000000000 +1030
> +++ br_netfilter.c.prop	2014-05-16 23:07:57.975386905 +0930


Is this meant to be an official patch ?

Please 

1) CC patch author (Bandan Das <bandan.das@stratus.com>) so he has a way
to comment if he no longer follows netdev ?

2) Read Documentation/SubmittingPatches 

>   	return NF_STOLEN;
> +
> +inhdr_error:
> +//      IP_INC_STATS_BH(IpInHdrErrors);

Don't leave this // 

Thanks

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-16 15:19           ` Eric Dumazet
@ 2014-05-16 15:23             ` David Newall
  2014-05-16 15:24             ` David Newall
  1 sibling, 0 replies; 46+ messages in thread
From: David Newall @ 2014-05-16 15:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Lukas Tribus, Netdev, fw

On 17/05/14 00:49, Eric Dumazet wrote:
> Is this meant to be an official patch ?

No.  I anticipate there will be discussion before we reach an official 
patch.
> Please
>
> 1) CC patch author (Bandan Das<bandan.das@stratus.com>) so he has a way
> to comment if he no longer follows netdev ?

His email bounces.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-16 15:19           ` Eric Dumazet
  2014-05-16 15:23             ` David Newall
@ 2014-05-16 15:24             ` David Newall
  1 sibling, 0 replies; 46+ messages in thread
From: David Newall @ 2014-05-16 15:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Lukas Tribus, Netdev, fw

On 17/05/14 00:49, Eric Dumazet wrote:
>> >   	return NF_STOLEN;
>> >+
>> >+inhdr_error:
>> >+//      IP_INC_STATS_BH(IpInHdrErrors);
> Don't leave this //

Thanks.  That was in the original code, prior to the commit I want 
reversed, but I'll take it out.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-16 14:33         ` Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack) David Newall
  2014-05-16 15:19           ` Eric Dumazet
@ 2014-05-19 12:58           ` David Newall
  2014-05-19 14:01             ` Florian Westphal
  2014-05-22 20:06           ` Bandan Das
  2 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-19 12:58 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Netdev, Linux Kernel Mailing List, bridge

Having received no feedback of substance from netdev, I now address my 
previous email to a wider audience for discussion and in preparation for 
submitting a patch based closely on that below.

This email is not addressed to Bandan Das <bandan.das@stratus.com>, who 
is the author of the commit I propose reverting, as his email address is 
no longer current.  I believe I have otherwise addressed all appropriate 
recipients and will circulate a formal patch to the same recipients if 
no adverse comments are received.  (That would surprise me.)


-------- Original Message --------
Subject: 	Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : 
Sanitize skb before it enters the IP stack)
Date: 	Sat, 17 May 2014 00:03:16 +0930
From: 	David Newall <davidn@davidnewall.com>
To: 	Lukas Tribus <luky-37@hotmail.com>, Eric Dumazet 
<eric.dumazet@gmail.com>, Netdev <netdev@vger.kernel.org>
CC: 	fw@strlen.de <fw@strlen.de>



We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge
: Sanitize skb before it enters the IP stack) because it corrupts IP
packets with RR or TS options set, only partially updating the IP header
and leaving an incorrect checksum.

The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:

The bridge layer can overwrite the IPCB using the
BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
if we recieve a packet greater in size than the bridge
device MTU, we call ip_fragment which in turn will lead to
icmp_send calling ip_options_echo if the DF flag is set.
ip_options_echo will then incorrectly try to parse the IPCB as
IP options resulting in a buffer overflow.
This change refills the CB area back with IP
options before ip_fragment calls icmp_send. If we fail parsing,
we zero out the IPCB area to guarantee that the stack does
not get corrupted.

A bridge should not fragment packets; an *ethernet* bridge should not
need to.  Fragmenting packets is the job of higher level protocol.

--- br_netfilter.c	2014-01-20 13:10:07.000000000 +1030
+++ br_netfilter.c.prop	2014-05-16 23:07:57.975386905 +0930
@@ -253,73 +253,6 @@
  		skb->protocol = htons(ETH_P_PPP_SES);
  }
  
-/* When handing a packet over to the IP layer
- * check whether we have a skb that is in the
- * expected format
- */
-
-static int br_parse_ip_options(struct sk_buff *skb)
-{
-	struct ip_options *opt;
-	const struct iphdr *iph;
-	struct net_device *dev = skb->dev;
-	u32 len;
-
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	opt = &(IPCB(skb)->opt);
-
-	/* Basic sanity checks */
-	if (iph->ihl < 5 || iph->version != 4)
-		goto inhdr_error;
-
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
-
-	len = ntohs(iph->tot_len);
-	if (skb->len < len) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
-		goto drop;
-	} else if (len < (iph->ihl*4))
-		goto inhdr_error;
-
-	if (pskb_trim_rcsum(skb, len)) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
-		goto drop;
-	}
-
-	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-	if (iph->ihl == 5)
-		return 0;
-
-	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
-	if (ip_options_compile(dev_net(dev), opt, skb))
-		goto inhdr_error;
-
-	/* Check correct handling of SRR option */
-	if (unlikely(opt->srr)) {
-		struct in_device *in_dev = __in_dev_get_rcu(dev);
-		if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
-			goto drop;
-
-		if (ip_options_rcv_srr(skb))
-			goto drop;
-	}
-
-	return 0;
-
-inhdr_error:
-	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
-drop:
-	return -1;
-}
-
  /* Fill in the header for fragmented IP packets handled by
   * the IPv4 connection tracking code.
   */
@@ -679,6 +612,7 @@
  {
  	struct net_bridge_port *p;
  	struct net_bridge *br;
+	const struct iphdr *iph;
  	__u32 len = nf_bridge_encap_header_len(skb);
  
  	if (unlikely(!pskb_may_pull(skb, len)))
@@ -704,9 +638,29 @@
  		return NF_ACCEPT;
  
  	nf_bridge_pull_encap_header_rcsum(skb);
+
+	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+		goto inhdr_error;
  
-	if (br_parse_ip_options(skb))
-		return NF_DROP;
+	iph = ip_hdr(skb);
+	if (iph->ihl < 5 || iph->version != 4)
+		goto inhdr_error;
+
+	if (!pskb_may_pull(skb, 4 * iph->ihl))
+		goto inhdr_error;
+
+	iph = ip_hdr(skb);
+	if (ip_fast_csum((__u8 *) iph, iph->ihl) != 0)
+		goto inhdr_error;
+
+	len = ntohs(iph->tot_len);
+	if (skb->len < len || len < 4 * iph->ihl)
+		goto inhdr_error;
+
+	pskb_trim_rcsum(skb, len);
+
+	/* BUG: Should really parse the IP options here. */
+	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
  
  	nf_bridge_put(skb->nf_bridge);
  	if (!nf_bridge_alloc(skb))
@@ -720,6 +674,10 @@
  		br_nf_pre_routing_finish);
  
  	return NF_STOLEN;
+
+inhdr_error:
+//      IP_INC_STATS_BH(IpInHdrErrors);
+	return NF_DROP;
  }
  
  
@@ -806,9 +764,6 @@
  		nf_bridge->mask |= BRNF_PKT_TYPE;
  	}
  
-	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
-		return NF_DROP;
-
  	/* The physdev module checks on this */
  	nf_bridge->mask |= BRNF_BRIDGED;
  	nf_bridge->physoutdev = skb->dev;
@@ -862,19 +817,14 @@
  #if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV4)
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
  {
-	int ret;
-
  	if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
  	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
  	    !skb_is_gso(skb)) {
-		if (br_parse_ip_options(skb))
-			/* Drop invalid packet */
-			return NF_DROP;
-		ret = ip_fragment(skb, br_dev_queue_push_xmit);
+		/* BUG: Should really parse the IP options here. */
+		memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+		return ip_fragment(skb, br_dev_queue_push_xmit);
  	} else
-		ret = br_dev_queue_push_xmit(skb);
-
-	return ret;
+		return br_dev_queue_push_xmit(skb);
  }
  #else
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-19 12:58           ` David Newall
@ 2014-05-19 14:01             ` Florian Westphal
  2014-05-19 14:19               ` David Newall
  0 siblings, 1 reply; 46+ messages in thread
From: Florian Westphal @ 2014-05-19 14:01 UTC (permalink / raw)
  To: David Newall; +Cc: Stephen Hemminger, Netdev, Linux Kernel Mailing List, bridge

David Newall <davidn@davidnewall.com> wrote:
> Having received no feedback of substance from netdev, I now address
> my previous email to a wider audience for discussion and in
> preparation for submitting a patch based closely on that below.
> 
> This email is not addressed to Bandan Das <bandan.das@stratus.com>,
> who is the author of the commit I propose reverting, as his email
> address is no longer current.  I believe I have otherwise addressed
> all appropriate recipients and will circulate a formal patch to the
> same recipients if no adverse comments are received.  (That would
> surprise me.)
> 
> -------- Original Message --------
> Subject: 	Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge :
> Sanitize skb before it enters the IP stack)
> Date: 	Sat, 17 May 2014 00:03:16 +0930
> From: 	David Newall <davidn@davidnewall.com>
> To: 	Lukas Tribus <luky-37@hotmail.com>, Eric Dumazet
> <eric.dumazet@gmail.com>, Netdev <netdev@vger.kernel.org>
> CC: 	fw@strlen.de <fw@strlen.de>
> 
> 
> 
> We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge
> : Sanitize skb before it enters the IP stack) because it corrupts IP
> packets with RR or TS options set, only partially updating the IP header
> and leaving an incorrect checksum.
> 
> The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:
> 
> The bridge layer can overwrite the IPCB using the
> BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
> if we recieve a packet greater in size than the bridge
> device MTU, we call ip_fragment which in turn will lead to
> icmp_send calling ip_options_echo if the DF flag is set.
> ip_options_echo will then incorrectly try to parse the IPCB as
> IP options resulting in a buffer overflow.
> This change refills the CB area back with IP
> options before ip_fragment calls icmp_send. If we fail parsing,
> we zero out the IPCB area to guarantee that the stack does
> not get corrupted.
>
> A bridge should not fragment packets; an *ethernet* bridge should not
> need to.  Fragmenting packets is the job of higher level protocol.

Well, did you test what happens if we try to refrag a packet
containing ip options after the revert?

can happen e.g. when using netfilter conntrack on top of a bridge.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-19 14:01             ` Florian Westphal
@ 2014-05-19 14:19               ` David Newall
  2014-05-19 17:09                 ` Florian Westphal
  2014-05-20  4:55                 ` Valdis.Kletnieks
  0 siblings, 2 replies; 46+ messages in thread
From: David Newall @ 2014-05-19 14:19 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Stephen Hemminger, Netdev, Linux Kernel Mailing List, bridge

Thanks for the reply.  I've been hanging out for it!

On 19/05/14 23:31, Florian Westphal wrote:
> Well, did you test what happens if we try to refrag a packet
> containing ip options after the revert?
>
> can happen e.g. when using netfilter conntrack on top of a bridge.

No.  I expect it would panic, as was reported prior to the commit.

I tried to persevere with the commit: I recalculated checksum, which 
left routes and times improperly updated in options.  Then I tried 
calling ip_forward_options, which looks like it would correctly update 
RR and TS (not to mention checksum)m but that bombed because skb_rtable 
returned NULL.  I think calling skb_set_dst would answer that, but I 
don't know how to get a valid dst.  (I asked for help but no answer.)

I see three ways to progress:

1. Possibly call ip_forward_option, but that requires somebody who 
understands this code to help;
2. Just recalculate the checksum, leaving crap in the options; or
3. Revert the commit.

Option 1 doesn't look like it's going to happen; option 2 is stupid; 
leaving option 3, and I begin to think that's the right way to go if 
bridge is supposed to be a bridge and not a router.  The idea that 
bridge is doing too much seems to have quite a lot of currency, so think 
of reversion as chopping off a canker.  Or we keep fixing bugs, adding 
to bridge, until it replicates all of IP.

How does a packet get fragmented in this case?  Does it only happen when 
bridging to a device with smaller MTU?  That scenario sounds quite 
un-bridge-like.  It also sounds like something that can be handled by 
real routing.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-19 14:19               ` David Newall
@ 2014-05-19 17:09                 ` Florian Westphal
  2014-05-19 20:49                   ` Bart De Schuymer
  2014-05-20  3:57                   ` David Newall
  2014-05-20  4:55                 ` Valdis.Kletnieks
  1 sibling, 2 replies; 46+ messages in thread
From: Florian Westphal @ 2014-05-19 17:09 UTC (permalink / raw)
  To: David Newall
  Cc: Florian Westphal, Stephen Hemminger, Netdev, netfilter-devel, bridge

David Newall <davidn@davidnewall.com> wrote:

[ remove lkml and cc nf-devel ]

> I tried to persevere with the commit: I recalculated checksum, which
> left routes and times improperly updated in options.  Then I tried
> calling ip_forward_options, which looks like it would correctly
> update RR and TS (not to mention checksum)m but that bombed because
> skb_rtable returned NULL.

Yes.  bridge<->netfilter wiring is pure duct tape.
The glue code will set up a fake rtable for the skb after the
prerouting hook. [ see br_nf_pre_routing_finish() ].

> I see three ways to progress:
> 
> 1. Possibly call ip_forward_option, but that requires somebody who
> understands this code to help;
> 2. Just recalculate the checksum, leaving crap in the options; or
> 3. Revert the commit.

I think none of these are an option.

I fail to understand why a bridge should honor/modifiy IP options.

For the 'local delivery' case the ip stack will take care of
option parsing, for forwarding it should be sufficient to do
sanity tests (for netfilters sake).

>From a quick glance, it should be sufficient to edit
br_parse_ip_options() and remove everything after

memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));

A 2nd step would be to move a copy of ip_options_compile()
into br_netfilter.c and trim it down to only validate the
ipv4 header without modifying it.

If there is a good reason to mangle options on a bridge i'd
prefer a comment explaining them...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-19 17:09                 ` Florian Westphal
@ 2014-05-19 20:49                   ` Bart De Schuymer
  2014-05-21  7:49                     ` David Newall
  2014-05-20  3:57                   ` David Newall
  1 sibling, 1 reply; 46+ messages in thread
From: Bart De Schuymer @ 2014-05-19 20:49 UTC (permalink / raw)
  To: Florian Westphal, David Newall
  Cc: Stephen Hemminger, Netdev, bridge, netfilter-devel

Florian Westphal schreef op 19/05/2014 19:09:
> David Newall <davidn@davidnewall.com> wrote:
>
> [ remove lkml and cc nf-devel ]
>
>> I tried to persevere with the commit: I recalculated checksum, which
>> left routes and times improperly updated in options.  Then I tried
>> calling ip_forward_options, which looks like it would correctly
>> update RR and TS (not to mention checksum)m but that bombed because
>> skb_rtable returned NULL.
>
> Yes.  bridge<->netfilter wiring is pure duct tape.
> The glue code will set up a fake rtable for the skb after the
> prerouting hook. [ see br_nf_pre_routing_finish() ].
>
>> I see three ways to progress:
>>
>> 1. Possibly call ip_forward_option, but that requires somebody who
>> understands this code to help;
>> 2. Just recalculate the checksum, leaving crap in the options; or
>> 3. Revert the commit.
>
> I think none of these are an option.
>
> I fail to understand why a bridge should honor/modifiy IP options.
>
> For the 'local delivery' case the ip stack will take care of
> option parsing, for forwarding it should be sufficient to do
> sanity tests (for netfilters sake).
>
>>From a quick glance, it should be sufficient to edit
> br_parse_ip_options() and remove everything after
>
> memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
>
> A 2nd step would be to move a copy of ip_options_compile()
> into br_netfilter.c and trim it down to only validate the
> ipv4 header without modifying it.

Perhaps it's possible to call ip_options_compile with a skb == NULL, 
like ip_options.c::ip_options_get_finish does. That way we don't need to 
duplicate code.
An alternative would be to make sure that the data pointed to by IPCB 
and BR_INPUT_SKB_CB don't overlap. If this were the case, we could 
indeed just revert the commit that was referred to.


cheers,
Bart

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-19 17:09                 ` Florian Westphal
  2014-05-19 20:49                   ` Bart De Schuymer
@ 2014-05-20  3:57                   ` David Newall
  1 sibling, 0 replies; 46+ messages in thread
From: David Newall @ 2014-05-20  3:57 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Stephen Hemminger, Netdev, netfilter-devel, bridge

On 20/05/14 02:39, Florian Westphal wrote:
>  From a quick glance, it should be sufficient to edit
> br_parse_ip_options() and remove everything after
>
> memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));

Yes.  That's the way it used to be, and how it would return with the 
change I'm proposing.  The br_parse_ip_option function would be removed 
and its remaining code moved back from whence it came.


> A 2nd step would be to move a copy of ip_options_compile()
> into br_netfilter.c and trim it down to only validate the
> ipv4 header without modifying it.

The bridge sounds like the wrong place to validate an IPv4 header, 
unless it also validates every type of header; and that can't be right.  
That we need to zero the cb area seems like a big clue that IP's 
treatment of the area is lame.  I think that's where the problem lies, 
and that the right thing to do is to yank out the crap from bridge that 
papers over IP's weakness.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-19 14:19               ` David Newall
  2014-05-19 17:09                 ` Florian Westphal
@ 2014-05-20  4:55                 ` Valdis.Kletnieks
  2014-05-20 16:05                   ` Vlad Yasevich
  2014-05-21  8:10                   ` David Newall
  1 sibling, 2 replies; 46+ messages in thread
From: Valdis.Kletnieks @ 2014-05-20  4:55 UTC (permalink / raw)
  To: David Newall
  Cc: Florian Westphal, Stephen Hemminger, Netdev,
	Linux Kernel Mailing List, bridge

[-- Attachment #1: Type: text/plain, Size: 1206 bytes --]

On Mon, 19 May 2014 23:49:22 +0930, David Newall said:

> How does a packet get fragmented in this case?  Does it only happen when
> bridging to a device with smaller MTU?  That scenario sounds quite
> un-bridge-like.  It also sounds like something that can be handled by
> real routing.

Which doesn't change the fact that you *will* get clowns who take a box that
has a 10G card on a jumbogram-enabled subnet that's running with an MTU of
9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route.
(Did you know that you can actually mount an NFS filesystem across that? And
that ls and cat and friends will work *just fine*? Until you hit a file that's
more than 1.5 in size, that is. And when you do a traceroute to the wedged
client, it tells you it's on the 10G network, so you have no idea why you're
seeing an MTU issue.  Don't ask how I know this - let's just say that
supporting HPC users is never boring. :)

So yes, we *do* need to do something sensible there - either frag the packet
on the way out, or something.  It *would* be nice if we could drop the
packet and send an ICMP Frag Needed back - except it's unclear what IP
you use as the source address for the ICMP....


[-- Attachment #2: Type: application/pgp-signature, Size: 848 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-20  4:55                 ` Valdis.Kletnieks
@ 2014-05-20 16:05                   ` Vlad Yasevich
  2014-05-21  8:10                   ` David Newall
  1 sibling, 0 replies; 46+ messages in thread
From: Vlad Yasevich @ 2014-05-20 16:05 UTC (permalink / raw)
  To: Valdis.Kletnieks, David Newall
  Cc: Stephen Hemminger, Netdev, bridge, Florian Westphal,
	Linux Kernel Mailing List

On 05/20/2014 12:55 AM, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 19 May 2014 23:49:22 +0930, David Newall said:
> 
>> How does a packet get fragmented in this case?  Does it only happen when
>> bridging to a device with smaller MTU?  That scenario sounds quite
>> un-bridge-like.  It also sounds like something that can be handled by
>> real routing.
> 
> Which doesn't change the fact that you *will* get clowns who take a box that
> has a 10G card on a jumbogram-enabled subnet that's running with an MTU of
> 9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route.
> (Did you know that you can actually mount an NFS filesystem across that? And
> that ls and cat and friends will work *just fine*? Until you hit a file that's
> more than 1.5 in size, that is. And when you do a traceroute to the wedged
> client, it tells you it's on the 10G network, so you have no idea why you're
> seeing an MTU issue.  Don't ask how I know this - let's just say that
> supporting HPC users is never boring. :)
> 
> So yes, we *do* need to do something sensible there - either frag the packet
> on the way out, or something.  It *would* be nice if we could drop the
> packet and send an ICMP Frag Needed back - except it's unclear what IP
> you use as the source address for the ICMP....
> 

If there is no netfilter, then the bridge will just drop the packet
(see br_dev_queue_push_xmit).  It should probably also do that with
netfilter.

On the question of ICMP, I've also debated about sending ICMP Frag
Needed, but that's really beyond the scope of the bridge device.

Recording a stat might be sufficient to help troubleshoot these types
of issues.

-vlad

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-19 20:49                   ` Bart De Schuymer
@ 2014-05-21  7:49                     ` David Newall
  2014-05-21 18:51                       ` Bart De Schuymer
  0 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-21  7:49 UTC (permalink / raw)
  To: Bart De Schuymer, Florian Westphal
  Cc: Stephen Hemminger, Netdev, netfilter-devel, bridge

Hi Bart,

Thanks for thinking about this.

On 20/05/14 06:19, Bart De Schuymer wrote:
> Perhaps it's possible to call ip_options_compile with a skb == NULL, 
> like ip_options.c::ip_options_get_finish does. That way we don't need 
> to duplicate code.

I think not.  My reading of the discussion behind the commit is that skb 
cb area could contain something that was confused for IP options.  To 
solve that, and to allow for proper response when the packet's DF flag 
was set, the cb area was cleared and ip_options_compile() was called.  
Calling that function does only part of the job, leaving slots for 
addresses (and possibly timestamps) to be filled in by a later function; 
possibly ip_forward_options().  I did try calling that; it failed; 
skb_rtable() returned NULL.

I have also read enough comments deriding the "incestuous" relationship 
between bridge and IP to convince me that the relationship should be 
severed.  A bridge is such a simple concept which, when it starts 
looking into the payload, ceases to be a bridge.

I have experience in this code measured in hours; not a lot.  I welcome 
correction if I misunderstand things.

> An alternative would be to make sure that the data pointed to by IPCB 
> and BR_INPUT_SKB_CB don't overlap. If this were the case, we could 
> indeed just revert the commit that was referred to.

They are identical spaces, but you imply a good point: the cb area is 
possibly being used, simultaneously, for two, incompatible purposes.  
Yet another argument for divorcing bridge of ip logic.

Regards,

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-20  4:55                 ` Valdis.Kletnieks
  2014-05-20 16:05                   ` Vlad Yasevich
@ 2014-05-21  8:10                   ` David Newall
  2014-05-21 20:14                     ` David Miller
  1 sibling, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-21  8:10 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Florian Westphal, Stephen Hemminger, Netdev,
	Linux Kernel Mailing List, bridge

On 20/05/14 14:25, Valdis.Kletnieks@vt.edu wrote:
> So yes, we*do*  need to do something sensible there - either frag the packet
> on the way out, or something.

I think the problem is that a bridge cannot be used across incompatible 
media.  That's the job of a router.

A bridge should act like a bridge, not a router.  Fragmenting the packet 
is wrong; that's IP's job.  Dropping the packet is also arguably wrong; 
that's the real device-driver's job.  What seems right to me is to act 
like a bridge and forward packets by looking inside of them *no more 
than is necessary*.  Looking beyond MAC address is perhaps too much.

We can finish the job of processing IP options, or at least in this 
scenario, but that seems wrong-headed and invites more work as more 
problems are discovered; or we could remove the half-hearted attempt it 
currently does and leave the bridge as a simple bridge.

This problem wouldn't occur if all devices in a bridge were required to 
be compatible media; particularly identical MTU.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-21  7:49                     ` David Newall
@ 2014-05-21 18:51                       ` Bart De Schuymer
  2014-05-21 20:18                         ` David Miller
  2014-05-22  3:50                         ` David Newall
  0 siblings, 2 replies; 46+ messages in thread
From: Bart De Schuymer @ 2014-05-21 18:51 UTC (permalink / raw)
  To: David Newall, Florian Westphal
  Cc: Stephen Hemminger, Netdev, netfilter-devel, bridge

David Newall schreef op 21/05/2014 9:49:
>> An alternative would be to make sure that the data pointed to by IPCB
>> and BR_INPUT_SKB_CB don't overlap. If this were the case, we could
>> indeed just revert the commit that was referred to.
>
> They are identical spaces, but you imply a good point: the cb area is
> possibly being used, simultaneously, for two, incompatible purposes. Yet
> another argument for divorcing bridge of ip logic.

There's no reason why they should overlap in the cb: it's 48 bytes big, 
so big enough to hold both struct br_input_skb_cb and struct 
inet_skb_parm. The original problem was introduced when BR_INPUT_SKB_CB 
was introduced (around Feb 27, 2010), so fixing BR_INPUT_SKB_CB seems 
most appropriate to me.
As for your other remark: as I've said before, if you don't like 
bridge-netfilter then don't compile it into your kernel.

Bart

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-21  8:10                   ` David Newall
@ 2014-05-21 20:14                     ` David Miller
  0 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2014-05-21 20:14 UTC (permalink / raw)
  To: davidn; +Cc: Valdis.Kletnieks, netdev, bridge, fw, linux-kernel, stephen

From: David Newall <davidn@davidnewall.com>
Date: Wed, 21 May 2014 17:40:25 +0930

> On 20/05/14 14:25, Valdis.Kletnieks@vt.edu wrote:
>> So yes, we*do* need to do something sensible there - either frag the
>> packet
>> on the way out, or something.
> 
> I think the problem is that a bridge cannot be used across
> incompatible media.  That's the job of a router.
> 
> A bridge should act like a bridge, not a router.  Fragmenting the
> packet is wrong; that's IP's job.  Dropping the packet is also
> arguably wrong; that's the real device-driver's job.  What seems right
> to me is to act like a bridge and forward packets by looking inside of
> them *no more than is necessary*.  Looking beyond MAC address is
> perhaps too much.
> 
> We can finish the job of processing IP options, or at least in this
> scenario, but that seems wrong-headed and invites more work as more
> problems are discovered; or we could remove the half-hearted attempt
> it currently does and leave the bridge as a simple bridge.
> 
> This problem wouldn't occur if all devices in a bridge were required
> to be compatible media; particularly identical MTU.

I completely agree with you.

I also just want to state for the record, and I know some people will
disagree with me, that I think the bridging netfilter layer should
never have been integrated into the tree.

And I've been saying this for more than a decade.

It takes layering violations to a whole new level, and it's why we see
problems like this.

Besides this IP options issue, it also creates fake ipv4 routes, so
every time someone tries to do anything non-trivial with the ipv4
routing code the bridging netfilter fake route code had to be adjusted
or else we'd get crashes.

It has also held back many potential improvements to iptables in
general over the years because it does so many things differently
than the rest of the iptables modules.

It stinks, we never should have added it, and now since we have people
have been perversely convinced that doing stuff like this is actually
sane.  It's not.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-21 18:51                       ` Bart De Schuymer
@ 2014-05-21 20:18                         ` David Miller
  2014-05-22 18:57                           ` Bart De Schuymer
  2014-05-24  5:56                           ` David Newall
  2014-05-22  3:50                         ` David Newall
  1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2014-05-21 20:18 UTC (permalink / raw)
  To: bdschuym; +Cc: netdev, davidn, bridge, fw, stephen, netfilter-devel

From: Bart De Schuymer <bdschuym@pandora.be>
Date: Wed, 21 May 2014 20:51:14 +0200

> David Newall schreef op 21/05/2014 9:49:
>>> An alternative would be to make sure that the data pointed to by IPCB
>>> and BR_INPUT_SKB_CB don't overlap. If this were the case, we could
>>> indeed just revert the commit that was referred to.
>>
>> They are identical spaces, but you imply a good point: the cb area is
>> possibly being used, simultaneously, for two, incompatible
>> purposes. Yet
>> another argument for divorcing bridge of ip logic.
> 
> There's no reason why they should overlap in the cb: it's 48 bytes
> big, so big enough to hold both struct br_input_skb_cb and struct
> inet_skb_parm. The original problem was introduced when
> BR_INPUT_SKB_CB was introduced (around Feb 27, 2010), so fixing
> BR_INPUT_SKB_CB seems most appropriate to me.

So you are suggesting the patch below will fix everything?

> As for your other remark: as I've said before, if you don't like
> bridge-netfilter then don't compile it into your kernel.

That's never a good argument, please stop making it.

%99.999999999 of users get their kernels from distributions and
they are all going to enable basically every feature available.

We never should have added bridging netfilter to the tree in the
first place, I wish I had better judgment back then.

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 06811d7..2300def 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -18,6 +18,7 @@
 #include <linux/netpoll.h>
 #include <linux/u64_stats_sync.h>
 #include <net/route.h>
+#include <net/ip.h>
 #include <linux/if_vlan.h>
 
 #define BR_HASH_BITS 8
@@ -297,6 +298,7 @@ struct net_bridge
 };
 
 struct br_input_skb_cb {
+	struct inet_skb_parm ip;
 	struct net_device *brdev;
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 	int igmp;

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-21 18:51                       ` Bart De Schuymer
  2014-05-21 20:18                         ` David Miller
@ 2014-05-22  3:50                         ` David Newall
  2014-05-22 18:57                           ` Bart De Schuymer
  1 sibling, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-22  3:50 UTC (permalink / raw)
  To: Bart De Schuymer, Florian Westphal, David S. Miller
  Cc: Stephen Hemminger, Netdev, netfilter-devel, bridge

On 22/05/14 04:21, Bart De Schuymer wrote:
> There's no reason why they should overlap in the cb: it's 48 bytes 
> big, so big enough to hold both struct br_input_skb_cb and struct 
> inet_skb_parm.

No reason, aside from the math, I think.  Those 48 bytes appear to be 
used for 16 bytes of ip_options plus up to 40 bytes of options data, so 
we're using pretend-space; of which we'd need more to squeeze 
br_input_skb_cb in at the same time.

I hate opening a second can of worms, but, if I read this right, IPCB is 
quite, quite broken.


> As for your other remark: as I've said before, if you don't like 
> bridge-netfilter then don't compile it into your kernel.

That's not very helpful.  I could say, with just as much merit, that it 
should be marked deprecated (so that it's not compiled into distribution 
kernels) and you can compile it into yours.

What I dislike is that bridge-netfilter is faulty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-22  3:50                         ` David Newall
@ 2014-05-22 18:57                           ` Bart De Schuymer
  0 siblings, 0 replies; 46+ messages in thread
From: Bart De Schuymer @ 2014-05-22 18:57 UTC (permalink / raw)
  To: David Newall, Florian Westphal, David S. Miller
  Cc: Stephen Hemminger, Netdev, netfilter-devel, bridge

David Newall schreef op 22/05/2014 5:50:
> On 22/05/14 04:21, Bart De Schuymer wrote:
>> As for your other remark: as I've said before, if you don't like
>> bridge-netfilter then don't compile it into your kernel.
>
> That's not very helpful.  I could say, with just as much merit, that it
> should be marked deprecated (so that it's not compiled into distribution
> kernels) and you can compile it into yours.
>
> What I dislike is that bridge-netfilter is faulty.

I can see this may be frustrating to many. Now and then I actually got 
some positive feedback :-) Anyway, since I didn't have much sleep last 
night I'll be brief.
I'm fine with deprecating ebtables/bridge-nf. The code is over a decade 
old and development hasn't really picked up speed after I diverted my 
spare time to other things.
 From now on I'm no longer wasting my spare time on involvement in 
bridge-nf/ebtables. Anyone that wants to step up to take over the git 
repository (also contains arptables userspace app) can please contact me 
by private mail.

Best regards,
Bart

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-21 20:18                         ` David Miller
@ 2014-05-22 18:57                           ` Bart De Schuymer
  2014-05-24 18:00                             ` David Miller
  2014-05-24  5:56                           ` David Newall
  1 sibling, 1 reply; 46+ messages in thread
From: Bart De Schuymer @ 2014-05-22 18:57 UTC (permalink / raw)
  To: David Miller; +Cc: davidn, fw, stephen, netdev, netfilter-devel, bridge

David Miller schreef op 21/05/2014 22:18:
> From: Bart De Schuymer <bdschuym@pandora.be>
>> There's no reason why they should overlap in the cb: it's 48 bytes
>> big, so big enough to hold both struct br_input_skb_cb and struct
>> inet_skb_parm. The original problem was introduced when
>> BR_INPUT_SKB_CB was introduced (around Feb 27, 2010), so fixing
>> BR_INPUT_SKB_CB seems most appropriate to me.
>
> So you are suggesting the patch below will fix everything?

Assuming:
- David Newall's worries about IPCB are incorrect
- you also revert the commit mentioned by David 
(462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge :
Sanitize skb before it enters the IP stack))

Then I give it a good chance the regression will be gone with your patch.

> We never should have added bridging netfilter to the tree in the
> first place, I wish I had better judgment back then.

Feel free to deprecate it. This is my last spare-time involvement.

Please apply following patch:

diff --git a/MAINTAINERS b/MAINTAINERS
index f5de16e..2369bae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3163,10 +3163,9 @@ S:	Maintained
  F:	drivers/scsi/eata_pio.*

  EBTABLES
-M:	Bart De Schuymer <bart.de.schuymer@pandora.be>
  L:	netfilter-devel@vger.kernel.org
  W:	http://ebtables.sourceforge.net/
-S:	Maintained
+S:	Orphan
  F:	include/linux/netfilter_bridge/ebt_*.h
  F:	include/uapi/linux/netfilter_bridge/ebt_*.h
  F:	net/bridge/netfilter/ebt*.c


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-16 14:33         ` Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack) David Newall
  2014-05-16 15:19           ` Eric Dumazet
  2014-05-19 12:58           ` David Newall
@ 2014-05-22 20:06           ` Bandan Das
  2 siblings, 0 replies; 46+ messages in thread
From: Bandan Das @ 2014-05-22 20:06 UTC (permalink / raw)
  To: David Newall; +Cc: Lukas Tribus, Eric Dumazet, Netdev, fw

David Newall <davidn@davidnewall.com> writes:

> We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0
> (bridge : Sanitize skb before it enters the IP stack) because it
> corrupts IP packets with RR or TS options set, only partially updating
> the IP header and leaving an incorrect checksum.
>
> The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:
>
> The bridge layer can overwrite the IPCB using the
> BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
> if we recieve a packet greater in size than the bridge
> device MTU, we call ip_fragment which in turn will lead to
> icmp_send calling ip_options_echo if the DF flag is set.
> ip_options_echo will then incorrectly try to parse the IPCB as
> IP options resulting in a buffer overflow.
> This change refills the CB area back with IP
> options before ip_fragment calls icmp_send. If we fail parsing,
> we zero out the IPCB area to guarantee that the stack does
> not get corrupted.
>
> A bridge should not fragment packets; an *ethernet* bridge should not
> need to.  Fragmenting packets is the job of higher level protocol.

IIRC, older dhcp servers try to set a low enough mtu and that was
causing ip_fragment to bail.. I think I was seeing this with a 
KVM guest with a bridged interface set to fetch an ip from the host
network.

Are you sure this won't cause a regression ? I think this was the 
relevant change -
 static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
+       int ret;
+
        if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
            skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
            !skb_is_gso(skb)) {
-               /* BUG: Should really parse the IP options here. */
-               memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-               return ip_fragment(skb, br_dev_queue_push_xmit);
+               if (br_parse_ip_options(skb))
+                       /* Drop invalid packet */
+                       return NF_DROP;
+               ret = ip_fragment(skb, br_dev_queue_push_xmit);

This call was the reason for the change, are you sure this isn't 
applicable anymore ?

        } else
-               return br_dev_queue_push_xmit(skb);
+               ret = br_dev_queue_push_xmit(skb);
+
+       return ret;
 }

> --- br_netfilter.c	2014-01-20 13:10:07.000000000 +1030
> +++ br_netfilter.c.prop	2014-05-16 23:07:57.975386905 +0930
> @@ -253,73 +253,6 @@
>  		skb->protocol = htons(ETH_P_PPP_SES);
>  }
>  -/* When handing a packet over to the IP layer
> - * check whether we have a skb that is in the
> - * expected format
> - */
> -
> -static int br_parse_ip_options(struct sk_buff *skb)
> -{
> -	struct ip_options *opt;
> -	const struct iphdr *iph;
> -	struct net_device *dev = skb->dev;
> -	u32 len;
> -
> -	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> -		goto inhdr_error;
> -
> -	iph = ip_hdr(skb);
> -	opt = &(IPCB(skb)->opt);
> -
> -	/* Basic sanity checks */
> -	if (iph->ihl < 5 || iph->version != 4)
> -		goto inhdr_error;
> -
> -	if (!pskb_may_pull(skb, iph->ihl*4))
> -		goto inhdr_error;
> -
> -	iph = ip_hdr(skb);
> -	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> -		goto inhdr_error;
> -
> -	len = ntohs(iph->tot_len);
> -	if (skb->len < len) {
> -		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
> -		goto drop;
> -	} else if (len < (iph->ihl*4))
> -		goto inhdr_error;
> -
> -	if (pskb_trim_rcsum(skb, len)) {
> -		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
> -		goto drop;
> -	}
> -
> -	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> -	if (iph->ihl == 5)
> -		return 0;
> -
> -	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
> -	if (ip_options_compile(dev_net(dev), opt, skb))
> -		goto inhdr_error;
> -
> -	/* Check correct handling of SRR option */
> -	if (unlikely(opt->srr)) {
> -		struct in_device *in_dev = __in_dev_get_rcu(dev);
> -		if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
> -			goto drop;
> -
> -		if (ip_options_rcv_srr(skb))
> -			goto drop;
> -	}
> -
> -	return 0;
> -
> -inhdr_error:
> -	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
> -drop:
> -	return -1;
> -}
> -
>  /* Fill in the header for fragmented IP packets handled by
>   * the IPv4 connection tracking code.
>   */
> @@ -679,6 +612,7 @@
>  {
>  	struct net_bridge_port *p;
>  	struct net_bridge *br;
> +	const struct iphdr *iph;
>  	__u32 len = nf_bridge_encap_header_len(skb);
>  	if (unlikely(!pskb_may_pull(skb, len)))
> @@ -704,9 +638,29 @@
>  		return NF_ACCEPT;
>  	nf_bridge_pull_encap_header_rcsum(skb);
> +
> +	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> +		goto inhdr_error;
>  -	if (br_parse_ip_options(skb))
> -		return NF_DROP;
> +	iph = ip_hdr(skb);
> +	if (iph->ihl < 5 || iph->version != 4)
> +		goto inhdr_error;
> +
> +	if (!pskb_may_pull(skb, 4 * iph->ihl))
> +		goto inhdr_error;
> +
> +	iph = ip_hdr(skb);
> +	if (ip_fast_csum((__u8 *) iph, iph->ihl) != 0)
> +		goto inhdr_error;
> +
> +	len = ntohs(iph->tot_len);
> +	if (skb->len < len || len < 4 * iph->ihl)
> +		goto inhdr_error;
> +
> +	pskb_trim_rcsum(skb, len);
> +
> +	/* BUG: Should really parse the IP options here. */
> +	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
>  	nf_bridge_put(skb->nf_bridge);
>  	if (!nf_bridge_alloc(skb))
> @@ -720,6 +674,10 @@
>  		br_nf_pre_routing_finish);
>  	return NF_STOLEN;
> +
> +inhdr_error:
> +//      IP_INC_STATS_BH(IpInHdrErrors);
> +	return NF_DROP;
>  }
>  @@ -806,9 +764,6 @@
>  		nf_bridge->mask |= BRNF_PKT_TYPE;
>  	}
>  -	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
> -		return NF_DROP;
> -
>  	/* The physdev module checks on this */
>  	nf_bridge->mask |= BRNF_BRIDGED;
>  	nf_bridge->physoutdev = skb->dev;
> @@ -862,19 +817,14 @@
>  #if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV4)
>  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
>  {
> -	int ret;
> -
>  	if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
>  	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
>  	    !skb_is_gso(skb)) {
> -		if (br_parse_ip_options(skb))
> -			/* Drop invalid packet */
> -			return NF_DROP;
> -		ret = ip_fragment(skb, br_dev_queue_push_xmit);
> +		/* BUG: Should really parse the IP options here. */
> +		memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> +		return ip_fragment(skb, br_dev_queue_push_xmit);
>  	} else
> -		ret = br_dev_queue_push_xmit(skb);
> -
> -	return ret;
> +		return br_dev_queue_push_xmit(skb);
>  }
>  #else
>  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-21 20:18                         ` David Miller
  2014-05-22 18:57                           ` Bart De Schuymer
@ 2014-05-24  5:56                           ` David Newall
  2014-05-24 17:43                             ` David Miller
  1 sibling, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-24  5:56 UTC (permalink / raw)
  To: David Miller, bdschuym
  Cc: fw, stephen, netdev, netfilter-devel, bridge, Bandan Das, Vlad Yasevich

On 22/05/14 05:48, David Miller wrote:
> From: Bart De Schuymer<bdschuym@pandora.be>
> Date: Wed, 21 May 2014 20:51:14 +0200
> > There's no reason why they should overlap in the cb: it's 48 bytes
> > big, so big enough to hold both struct br_input_skb_cb and struct
> > inet_skb_parm. The original problem was introduced when
> > BR_INPUT_SKB_CB was introduced (around Feb 27, 2010), so fixing
> > BR_INPUT_SKB_CB seems most appropriate to me.
>
> So you are suggesting the patch below will fix everything?


First, of course I was wrong about ip overflowing the cb area.  Even I 
thought that was unlikely.  I've reread the code, much more carefully, 
and spotted where I went wrong.

I've added the change that David Miller provided, to those which I am 
proposing, and minimally tested them by pinging through a bridge with RR 
set.  No surprise: it works.

The patch now reverts the commit and mitigates the original problem by 
ensuring bridge's use of cb does not overlap ip's.

--- linux-source-3.13.0/net/bridge/br_netfilter.c.orig	2014-05-17 00:12:23.418906498 +0930
+++ linux-source-3.13.0/net/bridge/br_netfilter.c	2014-05-17 01:04:43.540972961 +0930
@@ -253,73 +253,6 @@ static inline void nf_bridge_update_prot
  		skb->protocol = htons(ETH_P_PPP_SES);
  }
  
-/* When handing a packet over to the IP layer
- * check whether we have a skb that is in the
- * expected format
- */
-
-static int br_parse_ip_options(struct sk_buff *skb)
-{
-	struct ip_options *opt;
-	const struct iphdr *iph;
-	struct net_device *dev = skb->dev;
-	u32 len;
-
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	opt = &(IPCB(skb)->opt);
-
-	/* Basic sanity checks */
-	if (iph->ihl < 5 || iph->version != 4)
-		goto inhdr_error;
-
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
-
-	len = ntohs(iph->tot_len);
-	if (skb->len < len) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
-		goto drop;
-	} else if (len < (iph->ihl*4))
-		goto inhdr_error;
-
-	if (pskb_trim_rcsum(skb, len)) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
-		goto drop;
-	}
-
-	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-	if (iph->ihl == 5)
-		return 0;
-
-	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
-	if (ip_options_compile(dev_net(dev), opt, skb))
-		goto inhdr_error;
-
-	/* Check correct handling of SRR option */
-	if (unlikely(opt->srr)) {
-		struct in_device *in_dev = __in_dev_get_rcu(dev);
-		if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
-			goto drop;
-
-		if (ip_options_rcv_srr(skb))
-			goto drop;
-	}
-
-	return 0;
-
-inhdr_error:
-	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
-drop:
-	return -1;
-}
-
  /* Fill in the header for fragmented IP packets handled by
   * the IPv4 connection tracking code.
   */
@@ -679,6 +612,7 @@ static unsigned int br_nf_pre_routing(co
  {
  	struct net_bridge_port *p;
  	struct net_bridge *br;
+	const struct iphdr *iph;
  	__u32 len = nf_bridge_encap_header_len(skb);
  
  	if (unlikely(!pskb_may_pull(skb, len)))
@@ -704,10 +638,30 @@ static unsigned int br_nf_pre_routing(co
  		return NF_ACCEPT;
  
  	nf_bridge_pull_encap_header_rcsum(skb);
+
+	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+		return NF_DROP;
  
-	if (br_parse_ip_options(skb))
+	iph = ip_hdr(skb);
+	if (iph->ihl < 5 || iph->version != 4)
+		return NF_DROP;
+
+	if (!pskb_may_pull(skb, 4 * iph->ihl))
  		return NF_DROP;
  
+	iph = ip_hdr(skb);
+	if (ip_fast_csum((__u8 *) iph, iph->ihl) != 0)
+		return NF_DROP;
+
+	len = ntohs(iph->tot_len);
+	if (skb->len < len || len < 4 * iph->ihl)
+		return NF_DROP;
+
+	pskb_trim_rcsum(skb, len);
+
+	/* BUG: Should really parse the IP options here. */
+	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+
  	nf_bridge_put(skb->nf_bridge);
  	if (!nf_bridge_alloc(skb))
  		return NF_DROP;
@@ -806,9 +760,6 @@ static unsigned int br_nf_forward_ip(con
  		nf_bridge->mask |= BRNF_PKT_TYPE;
  	}
  
-	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
-		return NF_DROP;
-
  	/* The physdev module checks on this */
  	nf_bridge->mask |= BRNF_BRIDGED;
  	nf_bridge->physoutdev = skb->dev;
@@ -862,19 +813,14 @@ static unsigned int br_nf_forward_arp(co
  #if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV4)
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
  {
-	int ret;
-
  	if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
  	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
  	    !skb_is_gso(skb)) {
-		if (br_parse_ip_options(skb))
-			/* Drop invalid packet */
-			return NF_DROP;
-		ret = ip_fragment(skb, br_dev_queue_push_xmit);
+		/* BUG: Should really parse the IP options here. */
+		memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+		return ip_fragment(skb, br_dev_queue_push_xmit);
  	} else
-		ret = br_dev_queue_push_xmit(skb);
-
-	return ret;
+		return br_dev_queue_push_xmit(skb);
  }
  #else
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
--- linux-source-3.13.0/net/ipv4/ip_options.c.orig	2014-05-16 18:11:10.260370554 +0930
+++ linux-source-3.13.0/net/ipv4/ip_options.c	2014-05-17 01:01:56.738277137 +0930
@@ -475,7 +475,6 @@ error:
  	}
  	return -EINVAL;
  }
-EXPORT_SYMBOL(ip_options_compile);
  
  /*
   *	Undo all the changes done by ip_options_compile().
@@ -658,4 +657,3 @@ int ip_options_rcv_srr(struct sk_buff *s
  	}
  	return 0;
  }
-EXPORT_SYMBOL(ip_options_rcv_srr);
--- /usr/src/linux-source-3.13.0/net/bridge/br_private.h.orig	2014-05-24 13:51:09.269709831 +0930
+++ /usr/src/linux-source-3.13.0/net/bridge/br_private.h	2014-05-24 13:53:20.243551927 +0930
@@ -18,6 +18,7 @@
  #include <linux/netpoll.h>
  #include <linux/u64_stats_sync.h>
  #include <net/route.h>
+#include <net/ip.h>
  #include <linux/if_vlan.h>
  
  #define BR_HASH_BITS 8
@@ -304,6 +305,7 @@ struct net_bridge
  };
  
  struct br_input_skb_cb {
+	struct inet_skb_parm ip;	/* we don't interfere with ip's use of cb area */
  	struct net_device *brdev;
  #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
  	int igmp;


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-24  5:56                           ` David Newall
@ 2014-05-24 17:43                             ` David Miller
  2014-05-25  2:32                               ` David Newall
  0 siblings, 1 reply; 46+ messages in thread
From: David Miller @ 2014-05-24 17:43 UTC (permalink / raw)
  To: davidn
  Cc: bdschuym, netdev, vyasevich, bridge, fw, stephen, bsd, netfilter-devel

From: David Newall <davidn@davidnewall.com>
Date: Sat, 24 May 2014 15:26:24 +0930

> The patch now reverts the commit and mitigates the original problem by
> ensuring bridge's use of cb does not overlap ip's.

This patch was substantially corrupted by your email client.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-22 18:57                           ` Bart De Schuymer
@ 2014-05-24 18:00                             ` David Miller
  0 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2014-05-24 18:00 UTC (permalink / raw)
  To: bdschuym; +Cc: netdev, davidn, bridge, fw, stephen, netfilter-devel

From: Bart De Schuymer <bdschuym@pandora.be>
Date: Thu, 22 May 2014 20:57:13 +0200

> Please apply following patch:

This patch was corrupted and wouldn't apply cleanly.

But I applied it by hand for you, because clearly (as has been the
case for some time) you don't care.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-24 17:43                             ` David Miller
@ 2014-05-25  2:32                               ` David Newall
  2014-05-25  3:02                                 ` David Miller
                                                   ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: David Newall @ 2014-05-25  2:32 UTC (permalink / raw)
  To: David Miller
  Cc: bdschuym, fw, stephen, netdev, netfilter-devel, bridge, bsd, vyasevich

On 25/05/14 03:13, David Miller wrote:
> This patch was substantially corrupted by your email client.

We should be sending these things as mime attachments.  Having to put 
patches inline is brittle, absurd and a waste of everyone's time.  Is 
there actually anybody here who doesn't have a mime-compatible MUA?

Trying again...

--- linux-source-3.13.0/net/bridge/br_netfilter.c.orig	2014-05-17 00:12:23.418906498 +0930
+++ linux-source-3.13.0/net/bridge/br_netfilter.c	2014-05-17 01:04:43.540972961 +0930
@@ -253,73 +253,6 @@ static inline void nf_bridge_update_prot
  		skb->protocol = htons(ETH_P_PPP_SES);
  }
  
-/* When handing a packet over to the IP layer
- * check whether we have a skb that is in the
- * expected format
- */
-
-static int br_parse_ip_options(struct sk_buff *skb)
-{
-	struct ip_options *opt;
-	const struct iphdr *iph;
-	struct net_device *dev = skb->dev;
-	u32 len;
-
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	opt = &(IPCB(skb)->opt);
-
-	/* Basic sanity checks */
-	if (iph->ihl < 5 || iph->version != 4)
-		goto inhdr_error;
-
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
-
-	len = ntohs(iph->tot_len);
-	if (skb->len < len) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
-		goto drop;
-	} else if (len < (iph->ihl*4))
-		goto inhdr_error;
-
-	if (pskb_trim_rcsum(skb, len)) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
-		goto drop;
-	}
-
-	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-	if (iph->ihl == 5)
-		return 0;
-
-	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
-	if (ip_options_compile(dev_net(dev), opt, skb))
-		goto inhdr_error;
-
-	/* Check correct handling of SRR option */
-	if (unlikely(opt->srr)) {
-		struct in_device *in_dev = __in_dev_get_rcu(dev);
-		if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
-			goto drop;
-
-		if (ip_options_rcv_srr(skb))
-			goto drop;
-	}
-
-	return 0;
-
-inhdr_error:
-	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
-drop:
-	return -1;
-}
-
  /* Fill in the header for fragmented IP packets handled by
   * the IPv4 connection tracking code.
   */
@@ -679,6 +612,7 @@ static unsigned int br_nf_pre_routing(co
  {
  	struct net_bridge_port *p;
  	struct net_bridge *br;
+	const struct iphdr *iph;
  	__u32 len = nf_bridge_encap_header_len(skb);
  
  	if (unlikely(!pskb_may_pull(skb, len)))
@@ -704,10 +638,30 @@ static unsigned int br_nf_pre_routing(co
  		return NF_ACCEPT;
  
  	nf_bridge_pull_encap_header_rcsum(skb);
+
+	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+		return NF_DROP;
  
-	if (br_parse_ip_options(skb))
+	iph = ip_hdr(skb);
+	if (iph->ihl < 5 || iph->version != 4)
+		return NF_DROP;
+
+	if (!pskb_may_pull(skb, 4 * iph->ihl))
  		return NF_DROP;
  
+	iph = ip_hdr(skb);
+	if (ip_fast_csum((__u8 *) iph, iph->ihl) != 0)
+		return NF_DROP;
+
+	len = ntohs(iph->tot_len);
+	if (skb->len < len || len < 4 * iph->ihl)
+		return NF_DROP;
+
+	pskb_trim_rcsum(skb, len);
+
+	/* BUG: Should really parse the IP options here. */
+	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+
  	nf_bridge_put(skb->nf_bridge);
  	if (!nf_bridge_alloc(skb))
  		return NF_DROP;
@@ -806,9 +760,6 @@ static unsigned int br_nf_forward_ip(con
  		nf_bridge->mask |= BRNF_PKT_TYPE;
  	}
  
-	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
-		return NF_DROP;
-
  	/* The physdev module checks on this */
  	nf_bridge->mask |= BRNF_BRIDGED;
  	nf_bridge->physoutdev = skb->dev;
@@ -862,19 +813,14 @@ static unsigned int br_nf_forward_arp(co
  #if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV4)
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
  {
-	int ret;
-
  	if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
  	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
  	    !skb_is_gso(skb)) {
-		if (br_parse_ip_options(skb))
-			/* Drop invalid packet */
-			return NF_DROP;
-		ret = ip_fragment(skb, br_dev_queue_push_xmit);
+		/* BUG: Should really parse the IP options here. */
+		memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+		return ip_fragment(skb, br_dev_queue_push_xmit);
  	} else
-		ret = br_dev_queue_push_xmit(skb);
-
-	return ret;
+		return br_dev_queue_push_xmit(skb);
  }
  #else
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
--- linux-source-3.13.0/net/ipv4/ip_options.c.orig	2014-05-16 18:11:10.260370554 +0930
+++ linux-source-3.13.0/net/ipv4/ip_options.c	2014-05-17 01:01:56.738277137 +0930
@@ -475,7 +475,6 @@ error:
  	}
  	return -EINVAL;
  }
-EXPORT_SYMBOL(ip_options_compile);
  
  /*
   *	Undo all the changes done by ip_options_compile().
@@ -658,4 +657,3 @@ int ip_options_rcv_srr(struct sk_buff *s
  	}
  	return 0;
  }
-EXPORT_SYMBOL(ip_options_rcv_srr);
--- /usr/src/linux-source-3.13.0/net/bridge/br_private.h.orig	2014-05-24 13:51:09.269709831 +0930
+++ /usr/src/linux-source-3.13.0/net/bridge/br_private.h	2014-05-24 13:53:20.243551927 +0930
@@ -18,6 +18,7 @@
  #include <linux/netpoll.h>
  #include <linux/u64_stats_sync.h>
  #include <net/route.h>
+#include <net/ip.h>
  #include <linux/if_vlan.h>
  
  #define BR_HASH_BITS 8
@@ -304,6 +305,7 @@ struct net_bridge
  };
  
  struct br_input_skb_cb {
+	struct inet_skb_parm ip;	/* we don't interfere with ip's use of cb area */
  	struct net_device *brdev;
  #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
  	int igmp;

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-25  2:32                               ` David Newall
@ 2014-05-25  3:02                                 ` David Miller
  2014-05-25  6:37                                   ` David Newall
  2014-05-27  8:55                                 ` David Laight
  2014-05-29 22:34                                 ` David Miller
  2 siblings, 1 reply; 46+ messages in thread
From: David Miller @ 2014-05-25  3:02 UTC (permalink / raw)
  To: davidn
  Cc: bdschuym, netdev, vyasevich, bridge, fw, stephen, bsd, netfilter-devel

From: David Newall <davidn@davidnewall.com>
Date: Sun, 25 May 2014 12:02:03 +0930

> On 25/05/14 03:13, David Miller wrote:
>> This patch was substantially corrupted by your email client.
> 
> We should be sending these things as mime attachments.  Having to put
> patches inline is brittle, absurd and a waste of everyone's time.  Is
> there actually anybody here who doesn't have a mime-compatible MUA?

It makes replying and commenting inline easy.

It's not our problem that so many email clients make sending
plain unmolested ASCII text difficult.  But at least we've gone
out of our way to document how to do so in the kernel tree.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-25  3:02                                 ` David Miller
@ 2014-05-25  6:37                                   ` David Newall
  0 siblings, 0 replies; 46+ messages in thread
From: David Newall @ 2014-05-25  6:37 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, netfilter-devel, bridge

On 25/05/14 12:32, David Miller wrote:
> From: David Newall<davidn@davidnewall.com>
> Date: Sun, 25 May 2014 12:02:03 +0930
>
>> On 25/05/14 03:13, David Miller wrote:
>>> This patch was substantially corrupted by your email client.
>> >We should be sending these things as mime attachments.
> It makes replying and commenting inline easy.

Patches are intrinsically corrupted by commenting on them inline, and 
that doesn't matter.  What does matter is when a patch that people will 
need to test is corrupted, and sending them as mime attachments is the 
best answer I know of.  It's trivial to copy and paste from an 
attachment to the body so that you can comment; far easier than copying 
and pasting a patch verbatim (i.e. without corrupting it.)


> It's not our problem that so many email clients make sending
> plain unmolested ASCII text difficult.

It wasn't the email client; it was the xfce-terminal copy that corrupted 
it.  It's proven to corrupt this patch in two different ways; the other, 
which I caught before send, was because of unreliable scrollback.

In fact it is our problem when we insist that patches be sent in a way 
which we know is brittle and error-prone; our problem and our fault.  
Just imagine if you could have back all of the time you've wasted 
looking at included code, only to discover that it had been corrupted in 
some way or another; and then multiply that by everybody else who's 
wasted time the same way.  The argument that it makes it easy to comment 
is unconvincing to me because the alternative is so easy.

I apologise for this noise as I don't believe this is something which 
will change any time soon; it will change, just not soon.  I'm quite 
willing to drop the issue.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-25  2:32                               ` David Newall
  2014-05-25  3:02                                 ` David Miller
@ 2014-05-27  8:55                                 ` David Laight
  2014-05-29 22:34                                 ` David Miller
  2 siblings, 0 replies; 46+ messages in thread
From: David Laight @ 2014-05-27  8:55 UTC (permalink / raw)
  To: 'David Newall', David Miller
  Cc: bdschuym, fw, stephen, netdev, netfilter-devel, bridge, bsd, vyasevich

From: David Newall
> On 25/05/14 03:13, David Miller wrote:
> > This patch was substantially corrupted by your email client.
> 
> We should be sending these things as mime attachments.  Having to put
> patches inline is brittle, absurd and a waste of everyone's time.  Is
> there actually anybody here who doesn't have a mime-compatible MUA?

Yes - anyone using the email client from the world's largest desktop
computer software company.

It doesn't have any method for displaying text attachments.
It has a scheme for executing attachments, for which it will use
an interpreter based on the filename extension.
(Yes - this is why it is very good at propagating viruses.)

FWIW it can send valid patches quite easily - just copy/paste from wordpad.
(Possibly after hacking the registry to allow lines longer than 72 characters.

	David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-25  2:32                               ` David Newall
  2014-05-25  3:02                                 ` David Miller
  2014-05-27  8:55                                 ` David Laight
@ 2014-05-29 22:34                                 ` David Miller
  2014-05-30  9:17                                   ` David Newall
  2 siblings, 1 reply; 46+ messages in thread
From: David Miller @ 2014-05-29 22:34 UTC (permalink / raw)
  To: davidn
  Cc: bdschuym, fw, stephen, netdev, netfilter-devel, bridge, bsd, vyasevich

From: David Newall <davidn@davidnewall.com>
Date: Sun, 25 May 2014 12:02:03 +0930

> +	pskb_trim_rcsum(skb, len);

You really need to check the return value as this can perform allocations,
GFP_ATOMIC ones in fact.

Also, why are we not bumping the statistics any more?  I didn't see a
discussion of that in this thread.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-29 22:34                                 ` David Miller
@ 2014-05-30  9:17                                   ` David Newall
  2014-05-31  0:46                                     ` David Miller
  0 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-30  9:17 UTC (permalink / raw)
  To: David Miller
  Cc: bdschuym, fw, stephen, netdev, netfilter-devel, bridge, bsd, vyasevich

On 30/05/14 08:04, David Miller wrote:
> You really need to check the return value as this can perform allocations,
> GFP_ATOMIC ones in fact.
>
> Also, why are we not bumping the statistics any more?  I didn't see a
> discussion of that in this thread.

I was only restoring the code as it was before the commit.  Maybe this, 
(instead of the previous patch of br_netfilter.c,) to keep the (added) 
check on pskb_may_pull's return value and incremented statistics?

--- br_netfilter.c	2014-05-30 18:01:40.221868365 +0930
+++ br_netfilter.c.orig	2014-05-30 18:17:39.697425383 +0930
@@ -253,73 +253,6 @@ static inline void nf_bridge_update_prot
  		skb->protocol = htons(ETH_P_PPP_SES);
  }
  
-/* When handing a packet over to the IP layer
- * check whether we have a skb that is in the
- * expected format
- */
-
-static int br_parse_ip_options(struct sk_buff *skb)
-{
-	struct ip_options *opt;
-	const struct iphdr *iph;
-	struct net_device *dev = skb->dev;
-	u32 len;
-
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	opt = &(IPCB(skb)->opt);
-
-	/* Basic sanity checks */
-	if (iph->ihl < 5 || iph->version != 4)
-		goto inhdr_error;
-
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		goto inhdr_error;
-
-	iph = ip_hdr(skb);
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
-
-	len = ntohs(iph->tot_len);
-	if (skb->len < len) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
-		goto drop;
-	} else if (len < (iph->ihl*4))
-		goto inhdr_error;
-
-	if (pskb_trim_rcsum(skb, len)) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
-		goto drop;
-	}
-
-	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-	if (iph->ihl == 5)
-		return 0;
-
-	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
-	if (ip_options_compile(dev_net(dev), opt, skb))
-		goto inhdr_error;
-
-	/* Check correct handling of SRR option */
-	if (unlikely(opt->srr)) {
-		struct in_device *in_dev = __in_dev_get_rcu(dev);
-		if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
-			goto drop;
-
-		if (ip_options_rcv_srr(skb))
-			goto drop;
-	}
-
-	return 0;
-
-inhdr_error:
-	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
-drop:
-	return -1;
-}
-
  /* Fill in the header for fragmented IP packets handled by
   * the IPv4 connection tracking code.
   */
@@ -679,6 +612,8 @@ static unsigned int br_nf_pre_routing(co
  {
  	struct net_bridge_port *p;
  	struct net_bridge *br;
+	const struct iphdr *iph;
+	struct net_device *dev = skb->dev;
  	__u32 len = nf_bridge_encap_header_len(skb);
  
  	if (unlikely(!pskb_may_pull(skb, len)))
@@ -704,9 +639,35 @@ static unsigned int br_nf_pre_routing(co
  		return NF_ACCEPT;
  
  	nf_bridge_pull_encap_header_rcsum(skb);
+
+	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+		goto inhdr_error;
+
+	iph = ip_hdr(skb);
+	if (iph->ihl < 5 || iph->version != 4)
+		goto inhdr_error;
+
+	if (!pskb_may_pull(skb, 4 * iph->ihl))
+		goto inhdr_error;
+
+	iph = ip_hdr(skb);
+	if (ip_fast_csum((__u8 *) iph, iph->ihl) != 0)
+		goto inhdr_error;
+
+	len = ntohs(iph->tot_len);
+	if (skb->len < len) {
+		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
+		return NF_DROP;
+	} else if (len < (iph->ihl*4))
+		goto inhdr_error;
  
-	if (br_parse_ip_options(skb))
+	if (pskb_trim_rcsum(skb, len)) {
+		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
  		return NF_DROP;
+	}
+
+	/* BUG: Should really parse the IP options here. */
+	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
  
  	nf_bridge_put(skb->nf_bridge);
  	if (!nf_bridge_alloc(skb))
@@ -720,6 +681,10 @@ static unsigned int br_nf_pre_routing(co
  		br_nf_pre_routing_finish);
  
  	return NF_STOLEN;
+
+inhdr_error:
+	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
+	return NF_DROP;
  }
  
  
@@ -806,9 +771,6 @@ static unsigned int br_nf_forward_ip(con
  		nf_bridge->mask |= BRNF_PKT_TYPE;
  	}
  
-	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
-		return NF_DROP;
-
  	/* The physdev module checks on this */
  	nf_bridge->mask |= BRNF_BRIDGED;
  	nf_bridge->physoutdev = skb->dev;
@@ -862,19 +824,14 @@ static unsigned int br_nf_forward_arp(co
  #if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV4)
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
  {
-	int ret;
-
  	if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
  	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
  	    !skb_is_gso(skb)) {
-		if (br_parse_ip_options(skb))
-			/* Drop invalid packet */
-			return NF_DROP;
-		ret = ip_fragment(skb, br_dev_queue_push_xmit);
+		/* BUG: Should really parse the IP options here. */
+		memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+		return ip_fragment(skb, br_dev_queue_push_xmit);
  	} else
-		ret = br_dev_queue_push_xmit(skb);
-
-	return ret;
+		return br_dev_queue_push_xmit(skb);
  }
  #else
  static int br_nf_dev_queue_xmit(struct sk_buff *skb)


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-30  9:17                                   ` David Newall
@ 2014-05-31  0:46                                     ` David Miller
  2014-05-31  6:13                                       ` David Newall
  0 siblings, 1 reply; 46+ messages in thread
From: David Miller @ 2014-05-31  0:46 UTC (permalink / raw)
  To: davidn
  Cc: bdschuym, fw, stephen, netdev, netfilter-devel, bridge, bsd, vyasevich

From: David Newall <davidn@davidnewall.com>
Date: Fri, 30 May 2014 18:47:58 +0930

> On 30/05/14 08:04, David Miller wrote:
>> You really need to check the return value as this can perform
>> allocations,
>> GFP_ATOMIC ones in fact.
>>
>> Also, why are we not bumping the statistics any more?  I didn't see a
>> discussion of that in this thread.
> 
> I was only restoring the code as it was before the commit.  Maybe
> this, (instead of the previous patch of br_netfilter.c,) to keep the
> (added) check on pskb_may_pull's return value and incremented
> statistics?

I don't see why you don't simply keep br_parse_ip_options() around
and adjust it as you need, you're just mostly duplicating it's
contents into br_nf_pre_routing().

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-31  0:46                                     ` David Miller
@ 2014-05-31  6:13                                       ` David Newall
  2014-05-31  6:37                                         ` David Miller
  0 siblings, 1 reply; 46+ messages in thread
From: David Newall @ 2014-05-31  6:13 UTC (permalink / raw)
  To: David Miller
  Cc: bdschuym, fw, stephen, netdev, netfilter-devel, bridge, bsd, vyasevich

On 31/05/14 10:16, David Miller wrote:
> I don't see why you don't simply keep br_parse_ip_options() around
> and adjust it as you need, you're just mostly duplicating it's
> contents into br_nf_pre_routing().

More accurately, I'm *restoring* br_parse_ip_options()'s contents to 
br_nf_pre_routing().  The reasons why are twofold: I'm undoing a change 
which turns out to have been a mistake; and leaving it largely as-is, 
just removing the call to ip_options_compile(), would be confusing in 
that the name (br_pase_ip_options()) gives an expectation of function 
that would be untrue.

I can see an argument in favour of leaving br_parse_options() around, 
being that it is called from three places, and thus restoring the code 
removes checks which are currently being performed.  They weren't being 
performed before and it's not clear that they are needed, but if you say 
that it would be better, I'll leave it around and just remove the call 
to ip_options_compile(). Just say the word.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
  2014-05-31  6:13                                       ` David Newall
@ 2014-05-31  6:37                                         ` David Miller
  0 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2014-05-31  6:37 UTC (permalink / raw)
  To: davidn
  Cc: bdschuym, fw, stephen, netdev, netfilter-devel, bridge, bsd, vyasevich

From: David Newall <davidn@davidnewall.com>
Date: Sat, 31 May 2014 15:43:16 +0930

> On 31/05/14 10:16, David Miller wrote:
>> I don't see why you don't simply keep br_parse_ip_options() around
>> and adjust it as you need, you're just mostly duplicating it's
>> contents into br_nf_pre_routing().
> 
> More accurately, I'm *restoring* br_parse_ip_options()'s contents to
> br_nf_pre_routing().  The reasons why are twofold: I'm undoing a
> change which turns out to have been a mistake; and leaving it largely
> as-is, just removing the call to ip_options_compile(), would be
> confusing in that the name (br_pase_ip_options()) gives an expectation
> of function that would be untrue.
> 
> I can see an argument in favour of leaving br_parse_options() around,
> being that it is called from three places, and thus restoring the code
> removes checks which are currently being performed.  They weren't
> being performed before and it's not clear that they are needed, but if
> you say that it would be better, I'll leave it around and just remove
> the call to ip_options_compile(). Just say the word.

You can rename the function to something more suitable.

Because then it's just a handful of line changes rather than a huge
bunch of hunks which are harder to audit.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2014-05-31  6:37 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-11 14:41 Bad checksum on bridge with IP options David Newall
2014-05-11 19:42 ` Lukas Tribus
2014-05-12  8:14   ` David Newall
2014-05-12 10:15     ` Lukas Tribus
2014-05-12 10:25       ` David Newall
2014-05-12 10:31         ` Lukas Tribus
2014-05-12 10:48           ` David Newall
2014-05-12 13:23 ` David Newall
2014-05-12 13:51   ` Florian Westphal
2014-05-12 14:19     ` David Newall
2014-05-12 18:54   ` Lukas Tribus
2014-05-12 23:46     ` David Newall
2014-05-14 13:08       ` David Newall
2014-05-16 14:33         ` Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack) David Newall
2014-05-16 15:19           ` Eric Dumazet
2014-05-16 15:23             ` David Newall
2014-05-16 15:24             ` David Newall
2014-05-19 12:58           ` David Newall
2014-05-19 14:01             ` Florian Westphal
2014-05-19 14:19               ` David Newall
2014-05-19 17:09                 ` Florian Westphal
2014-05-19 20:49                   ` Bart De Schuymer
2014-05-21  7:49                     ` David Newall
2014-05-21 18:51                       ` Bart De Schuymer
2014-05-21 20:18                         ` David Miller
2014-05-22 18:57                           ` Bart De Schuymer
2014-05-24 18:00                             ` David Miller
2014-05-24  5:56                           ` David Newall
2014-05-24 17:43                             ` David Miller
2014-05-25  2:32                               ` David Newall
2014-05-25  3:02                                 ` David Miller
2014-05-25  6:37                                   ` David Newall
2014-05-27  8:55                                 ` David Laight
2014-05-29 22:34                                 ` David Miller
2014-05-30  9:17                                   ` David Newall
2014-05-31  0:46                                     ` David Miller
2014-05-31  6:13                                       ` David Newall
2014-05-31  6:37                                         ` David Miller
2014-05-22  3:50                         ` David Newall
2014-05-22 18:57                           ` Bart De Schuymer
2014-05-20  3:57                   ` David Newall
2014-05-20  4:55                 ` Valdis.Kletnieks
2014-05-20 16:05                   ` Vlad Yasevich
2014-05-21  8:10                   ` David Newall
2014-05-21 20:14                     ` David Miller
2014-05-22 20:06           ` Bandan Das

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).