linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] flush_cache_page while pte valid
@ 2002-11-11 18:25 Hugh Dickins
  2002-11-11 18:35 ` Andrew Morton
  2002-11-11 23:19 ` David S. Miller
  0 siblings, 2 replies; 29+ messages in thread
From: Hugh Dickins @ 2002-11-11 18:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dave McCracken, Rik van Riel, David S. Miller, linux-kernel

On some architectures (cachetlb.txt gives HyperSparc as an example)
it is essential to flush_cache_page while pte is still valid: the
rmap VM diverged from the base 2.4 VM before that fix was made,
so this error has crept back into 2.5.

Patch below applies to 2.5.47 or 2.5.47-mm1 - needs more work over
shared pagetables, but they've silently fallen out of 2.5.47-mm1:
oversight?  I'll send Alan the equivalent for 2.4-ac shortly.

(I wonder, what happens if userspace now modifies the page
after the flush_cache_page, before the pte is invalidated?)

--- 2.5.47/mm/rmap.c	Mon Oct  7 20:37:50 2002
+++ linux/mm/rmap.c	Mon Nov 11 17:01:27 2002
@@ -393,9 +393,9 @@
 	}
 
 	/* Nuke the page table entry. */
+	flush_cache_page(vma, address);
 	pte = ptep_get_and_clear(ptep);
 	flush_tlb_page(vma, address);
-	flush_cache_page(vma, address);
 
 	/* Store the swap location in the pte. See handle_pte_fault() ... */
 	if (PageSwapCache(page)) {


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-11 18:25 [PATCH] flush_cache_page while pte valid Hugh Dickins
@ 2002-11-11 18:35 ` Andrew Morton
  2002-11-11 23:19 ` David S. Miller
  1 sibling, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2002-11-11 18:35 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Dave McCracken, Rik van Riel, David S. Miller, linux-kernel

Hugh Dickins wrote:
> 
> On some architectures (cachetlb.txt gives HyperSparc as an example)
> it is essential to flush_cache_page while pte is still valid: the
> rmap VM diverged from the base 2.4 VM before that fix was made,
> so this error has crept back into 2.5.

OK, thanks.

> Patch below applies to 2.5.47 or 2.5.47-mm1 - needs more work over
> shared pagetables, but they've silently fallen out of 2.5.47-mm1:
> oversight?

Yes, oversight.  dcache-rcu was similarly lost.  Pathetic, isn't it?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-11 18:25 [PATCH] flush_cache_page while pte valid Hugh Dickins
  2002-11-11 18:35 ` Andrew Morton
@ 2002-11-11 23:19 ` David S. Miller
  2002-11-11 23:34   ` [PATCH] [RFC] increase MAX_ADDR_LEN Roland Dreier
  2002-11-12  6:53   ` [PATCH] flush_cache_page while pte valid Hugh Dickins
  1 sibling, 2 replies; 29+ messages in thread
From: David S. Miller @ 2002-11-11 23:19 UTC (permalink / raw)
  To: hugh; +Cc: akpm, dmccr, riel, linux-kernel

   From: Hugh Dickins <hugh@veritas.com>
   Date: Mon, 11 Nov 2002 18:25:25 +0000 (GMT)

   On some architectures (cachetlb.txt gives HyperSparc as an example)
   it is essential to flush_cache_page while pte is still valid: the
   rmap VM diverged from the base 2.4 VM before that fix was made,
   so this error has crept back into 2.5.
...   
   (I wonder, what happens if userspace now modifies the page
   after the flush_cache_page, before the pte is invalidated?)

Thanks for catching this.

On architectures that are affected (such as the mentioned HyperSPARC
chips), the cpu will take a trap and OOPS the kernel if the PTE is
invalidated before the cache flush is made.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-11 23:19 ` David S. Miller
@ 2002-11-11 23:34   ` Roland Dreier
  2002-11-11 23:38     ` David S. Miller
  2002-11-12  6:53   ` [PATCH] flush_cache_page while pte valid Hugh Dickins
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2002-11-11 23:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

Hi Dave,

I posted this to lkml last week and didn't get any response, positive
or negative, so I'm sending this directly to you.  Would you include
these patches (or something that solves the same problem) in 2.5?
This will make it possible to add IP-over-InfiniBand when the driver
is ready without changes to the core kernel.  Since the driver will
not be ready until 2.6 time, I would very much like to get all the
core changes required into the kernel now, before it's too late.

I'm very open to a suggestion of a better way to handle the fact that
IPoIB has a 20 byte hardware address, and I'd be happy to implement a
patch for an alternative approach if that's what it will take to get
support into the kernel.

Thanks,
  Roland <roland@topspin.com>

Here's my previous email:

Included below are some changes that I would like to see included (in
some form) in the standard kernel.  They will make future support for
InfiniBand much simpler; while the code at <http://infiniband.sf.net>
is nowhere near ready for widespread use (let alone inclusion in the
kernel), if we can get these small changes in the kernel then it will
be much simpler to distribute and use InfiniBand drivers in the future.

Note that I am definitely not asking that these patches be included
as-is in the kernel.  I would just like to open discussion by raising
an issue and proposing one possible solution.

These patches (against Linus's BK tree of a few days ago):

  1. Add ARPHRD_INFINIBAND to if_arp.h.  This is extremely minor, but
     does make an IP-over-InfiniBand driver cleaner.  My feeling is we
     might as well do this.

  2. Increase MAX_ADDR_LEN to 32 (from 8) and make some small changes
     to prevent this from causing problems for existing code.

  3. Get rid of suspicious-looking uses of MAX_ADDR_LEN in the sungem
     and s390/net/lcs drivers.  I have no idea whether these changes
     are necessary but I wanted to call this code to the maintainer's
     attention.

I expect #2 to be somewhat controversial, so let me give my motivation
for proposing this.  The IETF draft for IP-over-InfiniBand (which
seems very unlikely to change at this point) defines the IPoIB
hardware address to be 20 bytes long.  Of course, not everyone feels
this was the best way to define the encapsulation, but the issue was
extensively debated in the IETF ipoib working group, and the 20 byte
hardware address seems to be the least bad option.

(By the way, I have no particular attachment to the exact number 32.
As long as we raise MAX_ADDR_LEN to at least 20, I'll be happy.  It
does seem desirable for MAX_ADDR_LEN to be a multiple of 8.  With this
in mind, 24 would be perfectly acceptable for IPoIB; I just chose 32
becase I had to choose something!)

There are two ways to implement an IPoIB network driver:

  1. The driver can tell the kernel that the hardware address length
     is less than it really is, and rewrite packets as they pass into
     and out of the driver.  I've actually implemented this, and while
     it works, it is ugly, complex, and implies that special tools are
     required in userspace (for example, creating an ARP entry
     requires a special mechanism for passing the hardware address
     directly to the driver, and then creating a corresponding entry
     in the kernel's ARP table).

  2. The driver can give the real 20-byte hardware address length to
     the kernel.  This seems much cleaner, but of course it requires
     the value of MAX_ADDR_LEN to be increased.

Since the second option seems so much better, I would like to get the
needed changes into the kernel before the 2.6 release.  If this change
is not in the standard kernel, then vendor kernels will likely not
pick it up.  This will substantially complicate the task of developing
an IPoIB driver, since anyone who wants to test or use the driver will
have to patch their kernel.  If these changes go into the standard
kernel, then an IPoIB driver can simply be distributed as a module
that is compiled and loaded into any 2.6 kernel.

Please try these patches and let me know if they cause any problems
for you.  I am currently running these patches and a small dummy net
driver that registers a device with type ARPHRD_INFINIBAND and
hardware address length 20 on my workstation without trouble.
ifconfig does produce bogus output for the dummy network device but
that is another battle.  (I would be interested in ideas for how to
resolve the fact that the sa_data member of struct sockaddr is only 14
bytes long)

Of course I am also eager to find out other developers' thoughts on
how to implement IPoIB on Linux.

Thanks,
  Roland <roland@topspin.com>

===================================================================

You can import this changeset into BK by piping this whole message to:
'| bk receive [path to repository]' or apply the patch as usual.

===================================================================


ChangeSet@1.874, 2002-11-07 11:37:37-08:00, roland@topspin.com
  Increase MAX_ADDR_LEN from 8 to 32 (to support
  IP-over-InfiniBand hardware addresses)

ChangeSet@1.873, 2002-11-07 11:35:31-08:00, roland@topspin.com
  Add IANA-assigned ARPHRD_INFINIBAND value for IP-over-InfiniBand to if_arp.h


 drivers/net/sungem.c      |    2 +-
 drivers/s390/net/lcs.c    |    2 +-
 include/linux/if_arp.h    |    1 +
 include/linux/netdevice.h |    2 +-
 net/core/dev.c            |    4 ++--
 5 files changed, 6 insertions(+), 5 deletions(-)


diff -Nru a/drivers/net/sungem.c b/drivers/net/sungem.c
--- a/drivers/net/sungem.c	Thu Nov  7 12:47:32 2002
+++ b/drivers/net/sungem.c	Thu Nov  7 12:47:32 2002
@@ -2858,7 +2858,7 @@
 		printk(KERN_ERR "%s: can't get mac-address\n", dev->name);
 		return -1;
 	}
-	memcpy(dev->dev_addr, addr, MAX_ADDR_LEN);
+	memcpy(dev->dev_addr, addr, 6);
 #else
 	get_gem_mac_nonobp(gp->pdev, gp->dev->dev_addr);
 #endif
diff -Nru a/drivers/s390/net/lcs.c b/drivers/s390/net/lcs.c
--- a/drivers/s390/net/lcs.c	Thu Nov  7 12:47:32 2002
+++ b/drivers/s390/net/lcs.c	Thu Nov  7 12:47:32 2002
@@ -3569,7 +3569,7 @@
 	struct ifmcaddr6 *im6;
 	struct inet6_dev *in6_dev;
 #endif
-	char buf[MAX_ADDR_LEN];
+	char buf[LCS_ADDR_LEN];
 	lcs_ipm_list *curr_lmem, *new_lmem;
 	lcs_state oldstate;
 
diff -Nru a/include/linux/if_arp.h b/include/linux/if_arp.h
--- a/include/linux/if_arp.h	Thu Nov  7 12:47:32 2002
+++ b/include/linux/if_arp.h	Thu Nov  7 12:47:32 2002
@@ -40,6 +40,7 @@
 #define ARPHRD_METRICOM	23		/* Metricom STRIP (new IANA id)	*/
 #define	ARPHRD_IEEE1394	24		/* IEEE 1394 IPv4 - RFC 2734	*/
 #define ARPHRD_EUI64	27		/* EUI-64                       */
+#define ARPHRD_INFINIBAND 32		/* InfiniBand			*/
 
 /* Dummy types for non ARP hardware */
 #define ARPHRD_SLIP	256
diff -Nru a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h	Thu Nov  7 12:47:32 2002
+++ b/include/linux/netdevice.h	Thu Nov  7 12:47:32 2002
@@ -65,7 +65,7 @@
 
 #endif
 
-#define MAX_ADDR_LEN	8		/* Largest hardware address length */
+#define MAX_ADDR_LEN	32		/* Largest hardware address length */
 
 /*
  *	Compute the worst case header length according to the protocols
diff -Nru a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c	Thu Nov  7 12:47:32 2002
+++ b/net/core/dev.c	Thu Nov  7 12:47:32 2002
@@ -2062,7 +2062,7 @@
 
 		case SIOCGIFHWADDR:
 			memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,
-			       MAX_ADDR_LEN);
+			       min(sizeof ifr->ifr_hwaddr.sa_data, dev->addr_len));
 			ifr->ifr_hwaddr.sa_family = dev->type;
 			return 0;
 
@@ -2083,7 +2083,7 @@
 			if (ifr->ifr_hwaddr.sa_family != dev->type)
 				return -EINVAL;
 			memcpy(dev->broadcast, ifr->ifr_hwaddr.sa_data,
-			       MAX_ADDR_LEN);
+			       min(sizeof ifr->ifr_hwaddr.sa_data, dev->addr_len));
 			notifier_call_chain(&netdev_chain,
 					    NETDEV_CHANGEADDR, dev);
 			return 0;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-11 23:34   ` [PATCH] [RFC] increase MAX_ADDR_LEN Roland Dreier
@ 2002-11-11 23:38     ` David S. Miller
  2002-11-11 23:58       ` Roland Dreier
  2002-11-12  0:18       ` Alan Cox
  0 siblings, 2 replies; 29+ messages in thread
From: David S. Miller @ 2002-11-11 23:38 UTC (permalink / raw)
  To: roland; +Cc: linux-kernel


So how are apps able to specify such larger hw addresses to configure
a driver if IFHWADDRLEN is still 6?

I'm not going to increase MAX_ADDR_LEN if there is no user ABI capable
of configuring such larger addresses properly.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-11 23:38     ` David S. Miller
@ 2002-11-11 23:58       ` Roland Dreier
  2002-11-12  0:18       ` Alan Cox
  1 sibling, 0 replies; 29+ messages in thread
From: Roland Dreier @ 2002-11-11 23:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

>>>>> "David" == David S Miller <davem@redhat.com> writes:

    David> So how are apps able to specify such larger hw addresses to
    David> configure a driver if IFHWADDRLEN is still 6?

In the InfiniBand case, the device's hardware address comes from a
combination of the port GID (which is set by the InfiniBand subnet
manager through an IB-specific mechanism) and the queue pair number
that the driver gets when it initializes.  There definitely still are 
problems to solve, such as specifying static ARP entries.

    David> I'm not going to increase MAX_ADDR_LEN if there is no user
    David> ABI capable of configuring such larger addresses properly.

What would you consider a palatable ABI?  (I'm happy to implement it)
Enlarging sa_data in struct sockaddr doesn't seem feasible.  I guess
we could add a new socket ioctl() or extend SIOCGIFHWADDR/SIOCSIFHWADDR
somehow....

Thanks,
  Roland <roland@topspin.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12  0:18       ` Alan Cox
@ 2002-11-12  0:01         ` Roland Dreier
  2002-11-12 14:23           ` Alan Cox
  0 siblings, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2002-11-12  0:01 UTC (permalink / raw)
  To: Alan Cox; +Cc: David S. Miller, Linux Kernel Mailing List

>>>>> "Alan" == Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

    >>>>> On Mon, 2002-11-11 at 23:38, David S. Miller wrote:

    Dave> So how are apps able to specify such larger hw addresses to
    Dave> configure a driver if IFHWADDRLEN is still 6?

    Dave> I'm not going to increase MAX_ADDR_LEN if there is no user
    Dave> ABI capable of configuring such larger addresses properly.

    Alan> The kernel just ignores it. We support multiple devices with
    Alan> larger address lengths. Its mostly a legacy constant

What drivers in the kernel are there with address length more than
MAX_ADDR_LEN?  What do they put dev->addr_len and dev->dev_addr?

Thanks,
  Roland  <roland@topspin.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-11 23:38     ` David S. Miller
  2002-11-11 23:58       ` Roland Dreier
@ 2002-11-12  0:18       ` Alan Cox
  2002-11-12  0:01         ` Roland Dreier
  1 sibling, 1 reply; 29+ messages in thread
From: Alan Cox @ 2002-11-12  0:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: roland, Linux Kernel Mailing List

On Mon, 2002-11-11 at 23:38, David S. Miller wrote:
> 
> So how are apps able to specify such larger hw addresses to configure
> a driver if IFHWADDRLEN is still 6?
> 
> I'm not going to increase MAX_ADDR_LEN if there is no user ABI capable
> of configuring such larger addresses properly.

The kernel just ignores it. We support multiple devices with larger
address lengths. Its mostly a legacy constant


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-11 23:19 ` David S. Miller
  2002-11-11 23:34   ` [PATCH] [RFC] increase MAX_ADDR_LEN Roland Dreier
@ 2002-11-12  6:53   ` Hugh Dickins
  2002-11-12  6:53     ` David S. Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2002-11-12  6:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, dmccr, riel, linux-kernel

On Mon, 11 Nov 2002, David S. Miller wrote:
>    From: Hugh Dickins <hugh@veritas.com>
>    Date: Mon, 11 Nov 2002 18:25:25 +0000 (GMT)
> 
>    On some architectures (cachetlb.txt gives HyperSparc as an example)
>    it is essential to flush_cache_page while pte is still valid: the
>    rmap VM diverged from the base 2.4 VM before that fix was made,
>    so this error has crept back into 2.5.
> ...   
>    (I wonder, what happens if userspace now modifies the page
>    after the flush_cache_page, before the pte is invalidated?)
> 
> Thanks for catching this.
> 
> On architectures that are affected (such as the mentioned HyperSPARC
> chips), the cpu will take a trap and OOPS the kernel if the PTE is
> invalidated before the cache flush is made.

Thanks for shedding light on that; but I'm still wondering if there
might be data loss if userspace modifies the page in the tiny window
between correctly positioned flush_cache_page and pte invalidation?

Hugh


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-12  6:53   ` [PATCH] flush_cache_page while pte valid Hugh Dickins
@ 2002-11-12  6:53     ` David S. Miller
  2002-11-12 14:49       ` Rik van Riel
  2002-11-12 17:43       ` Hugh Dickins
  0 siblings, 2 replies; 29+ messages in thread
From: David S. Miller @ 2002-11-12  6:53 UTC (permalink / raw)
  To: hugh; +Cc: akpm, dmccr, riel, linux-kernel

   From: Hugh Dickins <hugh@veritas.com>
   Date: Tue, 12 Nov 2002 06:53:04 +0000 (GMT)
   
   Thanks for shedding light on that; but I'm still wondering if there
   might be data loss if userspace modifies the page in the tiny window
   between correctly positioned flush_cache_page and pte invalidation?

The flush merely writes back the data, a copy-back operation, fully L2
cache coherent.  All cpus will see correct data if an intermittant
store occurs.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12 14:23           ` Alan Cox
@ 2002-11-12 14:20             ` Folkert van Heusden
  2002-11-12 15:14             ` Roland Dreier
  1 sibling, 0 replies; 29+ messages in thread
From: Folkert van Heusden @ 2002-11-12 14:20 UTC (permalink / raw)
  To: alan; +Cc: roland, davem, linux-kernel

Hi,

Am I right that in net/ipv4/tcp_ipv4.c in function "tcp_v4_get_port" the
portnumber for a new connection is generated? Because fiddling with that
code seems to have no effect on the portnumbers generated for new
connections.


Folkert



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12  0:01         ` Roland Dreier
@ 2002-11-12 14:23           ` Alan Cox
  2002-11-12 14:20             ` Folkert van Heusden
  2002-11-12 15:14             ` Roland Dreier
  0 siblings, 2 replies; 29+ messages in thread
From: Alan Cox @ 2002-11-12 14:23 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David S. Miller, Linux Kernel Mailing List

On Tue, 2002-11-12 at 00:01, Roland Dreier wrote:
> What drivers in the kernel are there with address length more than
> MAX_ADDR_LEN?  What do they put dev->addr_len and dev->dev_addr?

AX.25 addresses are 7 bytes long at the physical layer, but because they
including routing info up to about 60 bytes long elsewhere. Fortunately
lengths are passed to all but the hw level SIOCSIF/SIOCGIF calls, and
even those can be increased a little without harm as they use the
struct sockaddr - in fact I think 14 bytes would be the limit.

Seems trivial to do in 2.5 and quite doable in 2.4 if you actually have
to worry about this at the SIOCSIFADDR/GIFHWADDR level.

Alan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-12  6:53     ` David S. Miller
@ 2002-11-12 14:49       ` Rik van Riel
  2002-11-12 21:45         ` David S. Miller
  2002-11-12 17:43       ` Hugh Dickins
  1 sibling, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2002-11-12 14:49 UTC (permalink / raw)
  To: David S. Miller; +Cc: hugh, akpm, dmccr, linux-kernel

On Mon, 11 Nov 2002, David S. Miller wrote:
>    From: Hugh Dickins <hugh@veritas.com>
>    Date: Tue, 12 Nov 2002 06:53:04 +0000 (GMT)
>
>    Thanks for shedding light on that; but I'm still wondering if there
>    might be data loss if userspace modifies the page in the tiny window
>    between correctly positioned flush_cache_page and pte invalidation?
>
> The flush merely writes back the data, a copy-back operation, fully L2
> cache coherent.  All cpus will see correct data if an intermittant
> store occurs.

The CPUs will, but the harddisk might not.

We really need to get this right in the swapout path.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
Current spamtrap:  <a href=mailto:"october@surriel.com">october@surriel.com</a>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12 14:23           ` Alan Cox
  2002-11-12 14:20             ` Folkert van Heusden
@ 2002-11-12 15:14             ` Roland Dreier
  2002-11-12 16:00               ` Alan Cox
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2002-11-12 15:14 UTC (permalink / raw)
  To: Alan Cox; +Cc: David S. Miller, Linux Kernel Mailing List

>>>>> "Alan" == Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

    Alan> AX.25 addresses are 7 bytes long at the physical layer, but
    Alan> because they including routing info up to about 60 bytes
    Alan> long elsewhere. Fortunately lengths are passed to all but
    Alan> the hw level SIOCSIF/SIOCGIF calls, and even those can be
    Alan> increased a little without harm as they use the struct
    Alan> sockaddr - in fact I think 14 bytes would be the limit.

    Alan> Seems trivial to do in 2.5 and quite doable in 2.4 if you
    Alan> actually have to worry about this at the
    Alan> SIOCSIFADDR/GIFHWADDR level.

Hmm... the problem I would like to solve is that IP-over-InfiniBand
has 20 byte hardware addresses.  One can implement a driver that lies
about its HW address len (you have an internal ARP cache and only
expose a hash value/cookie to the rest of the world).  In fact I've
done just that for IPoIB.

It works but it's ugly and overly complex.  I would like to find a
clean solution for 2.5/2.6, but it looks like it will require changes
to the net core.  I'd like to do those now so the IPoIB driver can
just drop in cleanly later.  (I want to be able to just add
drivers/infiniband during 2.6 without and invasive changes required)

To do that seems to require increasing MAX_ADDR_LEN -- otherwise I
don't see what the driver can put in addr_len/dev_addr.

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12 16:00               ` Alan Cox
@ 2002-11-12 15:50                 ` YOSHIFUJI Hideaki / 吉藤英明
  2002-11-12 20:36                 ` Roland Dreier
  1 sibling, 0 replies; 29+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2002-11-12 15:50 UTC (permalink / raw)
  To: alan; +Cc: roland, davem, linux-kernel

In article <1037116836.8500.55.camel@irongate.swansea.linux.org.uk> (at 12 Nov 2002 16:00:36 +0000), Alan Cox <alan@lxorguk.ukuu.org.uk> says:

> 2. Add some new address setting ioctls, and ensure the old ones keep the
> old address length limit. That is needed because the old caller wont
> have allocated enough address space for a 20 byte address return.
> 
> You have to solve both though, and the first patch should probably be
> the one to add more sensible address set/get functions.

*BSDs have SIOCGLIFPHYADDR etc., but, they're obsolete; 
we should use rtnetlink (or routing socket in BSDs) to manage 
addresses.  So, not having such ioctls for long addresses 
would be ok.

-- 
Hideaki YOSHIFUJI @ USAGI Project <yoshfuji@linux-ipv6.org>
GPG FP: 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12 15:14             ` Roland Dreier
@ 2002-11-12 16:00               ` Alan Cox
  2002-11-12 15:50                 ` YOSHIFUJI Hideaki / 吉藤英明
  2002-11-12 20:36                 ` Roland Dreier
  0 siblings, 2 replies; 29+ messages in thread
From: Alan Cox @ 2002-11-12 16:00 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David S. Miller, Linux Kernel Mailing List

On Tue, 2002-11-12 at 15:14, Roland Dreier wrote:
> Hmm... the problem I would like to solve is that IP-over-InfiniBand
> has 20 byte hardware addresses.  One can implement a driver that lies
> about its HW address len (you have an internal ARP cache and only
> expose a hash value/cookie to the rest of the world).  In fact I've
> done just that for IPoIB.

Our ARP code handles AX.25 fine, and that can include long addresses, so
either we have a live bug or it works 8). The multicast layer does have
problems with addresses over 8 bytes (MAX_ADDR_LEN).

> It works but it's ugly and overly complex.  I would like to find a
> clean solution for 2.5/2.6, but it looks like it will require changes
> to the net core.  I'd like to do those now so the IPoIB driver can
> just drop in cleanly later.  (I want to be able to just add
> drivers/infiniband during 2.6 without and invasive changes required)

I think you need to do two things.

1. Increase MAX_ADDR_LEN
2. Add some new address setting ioctls, and ensure the old ones keep the
old address length limit. That is needed because the old caller wont
have allocated enough address space for a 20 byte address return.

You have to solve both though, and the first patch should probably be
the one to add more sensible address set/get functions.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-12  6:53     ` David S. Miller
  2002-11-12 14:49       ` Rik van Riel
@ 2002-11-12 17:43       ` Hugh Dickins
  2002-11-12 21:51         ` David S. Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2002-11-12 17:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, dmccr, riel, linux-kernel

On Mon, 11 Nov 2002, David S. Miller wrote:
>    From: Hugh Dickins <hugh@veritas.com>
>    Date: Tue, 12 Nov 2002 06:53:04 +0000 (GMT)
>    
>    Thanks for shedding light on that; but I'm still wondering if there
>    might be data loss if userspace modifies the page in the tiny window
>    between correctly positioned flush_cache_page and pte invalidation?
> 
> The flush merely writes back the data, a copy-back operation, fully L2
> cache coherent.  All cpus will see correct data if an intermittant
> store occurs.

Sorry, I still don't get it.  If the flush_cache_page is doing something
necessary, then won't a user access in between it and invalidating pte
undo what was necessary?  And if it's not necessary, why do we do it?
(For better performance would be a very good reason.)

But don't worry about me: I may well have some fundamental misconception
which emailing back and forth will fail to resolve, would just waste your
time and expose my ignorance.  (I think Andrew has sometimes protested
that "flush" can mean too many different things.)  So I'm trying to
understand it better from over here - but glad to see Rik seems
to understand my concern.

Hugh


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12 16:00               ` Alan Cox
  2002-11-12 15:50                 ` YOSHIFUJI Hideaki / 吉藤英明
@ 2002-11-12 20:36                 ` Roland Dreier
  2002-11-12 22:31                   ` David S. Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2002-11-12 20:36 UTC (permalink / raw)
  To: Alan Cox; +Cc: David S. Miller, Linux Kernel Mailing List

>>>>> "Alan" == Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

    Alan> 1. Increase MAX_ADDR_LEN 2. Add some new address setting
    Alan> ioctls, and ensure the old ones keep the old address length
    Alan> limit. That is needed because the old caller wont have
    Alan> allocated enough address space for a 20 byte address return.

Thanks to YOSHIFUJI Hideaki / 吉藤英明, I had a look at rtnetlink.  It
seems like we would get the necessary address setting functionality if
I implemented the following:

  1. Add an RTM_SETLINK message type that handles at least the
     IFLA_ADDRESS attribute.  This would replace SIOCSIFHWADDR for
     interfaces with long hardware addresses.

  2. Add code to handle receiving RTM_NEWNEIGH and RTM_DELNEIGH
     messages from user space.  This would replace SIOCSARP and
     SIOCDARP for interfaces with long hardware addresses.

Dave, Alan, if I wrote a patch to do this would you accept it?  (And
following that increase MAX_ADDR_LEN?)

(By the way the original patch I posted added code to the
SIOCSIFHWADDR/SIOCGIFHWADDR handler to prevent a long hardware address
from overrunning the ifr_data member that user space passed in)

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-12 14:49       ` Rik van Riel
@ 2002-11-12 21:45         ` David S. Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David S. Miller @ 2002-11-12 21:45 UTC (permalink / raw)
  To: riel; +Cc: hugh, akpm, dmccr, linux-kernel

   From: Rik van Riel <riel@conectiva.com.br>
   Date: Tue, 12 Nov 2002 12:49:50 -0200 (BRST)

   On Mon, 11 Nov 2002, David S. Miller wrote:
   > The flush merely writes back the data, a copy-back operation, fully L2
   > cache coherent.  All cpus will see correct data if an intermittant
   > store occurs.
   
   The CPUs will, but the harddisk might not.
   
   We really need to get this right in the swapout path.

We do get it right, watch:

1) remove last mapping instance of page

   -> MM level cache flush pushes it permanently to ram
      in full view of DMA activity

2) remove last mapping, but new one appears as we swap out

   -> no problem we'll see one of several instances of the
      page and we'll need to reswap it out later anyways
      if someone currently writes to it

I know this because I tested this extensively ages ago on sparc32
where it really mattered.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-12 17:43       ` Hugh Dickins
@ 2002-11-12 21:51         ` David S. Miller
  2002-11-12 22:59           ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: David S. Miller @ 2002-11-12 21:51 UTC (permalink / raw)
  To: hugh; +Cc: akpm, dmccr, riel, linux-kernel

   From: Hugh Dickins <hugh@veritas.com>
   Date: Tue, 12 Nov 2002 17:43:40 +0000 (GMT)
   
   Sorry, I still don't get it.  If the flush_cache_page is doing something
   necessary, then won't a user access in between it and invalidating pte
   undo what was necessary?  And if it's not necessary, why do we do it?
   (For better performance would be a very good reason.)
   
If there are other writable mappings of the page, we can't swap
it out legally.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12 20:36                 ` Roland Dreier
@ 2002-11-12 22:31                   ` David S. Miller
  2002-11-12 22:39                     ` Roland Dreier
  2002-12-04 18:31                     ` rtnetlink replacement for SIOCSIFHWADDR Roland Dreier
  0 siblings, 2 replies; 29+ messages in thread
From: David S. Miller @ 2002-11-12 22:31 UTC (permalink / raw)
  To: roland; +Cc: alan, linux-kernel

   From: Roland Dreier <roland@topspin.com>
   Date: 12 Nov 2002 12:36:10 -0800

   Dave, Alan, if I wrote a patch to do this would you accept it?  (And
   following that increase MAX_ADDR_LEN?)

I would have to see it first, but likely yes.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [RFC] increase MAX_ADDR_LEN
  2002-11-12 22:31                   ` David S. Miller
@ 2002-11-12 22:39                     ` Roland Dreier
  2002-12-04 18:31                     ` rtnetlink replacement for SIOCSIFHWADDR Roland Dreier
  1 sibling, 0 replies; 29+ messages in thread
From: Roland Dreier @ 2002-11-12 22:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: alan, linux-kernel

>>>>> "David" == David S Miller <davem@redhat.com> writes:

    Roland> if I wrote a patch to do this would you accept it?  (And
    Roland> following that increase MAX_ADDR_LEN?)

    David> I would have to see it first, but likely yes.

Of course, that's what I would expect.  (And I expect to go through a
few revisions before it's good enough)  I just don't want to start on
something that you think is categorically a stupid idea.

I will code up the rtnetlink extensions I proposed and post them when
they are ready.

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] flush_cache_page while pte valid
  2002-11-12 21:51         ` David S. Miller
@ 2002-11-12 22:59           ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2002-11-12 22:59 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, dmccr, riel, linux-kernel

On Tue, 12 Nov 2002, David S. Miller wrote:

>    From: Hugh Dickins <hugh@veritas.com>
>    Date: Tue, 12 Nov 2002 17:43:40 +0000 (GMT)
>    
>    Sorry, I still don't get it.  If the flush_cache_page is doing something
>    necessary, then won't a user access in between it and invalidating pte
>    undo what was necessary?  And if it's not necessary, why do we do it?
>    (For better performance would be a very good reason.)
>    
> If there are other writable mappings of the page, we can't swap
> it out legally.

But I'm worried about the case where this is the last writable mapping:
it seems userspace (on another CPU) can still write to it in between
the flush_cache_page and invalidation of the pte (on this CPU hoping
to swap out that page).

Hugh


^ permalink raw reply	[flat|nested] 29+ messages in thread

* rtnetlink replacement for SIOCSIFHWADDR
@ 2002-12-04 18:31                     ` Roland Dreier
  2002-12-04 20:11                       ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2002-12-04 18:31 UTC (permalink / raw)
  To: ak, kuznet; +Cc: linux-kernel, davem

Hi, since you are the architects of the rtnetlink facility, I am
writing to you to ask your opinion on how to add some functionality.

I'm exploring what changes are needed in the kernel to support devices
with large addr_len.  My motivation is to get the infrastructure for
IP-over-InfiniBand, which has addr_len 20, into the kernel, and I am
trying to produce a patch that can be accepted into mainline 2.5.

With a few a minor fixes to avoid overrunning array bounds, almost
everything just works.  The RTM_NEWNEIGH and RTM_DELNEIGH messages
seem to provide what is needed to manage ARP entries with large L2
addresses (replacing the SIOCSARP and SIOCDARP ioctls).

The one major piece that is missing is a replacement for SIOCSIFHWADDR
(which can only set L2 addresses up to 14 bytes long, because of the
size of sa_data in struct sockaddr).  I can see two ways one might set
an interface's L2 address through rtnetlink:

  We could extend the RTM_NEWLINK message (possibly using the change
  mask) to include changing just the L2 address, and add support in
  the kernel for receiving these messages from userspace.

  Or, we could add a new RTM_SETLINK message that userspace can send
  to the kernel to modify an interface's properties.

Of course I am open to other suggestions for how to replace
SIOCSIFHWADDR.  I would very much appreciate your thoughts on how to
proceed.

Thanks,
  Roland <roland@topspin.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: rtnetlink replacement for SIOCSIFHWADDR
  2002-12-04 18:31                     ` rtnetlink replacement for SIOCSIFHWADDR Roland Dreier
@ 2002-12-04 20:11                       ` Andi Kleen
  2002-12-31 23:11                         ` [PATCH] increase MAX_ADDR_LEN Roland Dreier
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2002-12-04 20:11 UTC (permalink / raw)
  To: Roland Dreier; +Cc: ak, kuznet, linux-kernel, davem

>   Or, we could add a new RTM_SETLINK message that userspace can send
>   to the kernel to modify an interface's properties.

Defining a RTM_SETLINK would seem to be cleanest to me.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] increase MAX_ADDR_LEN
  2002-12-04 20:11                       ` Andi Kleen
@ 2002-12-31 23:11                         ` Roland Dreier
  2003-01-06 15:52                           ` [PATCH] increase MAX_ADDR_LEN (resend) Roland Dreier
  2003-01-07  8:58                           ` [PATCH] increase MAX_ADDR_LEN David S. Miller
  0 siblings, 2 replies; 29+ messages in thread
From: Roland Dreier @ 2002-12-31 23:11 UTC (permalink / raw)
  To: davem; +Cc: ak, linux-kernel

The patch included below is intended to increase MAX_ADDR_LEN to 32,
which will make adding IP-over-InfiniBand support much simpler.

I included the rtnetlink-based support for big hardware adresses that
we discussed earlier.  Specifically, the patch:

  1. Adds ARPHRD_INFINIBAND to if_arp.h.  This is extremely minor, but
     will make an IP-over-InfiniBand driver cleaner.

  2. Increases MAX_ADDR_LEN to 32 (from 8).

  3. Makes some small changes to net/core/dev.c to prevent the
     increased MAX_ADDR_LEN from overflowing when the SIOCGIFHWADDR
     and SIOCSIFHWADDR ioctls are used on interfaces with addr_len >
     14 (the size of sa_data in struct sockaddr).

  4. Adds RTM_SETLINK to <linux/rtnetlink.h> and adds a function
     do_setlink() to net/core/rtnetlink.c to handle this message from
     userspace.  This lets userspace set the hardware address and
     broadcast address for interfaces with addr_len > 14.

  5. Cleans up the initializer for link_rtnetlink_table[] so that it
     is much easier to read and change.

There is some suspicious-looking use of MAX_ADDR_LEN in
drivers/net/sungem.c and drivers/s390/net/lcs.c but I did not touch
those files in this patch.

This patch was created and tested with UML on kernel 2.5.47, but
applies cleanly to Linus's current 2.5.53-BK (with a few offsets).

Please apply this patch or let me know what needs to change for it to
be applied.

Thanks,
  Roland <roland@topspin.com>



diff -Naur linux-2.5.47/include/linux/if_arp.h linux-2.5.47-max-addr/include/linux/if_arp.h
--- linux-2.5.47/include/linux/if_arp.h	Sun Nov 10 19:28:29 2002
+++ linux-2.5.47-max-addr/include/linux/if_arp.h	Mon Dec 30 13:50:07 2002
@@ -40,6 +40,7 @@
 #define ARPHRD_METRICOM	23		/* Metricom STRIP (new IANA id)	*/
 #define	ARPHRD_IEEE1394	24		/* IEEE 1394 IPv4 - RFC 2734	*/
 #define ARPHRD_EUI64	27		/* EUI-64                       */
+#define ARPHRD_INFINIBAND 32		/* InfiniBand			*/
 
 /* Dummy types for non ARP hardware */
 #define ARPHRD_SLIP	256
diff -Naur linux-2.5.47/include/linux/netdevice.h linux-2.5.47-max-addr/include/linux/netdevice.h
--- linux-2.5.47/include/linux/netdevice.h	Sun Nov 10 19:28:31 2002
+++ linux-2.5.47-max-addr/include/linux/netdevice.h	Mon Dec 30 12:05:42 2002
@@ -65,7 +65,7 @@
 
 #endif
 
-#define MAX_ADDR_LEN	8		/* Largest hardware address length */
+#define MAX_ADDR_LEN	32		/* Largest hardware address length */
 
 /*
  *	Compute the worst case header length according to the protocols
diff -Naur linux-2.5.47/include/linux/rtnetlink.h linux-2.5.47-max-addr/include/linux/rtnetlink.h
--- linux-2.5.47/include/linux/rtnetlink.h	Sun Nov 10 19:28:29 2002
+++ linux-2.5.47-max-addr/include/linux/rtnetlink.h	Tue Dec 31 14:38:13 2002
@@ -17,6 +17,7 @@
 #define	RTM_NEWLINK	(RTM_BASE+0)
 #define	RTM_DELLINK	(RTM_BASE+1)
 #define	RTM_GETLINK	(RTM_BASE+2)
+#define	RTM_SETLINK	(RTM_BASE+3)
 
 #define	RTM_NEWADDR	(RTM_BASE+4)
 #define	RTM_DELADDR	(RTM_BASE+5)
diff -Naur linux-2.5.47/net/core/dev.c linux-2.5.47-max-addr/net/core/dev.c
--- linux-2.5.47/net/core/dev.c	Sun Nov 10 19:28:16 2002
+++ linux-2.5.47-max-addr/net/core/dev.c	Mon Dec 30 12:06:06 2002
@@ -2062,7 +2062,7 @@
 
 		case SIOCGIFHWADDR:
 			memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,
-			       MAX_ADDR_LEN);
+			       min(sizeof ifr->ifr_hwaddr.sa_data, (size_t) dev->addr_len));
 			ifr->ifr_hwaddr.sa_family = dev->type;
 			return 0;
 
@@ -2083,7 +2083,7 @@
 			if (ifr->ifr_hwaddr.sa_family != dev->type)
 				return -EINVAL;
 			memcpy(dev->broadcast, ifr->ifr_hwaddr.sa_data,
-			       MAX_ADDR_LEN);
+			       min(sizeof ifr->ifr_hwaddr.sa_data, (size_t) dev->addr_len));
 			notifier_call_chain(&netdev_chain,
 					    NETDEV_CHANGEADDR, dev);
 			return 0;
diff -Naur linux-2.5.47/net/core/rtnetlink.c linux-2.5.47-max-addr/net/core/rtnetlink.c
--- linux-2.5.47/net/core/rtnetlink.c	Sun Nov 10 19:28:33 2002
+++ linux-2.5.47-max-addr/net/core/rtnetlink.c	Tue Dec 31 14:38:50 2002
@@ -220,6 +220,40 @@
 	return skb->len;
 }
 
+static int do_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg) {
+	struct ifinfomsg  *ifm = NLMSG_DATA(nlh);
+	struct rtattr    **ida = arg;
+	struct net_device *dev;
+	int err;
+
+	dev = dev_get_by_index(ifm->ifi_index);
+	if (!dev) {
+		return -ENODEV;
+	}
+
+	err = -EINVAL;
+
+	if (ida[IFLA_ADDRESS - 1]) {
+		if (ida[IFLA_ADDRESS - 1]->rta_len != RTA_LENGTH(dev->addr_len)) {
+			goto out;
+		}
+		memcpy(dev->dev_addr, RTA_DATA(ida[IFLA_ADDRESS - 1]), dev->addr_len);
+	}
+
+	if (ida[IFLA_BROADCAST - 1]) {
+		if (ida[IFLA_BROADCAST - 1]->rta_len != RTA_LENGTH(dev->addr_len)) {
+			goto out;
+		}
+		memcpy(dev->broadcast, RTA_DATA(ida[IFLA_BROADCAST - 1]), dev->addr_len);
+	}
+
+	err = 0;
+
+out:
+	dev_put(dev);
+	return err;
+}
+
 int rtnetlink_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int idx;
@@ -457,32 +491,14 @@
 
 static struct rtnetlink_link link_rtnetlink_table[RTM_MAX-RTM_BASE+1] =
 {
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			rtnetlink_dump_ifinfo,	},
-	{ NULL,			NULL,			},
-
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			rtnetlink_dump_all,	},
-	{ NULL,			NULL,			},
-
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			rtnetlink_dump_all,	},
-	{ NULL,			NULL,			},
-
-	{ neigh_add,		NULL,			},
-	{ neigh_delete,		NULL,			},
-	{ NULL,			neigh_dump_info,	},
-	{ NULL,			NULL,			},
-
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
+	[RTM_GETLINK  - RTM_BASE] = { .dumpit = rtnetlink_dump_ifinfo },
+	[RTM_SETLINK  - RTM_BASE] = { .doit   = do_setlink	      },
+	[RTM_GETADDR  - RTM_BASE] = { .dumpit = rtnetlink_dump_all    },
+	[RTM_GETROUTE - RTM_BASE] = { .dumpit = rtnetlink_dump_all    },
+	[RTM_NEWNEIGH - RTM_BASE] = { .doit   = neigh_add	      },
+	[RTM_DELNEIGH - RTM_BASE] = { .doit   = neigh_delete	      },
+	[RTM_GETNEIGH - RTM_BASE] = { .dumpit = neigh_dump_info	      }
 };
-
 
 static int rtnetlink_event(struct notifier_block *this, unsigned long event, void *ptr)
 {

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] increase MAX_ADDR_LEN (resend)
  2002-12-31 23:11                         ` [PATCH] increase MAX_ADDR_LEN Roland Dreier
@ 2003-01-06 15:52                           ` Roland Dreier
  2003-01-06 23:29                             ` David S. Miller
  2003-01-07  8:58                           ` [PATCH] increase MAX_ADDR_LEN David S. Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2003-01-06 15:52 UTC (permalink / raw)
  To: davem; +Cc: linux-kernel

Hi, I think this got lost in your inbox over New Year's (I send it
December 31), so I'm resending.

The patch included below is intended to increase MAX_ADDR_LEN to 32,
which will make adding IP-over-InfiniBand support much simpler.

I included the rtnetlink-based support for big hardware adresses that
we discussed earlier.  Specifically, the patch:

  1. Adds ARPHRD_INFINIBAND to if_arp.h.  This is extremely minor, but
     will make an IP-over-InfiniBand driver cleaner.

  2. Increases MAX_ADDR_LEN to 32 (from 8).

  3. Makes some small changes to net/core/dev.c to prevent the
     increased MAX_ADDR_LEN from overflowing when the SIOCGIFHWADDR
     and SIOCSIFHWADDR ioctls are used on interfaces with addr_len >
     14 (the size of sa_data in struct sockaddr).

  4. Adds RTM_SETLINK to <linux/rtnetlink.h> and adds a function
     do_setlink() to net/core/rtnetlink.c to handle this message from
     userspace.  This lets userspace set the hardware address and
     broadcast address for interfaces with addr_len > 14.

  5. Cleans up the initializer for link_rtnetlink_table[] so that it
     is much easier to read and change.

There is some suspicious-looking use of MAX_ADDR_LEN in
drivers/net/sungem.c and drivers/s390/net/lcs.c but I did not touch
those files in this patch.

This patch was created and tested with UML on kernel 2.5.47, but
applies cleanly to Linus's current 2.5.53-BK (with a few offsets).

Please apply this patch or let me know what needs to change for it to
be applied.

Thanks,
  Roland <roland@topspin.com>



diff -Naur linux-2.5.47/include/linux/if_arp.h linux-2.5.47-max-addr/include/linux/if_arp.h
--- linux-2.5.47/include/linux/if_arp.h	Sun Nov 10 19:28:29 2002
+++ linux-2.5.47-max-addr/include/linux/if_arp.h	Mon Dec 30 13:50:07 2002
@@ -40,6 +40,7 @@
 #define ARPHRD_METRICOM	23		/* Metricom STRIP (new IANA id)	*/
 #define	ARPHRD_IEEE1394	24		/* IEEE 1394 IPv4 - RFC 2734	*/
 #define ARPHRD_EUI64	27		/* EUI-64                       */
+#define ARPHRD_INFINIBAND 32		/* InfiniBand			*/
 
 /* Dummy types for non ARP hardware */
 #define ARPHRD_SLIP	256
diff -Naur linux-2.5.47/include/linux/netdevice.h linux-2.5.47-max-addr/include/linux/netdevice.h
--- linux-2.5.47/include/linux/netdevice.h	Sun Nov 10 19:28:31 2002
+++ linux-2.5.47-max-addr/include/linux/netdevice.h	Mon Dec 30 12:05:42 2002
@@ -65,7 +65,7 @@
 
 #endif
 
-#define MAX_ADDR_LEN	8		/* Largest hardware address length */
+#define MAX_ADDR_LEN	32		/* Largest hardware address length */
 
 /*
  *	Compute the worst case header length according to the protocols
diff -Naur linux-2.5.47/include/linux/rtnetlink.h linux-2.5.47-max-addr/include/linux/rtnetlink.h
--- linux-2.5.47/include/linux/rtnetlink.h	Sun Nov 10 19:28:29 2002
+++ linux-2.5.47-max-addr/include/linux/rtnetlink.h	Tue Dec 31 14:38:13 2002
@@ -17,6 +17,7 @@
 #define	RTM_NEWLINK	(RTM_BASE+0)
 #define	RTM_DELLINK	(RTM_BASE+1)
 #define	RTM_GETLINK	(RTM_BASE+2)
+#define	RTM_SETLINK	(RTM_BASE+3)
 
 #define	RTM_NEWADDR	(RTM_BASE+4)
 #define	RTM_DELADDR	(RTM_BASE+5)
diff -Naur linux-2.5.47/net/core/dev.c linux-2.5.47-max-addr/net/core/dev.c
--- linux-2.5.47/net/core/dev.c	Sun Nov 10 19:28:16 2002
+++ linux-2.5.47-max-addr/net/core/dev.c	Mon Dec 30 12:06:06 2002
@@ -2062,7 +2062,7 @@
 
 		case SIOCGIFHWADDR:
 			memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,
-			       MAX_ADDR_LEN);
+			       min(sizeof ifr->ifr_hwaddr.sa_data, (size_t) dev->addr_len));
 			ifr->ifr_hwaddr.sa_family = dev->type;
 			return 0;
 
@@ -2083,7 +2083,7 @@
 			if (ifr->ifr_hwaddr.sa_family != dev->type)
 				return -EINVAL;
 			memcpy(dev->broadcast, ifr->ifr_hwaddr.sa_data,
-			       MAX_ADDR_LEN);
+			       min(sizeof ifr->ifr_hwaddr.sa_data, (size_t) dev->addr_len));
 			notifier_call_chain(&netdev_chain,
 					    NETDEV_CHANGEADDR, dev);
 			return 0;
diff -Naur linux-2.5.47/net/core/rtnetlink.c linux-2.5.47-max-addr/net/core/rtnetlink.c
--- linux-2.5.47/net/core/rtnetlink.c	Sun Nov 10 19:28:33 2002
+++ linux-2.5.47-max-addr/net/core/rtnetlink.c	Tue Dec 31 14:38:50 2002
@@ -220,6 +220,40 @@
 	return skb->len;
 }
 
+static int do_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg) {
+	struct ifinfomsg  *ifm = NLMSG_DATA(nlh);
+	struct rtattr    **ida = arg;
+	struct net_device *dev;
+	int err;
+
+	dev = dev_get_by_index(ifm->ifi_index);
+	if (!dev) {
+		return -ENODEV;
+	}
+
+	err = -EINVAL;
+
+	if (ida[IFLA_ADDRESS - 1]) {
+		if (ida[IFLA_ADDRESS - 1]->rta_len != RTA_LENGTH(dev->addr_len)) {
+			goto out;
+		}
+		memcpy(dev->dev_addr, RTA_DATA(ida[IFLA_ADDRESS - 1]), dev->addr_len);
+	}
+
+	if (ida[IFLA_BROADCAST - 1]) {
+		if (ida[IFLA_BROADCAST - 1]->rta_len != RTA_LENGTH(dev->addr_len)) {
+			goto out;
+		}
+		memcpy(dev->broadcast, RTA_DATA(ida[IFLA_BROADCAST - 1]), dev->addr_len);
+	}
+
+	err = 0;
+
+out:
+	dev_put(dev);
+	return err;
+}
+
 int rtnetlink_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int idx;
@@ -457,32 +491,14 @@
 
 static struct rtnetlink_link link_rtnetlink_table[RTM_MAX-RTM_BASE+1] =
 {
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			rtnetlink_dump_ifinfo,	},
-	{ NULL,			NULL,			},
-
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			rtnetlink_dump_all,	},
-	{ NULL,			NULL,			},
-
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			rtnetlink_dump_all,	},
-	{ NULL,			NULL,			},
-
-	{ neigh_add,		NULL,			},
-	{ neigh_delete,		NULL,			},
-	{ NULL,			neigh_dump_info,	},
-	{ NULL,			NULL,			},
-
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
-	{ NULL,			NULL,			},
+	[RTM_GETLINK  - RTM_BASE] = { .dumpit = rtnetlink_dump_ifinfo },
+	[RTM_SETLINK  - RTM_BASE] = { .doit   = do_setlink	      },
+	[RTM_GETADDR  - RTM_BASE] = { .dumpit = rtnetlink_dump_all    },
+	[RTM_GETROUTE - RTM_BASE] = { .dumpit = rtnetlink_dump_all    },
+	[RTM_NEWNEIGH - RTM_BASE] = { .doit   = neigh_add	      },
+	[RTM_DELNEIGH - RTM_BASE] = { .doit   = neigh_delete	      },
+	[RTM_GETNEIGH - RTM_BASE] = { .dumpit = neigh_dump_info	      }
 };
-
 
 static int rtnetlink_event(struct notifier_block *this, unsigned long event, void *ptr)
 {

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] increase MAX_ADDR_LEN (resend)
  2003-01-06 15:52                           ` [PATCH] increase MAX_ADDR_LEN (resend) Roland Dreier
@ 2003-01-06 23:29                             ` David S. Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David S. Miller @ 2003-01-06 23:29 UTC (permalink / raw)
  To: roland; +Cc: linux-kernel

   From: Roland Dreier <roland@topspin.com>
   Date: 06 Jan 2003 07:52:57 -0800

   Hi, I think this got lost in your inbox over New Year's (I send it
   December 31), so I'm resending.

Not lost, I'm just still working on my backlog.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] increase MAX_ADDR_LEN
  2002-12-31 23:11                         ` [PATCH] increase MAX_ADDR_LEN Roland Dreier
  2003-01-06 15:52                           ` [PATCH] increase MAX_ADDR_LEN (resend) Roland Dreier
@ 2003-01-07  8:58                           ` David S. Miller
  1 sibling, 0 replies; 29+ messages in thread
From: David S. Miller @ 2003-01-07  8:58 UTC (permalink / raw)
  To: roland; +Cc: ak, linux-kernel

   From: Roland Dreier <roland@topspin.com>
   Date: 31 Dec 2002 15:11:31 -0800
   
   There is some suspicious-looking use of MAX_ADDR_LEN in
   drivers/net/sungem.c and drivers/s390/net/lcs.c but I did not touch
   those files in this patch.
 ...   
   Please apply this patch or let me know what needs to change for it to
   be applied.

I applied the patch and added a #warning to the powerpc bits of
sungem.c that look suspicious.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2003-01-07  8:58 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-11-11 18:25 [PATCH] flush_cache_page while pte valid Hugh Dickins
2002-11-11 18:35 ` Andrew Morton
2002-11-11 23:19 ` David S. Miller
2002-11-11 23:34   ` [PATCH] [RFC] increase MAX_ADDR_LEN Roland Dreier
2002-11-11 23:38     ` David S. Miller
2002-11-11 23:58       ` Roland Dreier
2002-11-12  0:18       ` Alan Cox
2002-11-12  0:01         ` Roland Dreier
2002-11-12 14:23           ` Alan Cox
2002-11-12 14:20             ` Folkert van Heusden
2002-11-12 15:14             ` Roland Dreier
2002-11-12 16:00               ` Alan Cox
2002-11-12 15:50                 ` YOSHIFUJI Hideaki / 吉藤英明
2002-11-12 20:36                 ` Roland Dreier
2002-11-12 22:31                   ` David S. Miller
2002-11-12 22:39                     ` Roland Dreier
2002-12-04 18:31                     ` rtnetlink replacement for SIOCSIFHWADDR Roland Dreier
2002-12-04 20:11                       ` Andi Kleen
2002-12-31 23:11                         ` [PATCH] increase MAX_ADDR_LEN Roland Dreier
2003-01-06 15:52                           ` [PATCH] increase MAX_ADDR_LEN (resend) Roland Dreier
2003-01-06 23:29                             ` David S. Miller
2003-01-07  8:58                           ` [PATCH] increase MAX_ADDR_LEN David S. Miller
2002-11-12  6:53   ` [PATCH] flush_cache_page while pte valid Hugh Dickins
2002-11-12  6:53     ` David S. Miller
2002-11-12 14:49       ` Rik van Riel
2002-11-12 21:45         ` David S. Miller
2002-11-12 17:43       ` Hugh Dickins
2002-11-12 21:51         ` David S. Miller
2002-11-12 22:59           ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).