linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] -stable review
@ 2006-01-05 21:47 ` Chris Wright
  2006-01-06  0:45   ` [PATCH 1/6] drivers/net/sungem.c: gem_remove_one mustnt be __devexit Chris Wright
                     ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-05 21:47 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, torvalds, akpm, alan

This is the start of the stable review cycle for the 2.6.14.6 release.
There are 6 patches in this series, all of which will be posted as
responses to this one.  If anyone has any issues with these being
applied, please let us know.  If anyone is a maintainer of the relevant
subsystem and wants to add a Signed-off-by: line to a patch, please
respond with it.

These patches are sent out with a number of different people on the
Cc: line.  If you wish to be a reviewer, please email stable@kernel.org
to add your name to the list.  If you want to be off the reviewer list,
also email us.

Responses should be made by Sat, Jan 7, 23:00 UTC.  Anything received
after that time might be too late.

thanks,

the -stable release team
--


* [PATCH 1/6] drivers/net/sungem.c: gem_remove_one mustnt be __devexit
  2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
@ 2006-01-06  0:45   ` Chris Wright
  2006-01-06  0:45   ` [PATCH 2/6] ieee80211_crypt_tkip depends on NET_RADIO Chris Wright
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-06  0:45 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, torvalds, akpm, alan, Adrian Bunk,
	David S. Miller, jgarzik, Greg Kroah-Hartman

[-- Attachment #1: drivers-net-sungem.c-gem_remove_one-mustn-t-be-__devexit.patch --]
[-- Type: text/plain, Size: 1106 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

gem_remove_one() is called from the __devinit gem_init_one().

Therefore, gem_remove_one() mustn't be __devexit.

This patch was already included in 2.6.15-rc7.


Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 drivers/net/sungem.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.14.5.orig/drivers/net/sungem.c
+++ linux-2.6.14.5/drivers/net/sungem.c
@@ -2905,7 +2905,7 @@ static int __devinit gem_get_device_addr
 	return 0;
 }
 
-static void __devexit gem_remove_one(struct pci_dev *pdev)
+static void gem_remove_one(struct pci_dev *pdev)
 {
 	struct net_device *dev = pci_get_drvdata(pdev);
 
@@ -3179,7 +3179,7 @@ static struct pci_driver gem_driver = {
 	.name		= GEM_MODULE_NAME,
 	.id_table	= gem_pci_tbl,
 	.probe		= gem_init_one,
-	.remove		= __devexit_p(gem_remove_one),
+	.remove		= gem_remove_one,
 #ifdef CONFIG_PM
 	.suspend	= gem_suspend,
 	.resume		= gem_resume,

--


* [PATCH 2/6] ieee80211_crypt_tkip depends on NET_RADIO
  2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
  2006-01-06  0:45   ` [PATCH 1/6] drivers/net/sungem.c: gem_remove_one mustnt be __devexit Chris Wright
@ 2006-01-06  0:45   ` Chris Wright
  2006-01-06  0:45   ` [PATCH 3/6] Insanity avoidance in /proc (CVE-2005-4605) Chris Wright
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-06  0:45 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, torvalds, akpm, alan, Adrian Bunk,
	Jeff Garzik, Olaf Hering

[-- Attachment #1: ieee80211_crypt_tkip-depends-on-net_radio.patch --]
[-- Type: text/plain, Size: 821 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

*** Warning: ".wireless_send_event" [net/ieee80211/ieee80211_crypt_tkip.ko]

This bug was also reported as kernel Bugzilla #5551.

Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 net/ieee80211/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.14.5.orig/net/ieee80211/Kconfig
+++ linux-2.6.14.5/net/ieee80211/Kconfig
@@ -55,7 +55,7 @@ config IEEE80211_CRYPT_CCMP
 
 config IEEE80211_CRYPT_TKIP
 	tristate "IEEE 802.11i TKIP encryption"
-	depends on IEEE80211
+	depends on IEEE80211 && NET_RADIO
 	select CRYPTO
 	select CRYPTO_MICHAEL_MIC
 	---help---

--


* [PATCH 3/6] Insanity avoidance in /proc (CVE-2005-4605)
  2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
  2006-01-06  0:45   ` [PATCH 1/6] drivers/net/sungem.c: gem_remove_one mustnt be __devexit Chris Wright
  2006-01-06  0:45   ` [PATCH 2/6] ieee80211_crypt_tkip depends on NET_RADIO Chris Wright
@ 2006-01-06  0:45   ` Chris Wright
  2006-01-06  0:45   ` [PATCH 4/6] sysctl: dont overflow the user-supplied buffer with 0 Chris Wright
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-06  0:45 UTC (permalink / raw)
  To: linux-kernel, stable, git-commits-head
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, torvalds, akpm, alan,
	Greg Kroah-Hartman

[-- Attachment #1: insanity-avoidance-in-proc.patch --]
[-- Type: text/plain, Size: 2221 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

Insanity avoidance in /proc

The old /proc interfaces were never updated to use loff_t, and are just
generally broken.  Now, we should be using the seq_file interface for
all of the proc files, but converting the legacy functions is more work
than most people care for and has little upside.

But at least we can make the non-LFS rules explicit, rather than just
insanely wrapping the offset or something.
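The explicit non-LFS rules can be sketched in plain userspace C. This is an illustrative model only: the function and macro names are ours, with the values of MAX_NON_LFS ((1UL << 31) - 1) and -EINVAL filled in as constants.

```c
#include <assert.h>

#define MAX_NON_LFS 0x7fffffffLL   /* ((1UL << 31) - 1) in the kernel */
#define EINVAL_RET  (-22LL)        /* -EINVAL */

/* Userspace sketch of the rewritten proc_file_lseek(): SEEK_CUR is
 * folded into SEEK_SET by falling through, and any resulting offset
 * outside [0, MAX_NON_LFS] -- including every SEEK_END request --
 * fails with -EINVAL, making the non-LFS limit explicit. */
static long long proc_lseek_model(long long *f_pos, long long offset, int whence)
{
	long long retval = EINVAL_RET;

	switch (whence) {
	case 1:				/* SEEK_CUR */
		offset += *f_pos;
		/* fallthrough */
	case 0:				/* SEEK_SET */
		if (offset < 0 || offset > MAX_NON_LFS)
			break;
		*f_pos = retval = offset;
	}
	return retval;
}
```

On failure the file position is left untouched, matching the patched kernel function.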

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 fs/proc/generic.c |   47 +++++++++++++++++++++++------------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

--- linux-2.6.14.5.orig/fs/proc/generic.c
+++ linux-2.6.14.5/fs/proc/generic.c
@@ -54,6 +54,18 @@ proc_file_read(struct file *file, char _
 	ssize_t	n, count;
 	char	*start;
 	struct proc_dir_entry * dp;
+	unsigned long long pos;
+
+	/*
+	 * Gaah, please just use "seq_file" instead. The legacy /proc
+	 * interfaces cut loff_t down to off_t for reads, and ignore
+	 * the offset entirely for writes..
+	 */
+	pos = *ppos;
+	if (pos > MAX_NON_LFS)
+		return 0;
+	if (nbytes > MAX_NON_LFS - pos)
+		nbytes = MAX_NON_LFS - pos;
 
 	dp = PDE(inode);
 	if (!(page = (char*) __get_free_page(GFP_KERNEL)))
@@ -202,30 +214,17 @@ proc_file_write(struct file *file, const
 static loff_t
 proc_file_lseek(struct file *file, loff_t offset, int orig)
 {
-    lock_kernel();
-
-    switch (orig) {
-    case 0:
-	if (offset < 0)
-	    goto out;
-	file->f_pos = offset;
-	unlock_kernel();
-	return(file->f_pos);
-    case 1:
-	if (offset + file->f_pos < 0)
-	    goto out;
-	file->f_pos += offset;
-	unlock_kernel();
-	return(file->f_pos);
-    case 2:
-	goto out;
-    default:
-	goto out;
-    }
-
-out:
-    unlock_kernel();
-    return -EINVAL;
+	loff_t retval = -EINVAL;
+	switch (orig) {
+	case 1:
+		offset += file->f_pos;
+	/* fallthrough */
+	case 0:
+		if (offset < 0 || offset > MAX_NON_LFS)
+			break;
+		file->f_pos = retval = offset;
+	}
+	return retval;
 }
 
 static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)

--


* [PATCH 4/6] sysctl: dont overflow the user-supplied buffer with 0
  2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
                     ` (2 preceding siblings ...)
  2006-01-06  0:45   ` [PATCH 3/6] Insanity avoidance in /proc (CVE-2005-4605) Chris Wright
@ 2006-01-06  0:45   ` Chris Wright
  2006-01-06  1:30     ` Linus Torvalds
  2006-01-06  0:46   ` [PATCH 5/6] UFS: inode->i_sem is not released in error path Chris Wright
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Chris Wright @ 2006-01-06  0:45 UTC (permalink / raw)
  To: linux-kernel, stable, git-commits-head
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, torvalds, akpm, alan, Linus Torvalds,
	Greg Kroah-Hartman

[-- Attachment #1: sysctl-don-t-overflow-the-user-supplied-buffer-with-0.patch --]
[-- Type: text/plain, Size: 1076 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

If the string was too long to fit in the user-supplied buffer,
the sysctl layer would zero-terminate it by writing past the
end of the buffer. Don't do that.

Noticed by Yi Yang <yang.y.yi@gmail.com>
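As an illustration only, here is a hedged userspace model (our names, not kernel code) of the copy-out logic before and after the fix; a canary byte just past the "user" buffer shows the old one-byte overrun:

```c
#include <assert.h>
#include <string.h>

/* Model of sysctl_string()'s copy-out path.  'fixed' selects the
 * patched behaviour: count the terminating NUL in the string length
 * instead of writing an extra NUL past the user-supplied length. */
static long copy_out(char *user_buf, size_t buflen, const char *data,
		     size_t maxlen, int fixed)
{
	size_t len = buflen;
	size_t l = strlen(data) + (fixed ? 1 : 0);

	if (len > l)
		len = l;
	if (len >= maxlen)
		len = maxlen;
	memcpy(user_buf, data, len);
	if (!fixed)
		user_buf[len] = 0;	/* old code: may land one past buflen */
	return (long)len;
}
```

When the string exactly fills the user-supplied length, the old path writes its terminating zero one byte beyond that length.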

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 kernel/sysctl.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- linux-2.6.14.5.orig/kernel/sysctl.c
+++ linux-2.6.14.5/kernel/sysctl.c
@@ -2200,14 +2200,12 @@ int sysctl_string(ctl_table *table, int 
 		if (get_user(len, oldlenp))
 			return -EFAULT;
 		if (len) {
-			l = strlen(table->data);
+			l = strlen(table->data)+1;
 			if (len > l) len = l;
 			if (len >= table->maxlen)
 				len = table->maxlen;
 			if(copy_to_user(oldval, table->data, len))
 				return -EFAULT;
-			if(put_user(0, ((char __user *) oldval) + len))
-				return -EFAULT;
 			if(put_user(len, oldlenp))
 				return -EFAULT;
 		}

--


* [PATCH 5/6] UFS: inode->i_sem is not released in error path
  2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
                     ` (3 preceding siblings ...)
  2006-01-06  0:45   ` [PATCH 4/6] sysctl: dont overflow the user-supplied buffer with 0 Chris Wright
@ 2006-01-06  0:46   ` Chris Wright
  2006-01-06  0:46   ` [PATCH 6/6] [ATYFB]: Fix onboard video on SPARC Blade 100 for 2.6.{13,14,15} Chris Wright
  2006-01-06  0:53   ` [PATCH 0/6] -stable review Chris Wright
  6 siblings, 0 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-06  0:46 UTC (permalink / raw)
  To: linux-kernel, stable, johnpol, mm-commits
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, torvalds, akpm, alan,
	Greg Kroah-Hartman

[-- Attachment #1: ufs-inode-i_sem-is-not-released-in-error-path.patch --]
[-- Type: text/plain, Size: 630 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

---

 fs/ufs/super.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.14.5/fs/ufs/super.c
===================================================================
--- linux-2.6.14.5.orig/fs/ufs/super.c
+++ linux-2.6.14.5/fs/ufs/super.c
@@ -1294,8 +1294,10 @@ static ssize_t ufs_quota_write(struct su
 		blk++;
 	}
 out:
-	if (len == towrite)
+	if (len == towrite) {
+		up(&inode->i_sem);
 		return err;
+	}
 	if (inode->i_size < off+len-towrite)
 		i_size_write(inode, off+len-towrite);
 	inode->i_version++;

--


* [PATCH 6/6] [ATYFB]: Fix onboard video on SPARC Blade 100 for 2.6.{13,14,15}
  2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
                     ` (4 preceding siblings ...)
  2006-01-06  0:46   ` [PATCH 5/6] UFS: inode->i_sem is not released in error path Chris Wright
@ 2006-01-06  0:46   ` Chris Wright
  2006-01-06  0:53   ` [PATCH 0/6] -stable review Chris Wright
  6 siblings, 0 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-06  0:46 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, torvalds, akpm, alan, David S. Miller,
	Luis F. Ortiz

[-- Attachment #1: fix-ATY-video-on-sunblade.patch --]
[-- Type: text/plain, Size: 2528 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

	I have recently been switching from 2.4.32 on my trusty
old SPARC Blade 100 to 2.6.15.  Some of the problems I ran into
were distorted video when the console was active (missing first
character, skipped dots) and when running X windows (colored snow,
stripes, missing pixels).  A quick examination of the 2.6 versus 2.4
source for the ATY driver revealed a lot of changes.

	A closer look at the code/data for the 64GR/XL chip revealed
two minor "typos" introduced by the rewriter(s) of the code.  The first
is an incorrect clock value (230 vs. 235) and the second is a missing
flag (M64F_SDRAM_MAGIC_PLL).  Making both of these changes seems to have
fixed my problem.  I tend to think the 235 value is the correct one,
as there is a 29.4 MHz clock crystal close to the video chip, and 235.2
(29.4*8) is too close to 235 for it to be a coincidence.

	The flag for M64F_SDRAM_MAGIC_PLL was dropped during the
changes made by adaplas in file revision 1.72 on the old bitkeeper
repository.

	The change relating to the clock rate has been there forever,
at least in the 2.6 tree.  I'm not sure where to look for the old 2.5
tree or if anyone cares when it happened.

On SPARC Blade 100s, which use the ATY MACH64GR video chipset, the
reference clock frequency is 235.2 MHz, not 230 MHz.  The chipset also
requires the use of M64F_SDRAM_MAGIC_PLL in order to set up the PLL
properly for the DRAM.
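The crystal arithmetic is easy to sanity-check (a trivial sketch; the function name is ours):

```c
#include <assert.h>

/* The Blade 100 carries a 29.4 MHz crystal near the video chip;
 * multiplied by 8 it gives the 235.2 MHz reference clock that the
 * driver table rounds to 235 (instead of the generic 230). */
static double ref_clock_mhz(double crystal_mhz, int multiplier)
{
	return crystal_mhz * multiplier;
}
```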

Signed-off-by: "Luis F. Ortiz" <lfo@Polyad.Org>
Signed-off-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 drivers/video/aty/atyfb_base.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.14.5.orig/drivers/video/aty/atyfb_base.c
+++ linux-2.6.14.5/drivers/video/aty/atyfb_base.c
@@ -404,7 +404,7 @@ static struct {
 	{ PCI_CHIP_MACH64GM, "3D RAGE XL (Mach64 GM, AGP)", 230, 83, 63, ATI_CHIP_264XL },
 	{ PCI_CHIP_MACH64GN, "3D RAGE XL (Mach64 GN, AGP)", 230, 83, 63, ATI_CHIP_264XL },
 	{ PCI_CHIP_MACH64GO, "3D RAGE XL (Mach64 GO, PCI-66/BGA)", 230, 83, 63, ATI_CHIP_264XL },
-	{ PCI_CHIP_MACH64GR, "3D RAGE XL (Mach64 GR, PCI-33MHz)", 230, 83, 63, ATI_CHIP_264XL },
+	{ PCI_CHIP_MACH64GR, "3D RAGE XL (Mach64 GR, PCI-33MHz)", 235, 83, 63, ATI_CHIP_264XL | M64F_SDRAM_MAGIC_PLL },
 	{ PCI_CHIP_MACH64GL, "3D RAGE XL (Mach64 GL, PCI)", 230, 83, 63, ATI_CHIP_264XL },
 	{ PCI_CHIP_MACH64GS, "3D RAGE XL (Mach64 GS, PCI)", 230, 83, 63, ATI_CHIP_264XL },
 

--


* Re: [PATCH 0/6] -stable review
  2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
                     ` (5 preceding siblings ...)
  2006-01-06  0:46   ` [PATCH 6/6] [ATYFB]: Fix onboard video on SPARC Blade 100 for 2.6.{13,14,15} Chris Wright
@ 2006-01-06  0:53   ` Chris Wright
  6 siblings, 0 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-06  0:53 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Theodore Ts'o, Zwane Mwaikambo, Justin Forbes, torvalds,
	Randy Dunlap, Dave Jones, Chuck Wolber, alan

* Chris Wright (chrisw@sous-sol.org) wrote:
> There are 6 patches in this series, all will be posted as a response to
> this one.

Sorry about the misfire.  I reposted the actual patches.

thanks,
-chris


* Re: [PATCH 4/6] sysctl: dont overflow the user-supplied buffer with 0
  2006-01-06  0:45   ` [PATCH 4/6] sysctl: dont overflow the user-supplied buffer with 0 Chris Wright
@ 2006-01-06  1:30     ` Linus Torvalds
  2006-01-06  3:40       ` Chris Wright
  2006-01-06 10:17       ` [PATCH, RFC] RCU : OOM avoidance and lower latency Eric Dumazet
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2006-01-06  1:30 UTC (permalink / raw)
  To: Chris Wright
  Cc: linux-kernel, stable, git-commits-head, Justin Forbes,
	Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap, Dave Jones,
	Chuck Wolber, akpm, alan, Greg Kroah-Hartman



On Thu, 5 Jan 2006, Chris Wright wrote:
> 
> If the string was too long to fit in the user-supplied buffer,
> the sysctl layer would zero-terminate it by writing past the
> end of the buffer. Don't do that.

Don't use this one. It doesn't zero-pad the string if the result buffer is 
too small (which _could_ cause problems) and it actually returns the wrong 
length too. 

There's a better one in the final 2.6.15.

			Linus


* Re: [PATCH 4/6] sysctl: dont overflow the user-supplied buffer with 0
  2006-01-06  1:30     ` Linus Torvalds
@ 2006-01-06  3:40       ` Chris Wright
  2006-01-06 10:17       ` [PATCH, RFC] RCU : OOM avoidance and lower latency Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: Chris Wright @ 2006-01-06  3:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Wright, linux-kernel, stable, git-commits-head,
	Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, akpm, alan, Greg Kroah-Hartman

* Linus Torvalds (torvalds@osdl.org) wrote:
> Don't use this one. It doesn't zero-pad the string if the result buffer is 
> too small (which _could_ cause problems) and it actually returns the wrong 
> length too. 
> 
> There's a better one in the final 2.6.15.

Thanks, adding this (it's got both rolled together so it's the same as
upstream).

thanks,
-chris
--

Subject: sysctl: make sure to terminate strings with a NUL

From: Linus Torvalds <torvalds@osdl.org>

This is a slightly more complete fix for the previous minimal sysctl
string fix.  It always terminates the returned string with a NUL, even
if the full result wouldn't fit in the user-supplied buffer.

The returned length is the full untruncated length, so that you can
tell when truncation has occurred.
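A userspace model of the described behaviour (our names; only the copy-out half, assuming a destination buffer of at least one byte):

```c
#include <assert.h>
#include <string.h>

/* Sketch of the 2.6.15 sysctl_string() copy-out: the destination is
 * always NUL-terminated, at most bufsize-1 string bytes are copied,
 * and the full untruncated length is returned so the caller can
 * detect truncation.  Assumes bufsize >= 1. */
static size_t sysctl_copy_model(char *dst, size_t bufsize, const char *data)
{
	size_t len = strlen(data);
	size_t copied = (len >= bufsize) ? bufsize - 1 : len;

	memcpy(dst, data, copied);
	dst[copied] = 0;
	return len;		/* full length, even when truncated */
}
```

A caller sees truncation whenever the returned length is greater than or equal to the buffer size it supplied.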

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[chrisw: inclusive of minimal fix so it's same as upstream]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 kernel/sysctl.c |   29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

--- linux-2.6.14.5.orig/kernel/sysctl.c
+++ linux-2.6.14.5/kernel/sysctl.c
@@ -2191,29 +2191,32 @@ int sysctl_string(ctl_table *table, int 
 		  void __user *oldval, size_t __user *oldlenp,
 		  void __user *newval, size_t newlen, void **context)
 {
-	size_t l, len;
-	
 	if (!table->data || !table->maxlen) 
 		return -ENOTDIR;
 	
 	if (oldval && oldlenp) {
-		if (get_user(len, oldlenp))
+		size_t bufsize;
+		if (get_user(bufsize, oldlenp))
 			return -EFAULT;
-		if (len) {
-			l = strlen(table->data);
-			if (len > l) len = l;
-			if (len >= table->maxlen)
+		if (bufsize) {
+			size_t len = strlen(table->data), copied;
+
+			/* This shouldn't trigger for a well-formed sysctl */
+			if (len > table->maxlen)
 				len = table->maxlen;
-			if(copy_to_user(oldval, table->data, len))
-				return -EFAULT;
-			if(put_user(0, ((char __user *) oldval) + len))
+
+			/* Copy up to a max of bufsize-1 bytes of the string */
+			copied = (len >= bufsize) ? bufsize - 1 : len;
+
+			if (copy_to_user(oldval, table->data, copied) ||
+			    put_user(0, (char __user *)(oldval + copied)))
 				return -EFAULT;
-			if(put_user(len, oldlenp))
+			if (put_user(len, oldlenp))
 				return -EFAULT;
 		}
 	}
 	if (newval && newlen) {
-		len = newlen;
+		size_t len = newlen;
 		if (len > table->maxlen)
 			len = table->maxlen;
 		if(copy_from_user(table->data, newval, len))
@@ -2222,7 +2225,7 @@ int sysctl_string(ctl_table *table, int 
 			len--;
 		((char *) table->data)[len] = 0;
 	}
-	return 0;
+	return 1;
 }
 
 /*


* [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06  1:30     ` Linus Torvalds
  2006-01-06  3:40       ` Chris Wright
@ 2006-01-06 10:17       ` Eric Dumazet
  2006-01-06 12:52         ` [PATCH, RFC] RCU : OOM avoidance and lower latency (Version 2), HOTPLUG_CPU fix Eric Dumazet
                           ` (3 more replies)
  1 sibling, 4 replies; 34+ messages in thread
From: Eric Dumazet @ 2006-01-06 10:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, David S. Miller, Dipankar Sarma, Paul E. McKenney,
	Manfred Spraul, netdev

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]


In order to avoid OOMs triggered by a flood of call_rcu() calls, in Linux
2.6.14 we increased maxbatch from 10 to 10000 and made call_rcu()
conditionally call set_need_resched().

This solution doesn't solve all of the problems, and it has drawbacks:

1) Using a big maxbatch has a bad impact on latency.
2) A flood of call_rcu_bh() can still OOM.

I have some servers that crash once in a while when the IP route cache is
flushed.  After raising /proc/sys/net/ipv4/route/secret_interval (so that
*no* flush is done), I got better uptime on these servers.  But in some
cases I think the network stack can flood call_rcu_bh(), and a fatal OOM
occurs.

In this patch I suggest:

1) Lowering maxbatch to a more reasonable value (as far as latency is
concerned).

2) Guarding an RCU per-CPU queue against a maximal count (10,000 for
example).  If this limit is reached, free the oldest entry of this queue.

I assume that if a CPU has queued 10,000 items in its RCU queue, then the
oldest entry cannot still be in use by another CPU.  This might sound like a
violation of RCU rules (I'm not an RCU expert), but it seems quite reasonable.
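The capping idea can be modelled with a toy userspace queue (the names and the cap of 4 are ours, purely for illustration; the real patch operates on the per-CPU rcu_data lists):

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	void (*func)(struct node *);	/* stands in for the RCU callback */
};

struct queue {
	struct node *head, **tail;
	int count, cap;
};

static int freed;			/* how many callbacks have run */

static void count_free(struct node *n) { (void)n; freed++; }

static void queue_init(struct queue *q, int cap)
{
	q->head = NULL;
	q->tail = &q->head;
	q->count = 0;
	q->cap = cap;
}

/* Enqueue; when the count exceeds the cap, pop the oldest entry and
 * invoke its callback immediately instead of letting the queue grow. */
static void enqueue(struct queue *q, struct node *n)
{
	n->next = NULL;
	*q->tail = n;
	q->tail = &n->next;
	if (++q->count > q->cap) {
		struct node *old = q->head;
		q->head = old->next;
		if (!q->head)
			q->tail = &q->head;
		q->count--;
		old->func(old);
	}
}
```

The queue therefore never holds more than `cap` entries, at the cost of the assumption questioned in the replies below.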


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: RCU_OOM.patch --]
[-- Type: text/plain, Size: 2078 bytes --]

--- linux-2.6.15/kernel/rcupdate.c	2006-01-03 04:21:10.000000000 +0100
+++ linux-2.6.15-edum/kernel/rcupdate.c	2006-01-06 11:10:45.000000000 +0100
@@ -71,14 +71,14 @@
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int maxbatch = 100;
 
 #ifndef __HAVE_ARCH_CMPXCHG
 /*
  * We use an array of spinlocks for the rcurefs -- similar to ones in sparc
  * 32 bit atomic_t implementations, and a hash function similar to that
  * for our refcounting needs.
- * Can't help multiprocessors which donot have cmpxchg :(
+ * Can't help multiprocessors which dont have cmpxchg :(
  */
 
 spinlock_t __rcuref_hash[RCUREF_HASH_SIZE] = {
@@ -110,9 +110,17 @@
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
 
-	if (unlikely(++rdp->count > 10000))
-		set_need_resched();
-
+/*
+ * OOM avoidance : If we queued too many items in this queue,
+ *  free the oldest entry
+ */
+	if (unlikely(++rdp->count > 10000)) {
+		rdp->count--;
+		head = rdp->donelist;
+		rdp->donelist = head->next;
+		local_irq_restore(flags);
+		return head->func(head);
+	}
 	local_irq_restore(flags);
 }
 
@@ -148,12 +156,17 @@
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-	rdp->count++;
 /*
- *  Should we directly call rcu_do_batch() here ?
- *  if (unlikely(rdp->count > 10000))
- *      rcu_do_batch(rdp);
+ * OOM avoidance : If we queued too many items in this queue,
+ *  free the oldest entry
  */
+	if (unlikely(++rdp->count > 10000)) {
+		rdp->count--;
+		head = rdp->donelist;
+		rdp->donelist = head->next;
+		local_irq_restore(flags);
+		return head->func(head);
+	}
 	local_irq_restore(flags);
 }
 
@@ -209,7 +222,7 @@
 static void rcu_do_batch(struct rcu_data *rdp)
 {
 	struct rcu_head *next, *list;
-	int count = 0;
+	int count = maxbatch;
 
 	list = rdp->donelist;
 	while (list) {
@@ -217,7 +230,7 @@
 		list->func(list);
 		list = next;
 		rdp->count--;
-		if (++count >= maxbatch)
+		if (--count <= 0)
 			break;
 	}
 	if (!rdp->donelist)


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency (Version 2), HOTPLUG_CPU fix
  2006-01-06 10:17       ` [PATCH, RFC] RCU : OOM avoidance and lower latency Eric Dumazet
@ 2006-01-06 12:52         ` Eric Dumazet
  2006-01-06 12:58         ` [PATCH, RFC] RCU : OOM avoidance and lower latency Andi Kleen
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2006-01-06 12:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, David S. Miller, Dipankar Sarma, Paul E. McKenney,
	Manfred Spraul, netdev

[-- Attachment #1: Type: text/plain, Size: 1617 bytes --]

The first patch was buggy, sorry :(

This second version makes no questionable RCU assumptions, because an item
to be deleted is only ever taken from the 'donelist' queue.  Items on the
donelist are already safe to free.

This V2 also corrects a problem with CPU hotplug: we forgot to update the
->count variable when transferring a queue to another one.

-------------------------------------------------------------------------
In order to avoid OOMs triggered by a flood of call_rcu() calls, in Linux
2.6.14 we increased maxbatch from 10 to 10000 and made call_rcu()
conditionally call set_need_resched().

This solution doesn't solve all of the problems, and it has drawbacks:

1) Using a big maxbatch has a bad impact on latency.
2) A flood of call_rcu_bh() can still OOM.

I have some servers that crash once in a while when the IP route cache is
flushed.  After raising /proc/sys/net/ipv4/route/secret_interval (so that
*no* flush is done), I got better uptime on these servers.  But in some
cases I think the network stack can flood call_rcu_bh(), and a fatal OOM
occurs.

In this patch I suggest:

1) Lowering maxbatch to a more reasonable value (as far as latency is
concerned).

2) Guarding an RCU per-CPU queue against a maximal count (10,000 for
example).  If this limit is reached, free the oldest entry (if one is
available on the donelist queue).

3) A bug fix in __rcu_offline_cpu(), where we forgot to adjust the ->count
field when transferring a queue to another one.

In my stress tests, I could no longer reproduce the OOM after applying this patch.
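The V2 constraint can be sketched as two lists per CPU (illustrative names only, not the kernel's): an over-cap enqueue may reclaim solely from the donelist, whose entries have already passed a grace period, and backs off when that list is empty:

```c
#include <assert.h>
#include <stddef.h>

struct item { struct item *next; };

struct cpu_queue {
	struct item *pending, **pend_tail;	/* grace period not yet over */
	struct item *donelist, **done_tail;	/* safe to free */
	int count;				/* entries across both lists */
};

static void cpu_queue_init(struct cpu_queue *q)
{
	q->pending = NULL;  q->pend_tail = &q->pending;
	q->donelist = NULL; q->done_tail = &q->donelist;
	q->count = 0;
}

/* Enqueue onto the pending list; past the cap, reclaim the oldest
 * donelist entry if there is one.  Returns the reclaimed item, or
 * NULL when nothing was safe to free (the queue simply keeps growing
 * until a grace period completes). */
static struct item *enqueue_capped(struct cpu_queue *q, struct item *n, int cap)
{
	n->next = NULL;
	*q->pend_tail = n;
	q->pend_tail = &n->next;
	if (++q->count > cap && q->donelist) {
		struct item *old = q->donelist;
		q->donelist = old->next;
		if (!q->donelist)
			q->done_tail = &q->donelist;
		q->count--;
		return old;
	}
	return NULL;
}
```

Unlike the first version, nothing is ever freed from the pending side, so no RCU reader can still hold a reference to a reclaimed entry.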

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: RCU_OOM.patch --]
[-- Type: text/plain, Size: 3213 bytes --]

--- linux-2.6.15/kernel/rcupdate.c	2006-01-03 04:21:10.000000000 +0100
+++ linux-2.6.15-edum/kernel/rcupdate.c	2006-01-06 13:32:02.000000000 +0100
@@ -71,14 +71,14 @@
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int maxbatch = 100;
 
 #ifndef __HAVE_ARCH_CMPXCHG
 /*
  * We use an array of spinlocks for the rcurefs -- similar to ones in sparc
  * 32 bit atomic_t implementations, and a hash function similar to that
  * for our refcounting needs.
- * Can't help multiprocessors which donot have cmpxchg :(
+ * Can't help multiprocessors which dont have cmpxchg :(
  */
 
 spinlock_t __rcuref_hash[RCUREF_HASH_SIZE] = {
@@ -110,9 +110,19 @@
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
 
-	if (unlikely(++rdp->count > 10000))
-		set_need_resched();
-
+/*
+ * OOM avoidance : If we queued too many items in this queue,
+ *  free the oldest entry (from the donelist only to respect
+ *  RCU constraints)
+ */
+	if (unlikely(++rdp->count > 10000 && (head = rdp->donelist))) {
+		rdp->count--;
+		rdp->donelist = head->next;
+		if (!rdp->donelist)
+			rdp->donetail = &rdp->donelist;
+		local_irq_restore(flags);
+		return head->func(head);
+	}
 	local_irq_restore(flags);
 }
 
@@ -148,12 +158,19 @@
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-	rdp->count++;
 /*
- *  Should we directly call rcu_do_batch() here ?
- *  if (unlikely(rdp->count > 10000))
- *      rcu_do_batch(rdp);
+ * OOM avoidance : If we queued too many items in this queue,
+ *  free the oldest entry (from the donelist only to respect
+ *  RCU constraints)
  */
+	if (unlikely(++rdp->count > 10000 && (head = rdp->donelist))) {
+		rdp->count--;
+		rdp->donelist = head->next;
+		if (!rdp->donelist)
+			rdp->donetail = &rdp->donelist;
+		local_irq_restore(flags);
+		return head->func(head);
+	}
 	local_irq_restore(flags);
 }
 
@@ -208,19 +225,20 @@
  */
 static void rcu_do_batch(struct rcu_data *rdp)
 {
-	struct rcu_head *next, *list;
-	int count = 0;
+	struct rcu_head *next = NULL, *list;
+	int count = maxbatch;
 
 	list = rdp->donelist;
 	while (list) {
-		next = rdp->donelist = list->next;
+		next = list->next;
 		list->func(list);
 		list = next;
 		rdp->count--;
-		if (++count >= maxbatch)
+		if (--count <= 0)
 			break;
 	}
-	if (!rdp->donelist)
+	rdp->donelist = next;
+	if (!next)
 		rdp->donetail = &rdp->donelist;
 	else
 		tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
@@ -344,11 +362,9 @@
 static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
 				struct rcu_head **tail)
 {
-	local_irq_disable();
 	*this_rdp->nxttail = list;
 	if (list)
 		this_rdp->nxttail = tail;
-	local_irq_enable();
 }
 
 static void __rcu_offline_cpu(struct rcu_data *this_rdp,
@@ -362,9 +378,12 @@
 	if (rcp->cur != rcp->completed)
 		cpu_quiet(rdp->cpu, rcp, rsp);
 	spin_unlock_bh(&rsp->lock);
+	local_irq_disable();
 	rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
 	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
-
+	this_rdp->count += rdp->count;
+	rdp->count = 0;
+	local_irq_enable();
 }
 static void rcu_offline_cpu(int cpu)
 {


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 10:17       ` [PATCH, RFC] RCU : OOM avoidance and lower latency Eric Dumazet
  2006-01-06 12:52         ` [PATCH, RFC] RCU : OOM avoidance and lower latency (Version 2), HOTPLUG_CPU fix Eric Dumazet
@ 2006-01-06 12:58         ` Andi Kleen
  2006-01-06 13:09           ` Eric Dumazet
  2006-01-06 19:26           ` Lee Revell
  2006-01-06 13:37         ` Alan Cox
  2006-01-06 19:24         ` Lee Revell
  3 siblings, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2006-01-06 12:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, David S. Miller, Dipankar Sarma,
	Paul E. McKenney, Manfred Spraul, netdev

On Friday 06 January 2006 11:17, Eric Dumazet wrote:

>
> I assume that if a CPU has queued 10,000 items in its RCU queue, then the
> oldest entry cannot still be in use by another CPU.  This might sound like a
> violation of RCU rules (I'm not an RCU expert), but it seems quite reasonable.

I don't think it's a good assumption. Another CPU might be stuck in a long 
running interrupt, and still have a reference in the code running below
the interrupt handler.

And in general letting correctness depend on magic numbers like this is 
very nasty.

-Andi


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 12:58         ` [PATCH, RFC] RCU : OOM avoidance and lower latency Andi Kleen
@ 2006-01-06 13:09           ` Eric Dumazet
  2006-01-06 19:26           ` Lee Revell
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2006-01-06 13:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, linux-kernel, David S. Miller, Dipankar Sarma,
	Paul E. McKenney, Manfred Spraul, netdev

Andi Kleen wrote:
> On Friday 06 January 2006 11:17, Eric Dumazet wrote:
> 
>> I assume that if a CPU has queued 10,000 items in its RCU queue, then the
>> oldest entry cannot still be in use by another CPU.  This might sound like a
>> violation of RCU rules (I'm not an RCU expert), but it seems quite reasonable.
> 
> I don't think it's a good assumption. Another CPU might be stuck in a long 
> running interrupt, and still have a reference in the code running below
> the interrupt handler.
> 
> And in general letting correctness depend on magic numbers like this is 
> very nasty.
> 

I agree, Andi. I posted a second version of the patch that makes no such assumption.

Eric




* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 10:17       ` [PATCH, RFC] RCU : OOM avoidance and lower latency Eric Dumazet
  2006-01-06 12:52         ` [PATCH, RFC] RCU : OOM avoidance and lower latency (Version 2), HOTPLUG_CPU fix Eric Dumazet
  2006-01-06 12:58         ` [PATCH, RFC] RCU : OOM avoidance and lower latency Andi Kleen
@ 2006-01-06 13:37         ` Alan Cox
  2006-01-06 14:00           ` Eric Dumazet
  2006-01-06 16:47           ` Paul E. McKenney
  2006-01-06 19:24         ` Lee Revell
  3 siblings, 2 replies; 34+ messages in thread
From: Alan Cox @ 2006-01-06 13:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, David S. Miller, Dipankar Sarma,
	Paul E. McKenney, Manfred Spraul, netdev

On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest 
> entry cannot still be in use by another CPU. This might sounds as a violation 
> of RCU rules, (I'm not an RCU expert) but seems quite reasonable.

Fixing the real problem in the routing code would be the real fix. 

The underlying problem of RCU and memory usage could be solved more
safely by making sure that the sleeping memory allocator path always
waits until at least one RCU cleanup has occurred after it fails an
allocation before it starts trying harder. That ought to also naturally
throttle memory consumers more in the situation which is the right
behaviour.

Alan
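Alan's throttling idea (after a failed sleeping allocation, wait for at least one RCU cleanup before trying harder) can be sketched as a userspace simulation. The names rcu_cleanup(), try_alloc(), and alloc_throttled() are invented for illustration; in the kernel the "wait" would be blocking until a grace period completes, not a direct function call.

```c
static int free_blocks;          /* immediately allocatable        */
static int pending_rcu;          /* freed, waiting on grace period */

/* Simulated end of one RCU grace period: pending memory is reclaimed. */
static void rcu_cleanup(void) { free_blocks += pending_rcu; pending_rcu = 0; }

static int try_alloc(void) { return free_blocks ? (free_blocks--, 1) : 0; }

/* Sleeping-allocator path per Alan's suggestion: on failure, wait for
 * one RCU cleanup before concluding there is real memory pressure. */
static int alloc_throttled(int *oom)
{
    if (try_alloc())
        return 1;
    rcu_cleanup();               /* stand-in for waiting on a grace period */
    if (try_alloc())
        return 1;
    *oom = 1;                    /* only now start "trying harder" */
    return 0;
}
```

With memory parked in the RCU backlog the retry succeeds without escalation; only a genuine shortage reaches the drastic path, which is the throttling behaviour Alan describes.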


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 13:37         ` Alan Cox
@ 2006-01-06 14:00           ` Eric Dumazet
  2006-01-06 14:45             ` Alan Cox
  2006-01-06 16:47           ` Paul E. McKenney
  1 sibling, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2006-01-06 14:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, linux-kernel, David S. Miller, Dipankar Sarma,
	Paul E. McKenney, Manfred Spraul, netdev

Alan Cox a écrit :
> On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
>> I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest 
>> entry cannot still be in use by another CPU. This might sounds as a violation 
>> of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
> 
> Fixing the real problem in the routing code would be the real fix. 
> 

So far nobody has succeeded in 'fixing the routing code'; few people can even
read the code from the first line to the last...

I think this code is not buggy: it only makes the general RCU assumptions
about delayed freeing of dst entries. In some cases the general assumptions
are simply wrong. We can fix this at the RCU level, and future users of
call_rcu_bh() won't have to think *hard* about 'general assumptions'.

Of course, we can also ignore the RCU problem and just put a sticker on it:
***DONT USE OR RISK CRASHES***
***USE IT ONLY FOR FUN***

> The underlying problem of RCU and memory usage could be solved more
> safely by making sure that the sleeping memory allocator path always
> waits until at least one RCU cleanup has occurred after it fails an
> allocation before it starts trying harder. That ought to also naturally
> throttle memory consumers more in the situation which is the right
> behaviour.
> 

In the case of call_rcu_bh(), you can be sure that the caller cannot afford
'sleeping memory allocations'. Better to drop a frame than to block the
stack, no?

Eric


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 14:00           ` Eric Dumazet
@ 2006-01-06 14:45             ` Alan Cox
  0 siblings, 0 replies; 34+ messages in thread
From: Alan Cox @ 2006-01-06 14:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, David S. Miller, Dipankar Sarma,
	Paul E. McKenney, Manfred Spraul, netdev

On Gwe, 2006-01-06 at 15:00 +0100, Eric Dumazet wrote:
> In the case of call_rcu_bh(), you can be sure that the caller cannot afford 
> 'sleeping memory allocations'. Better drop a frame than block the stack, no ?

Atomic allocations can't sleep and will fail, which is fine. If memory
allocation pressure exists for sleeping allocations because of a large
RCU backlog, we want to be sure that the RCU backlog from the networking
stack or other sources does not cause us to OOM-kill or take other
incorrect action.

So if for example we want to grow a process stack and the memory is
there just stuck in the RCU lists pending recovery we want to let the
RCU recovery happen before making drastic decisions.




* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 13:37         ` Alan Cox
  2006-01-06 14:00           ` Eric Dumazet
@ 2006-01-06 16:47           ` Paul E. McKenney
  2006-01-06 17:19             ` Eric Dumazet
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2006-01-06 16:47 UTC (permalink / raw)
  To: Alan Cox
  Cc: Eric Dumazet, Linus Torvalds, linux-kernel, David S. Miller,
	Dipankar Sarma, Manfred Spraul, netdev

On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
> On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> > I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest 
> > entry cannot still be in use by another CPU. This might sounds as a violation 
> > of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
> 
> Fixing the real problem in the routing code would be the real fix. 
> 
> The underlying problem of RCU and memory usage could be solved more
> safely by making sure that the sleeping memory allocator path always
> waits until at least one RCU cleanup has occurred after it fails an
> allocation before it starts trying harder. That ought to also naturally
> throttle memory consumers more in the situation which is the right
> behaviour.

A quick look at rt_garbage_collect() leads me to believe that although
the IP route cache does try to limit its use of memory, it does not
fully account for memory that it has released to RCU, but that RCU has
not yet freed due to a grace period not having elapsed.

The following appears to be possible:

1.	rt_garbage_collect() sees that there are too many entries,
	and sets "goal" to the number to free up, based on a
	computed "equilibrium" value.

2.	The number of entries is (correctly) decremented only when
	the corresponding RCU callback is invoked, which actually
	frees the entry.

3.	Between the time that rt_garbage_collect() is invoked the
	first time and when the RCU grace period ends, rt_garbage_collect()
	is invoked again.  It still sees too many entries (since
	RCU has not yet freed the ones released by the earlier
	invocation in step (1) above), so frees a bunch more.

4.	Packets routed now miss the route cache, because the corresponding
	entries are waiting for a grace period, slowing the system down.
	Therefore, even more entries are freed to make room for new
	entries corresponding to the new packets.

If my (likely quite naive) reading of the IP route cache code is correct,
it would be possible to end up in a steady state with most of the entries
always being in RCU rather than in the route cache.

Eric, could this be what is happening to your system?

If it is, one straightforward fix would be to keep a count of the number
of route-cache entries waiting on RCU, and for rt_garbage_collect()
to subtract this number of entries from its goal.  Does this make sense?

							Thanx, Paul
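Paul's straightforward fix can be sketched as a small accounting adjustment: track how many entries have been handed to RCU but not yet freed, and subtract them when computing the garbage-collection goal. The function and parameter names (gc_goal, in_rcu, equilibrium) are invented for this sketch; they are not the actual rt_garbage_collect() code.

```c
/* Goal computation per Paul's suggestion: entries already released to
 * RCU are as good as freed, so they should not inflate the GC goal. */
static int gc_goal(int entries, int equilibrium, int in_rcu)
{
    int goal = entries - in_rcu - equilibrium;
    return goal > 0 ? goal : 0;
}
```

With this adjustment, repeated invocations during one grace period stop freeing "a bunch more" entries that are already on their way out.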


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 16:47           ` Paul E. McKenney
@ 2006-01-06 17:19             ` Eric Dumazet
  2006-01-06 20:26               ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2006-01-06 17:19 UTC (permalink / raw)
  To: paulmck
  Cc: Alan Cox, Linus Torvalds, linux-kernel, David S. Miller,
	Dipankar Sarma, Manfred Spraul, netdev

Paul E. McKenney a écrit :
> On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
>> On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
>>> I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest 
>>> entry cannot still be in use by another CPU. This might sounds as a violation 
>>> of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
>> Fixing the real problem in the routing code would be the real fix. 
>>
>> The underlying problem of RCU and memory usage could be solved more
>> safely by making sure that the sleeping memory allocator path always
>> waits until at least one RCU cleanup has occurred after it fails an
>> allocation before it starts trying harder. That ought to also naturally
>> throttle memory consumers more in the situation which is the right
>> behaviour.
> 
> A quick look at rt_garbage_collect() leads me to believe that although
> the IP route cache does try to limit its use of memory, it does not
> fully account for memory that it has released to RCU, but that RCU has
> not yet freed due to a grace period not having elapsed.
> 
> The following appears to be possible:
> 
> 1.	rt_garbage_collect() sees that there are too many entries,
> 	and sets "goal" to the number to free up, based on a
> 	computed "equilibrium" value.
> 
> 2.	The number of entries is (correctly) decremented only when
> 	the corresponding RCU callback is invoked, which actually
> 	frees the entry.
> 
> 3.	Between the time that rt_garbage_collect() is invoked the
> 	first time and when the RCU grace period ends, rt_garbage_collect()
> 	is invoked again.  It still sees too many entries (since
> 	RCU has not yet freed the ones released by the earlier
> 	invocation in step (1) above), so frees a bunch more.
> 
> 4.	Packets routed now miss the route cache, because the corresponding
> 	entries are waiting for a grace period, slowing the system down.
> 	Therefore, even more entries are freed to make room for new
> 	entries corresponding to the new packets.
> 
> If my (likely quite naive) reading of the IP route cache code is correct,
> it would be possible to end up in a steady state with most of the entries
> always being in RCU rather than in the route cache.
> 
> Eric, could this be what is happening to your system?
> 
> If it is, one straightforward fix would be to keep a count of the number
> of route-cache entries waiting on RCU, and for rt_garbage_collect()
> to subtract this number of entries from its goal.  Does this make sense?
> 

Hi Paul

Thanks for reviewing route code :)

As I said, the problem comes from the 'route cache flush' that is
periodically done by rt_run_flush(), triggered by rt_flush_timer.

The 10% of LOWMEM RAM that was used by route-cache entries is pushed into the
RCU queues (with call_rcu_bh()) while the network continues to receive
packets from *many* sources that each want their own route-cache entry.


Eric



* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 10:17       ` [PATCH, RFC] RCU : OOM avoidance and lower latency Eric Dumazet
                           ` (2 preceding siblings ...)
  2006-01-06 13:37         ` Alan Cox
@ 2006-01-06 19:24         ` Lee Revell
  3 siblings, 0 replies; 34+ messages in thread
From: Lee Revell @ 2006-01-06 19:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, David S. Miller, Dipankar Sarma,
	Paul E. McKenney, Manfred Spraul, netdev

On Fri, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> I have some servers that once in a while crashes when the ip route
> cache is flushed. After
> raising /proc/sys/net/ipv4/route/secret_interval (so that *no* 
> flush is done), I got better uptime for these servers. 

Argh, where is that documented?  I have been banging my head against
this for weeks: how do I keep the kernel from flushing 4096 routes at
once in softirq context, causing huge (~8-20ms) latency problems?

I tried all the route-related sysctls I could find and nothing worked...
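For reference, the knob Eric mentions is exposed both as a sysctl and under /proc on 2.6 kernels of this era. The value below is just a placeholder (one year, in seconds); how large to make it is a policy decision, and disabling rekeying trades away the hash-collision protection the flush exists to provide.

```shell
# Regenerate the route-cache secret (and thus flush the cache) far less
# often; the default interval is 600 seconds.
sysctl -w net.ipv4.route.secret_interval=31536000

# Equivalently, via procfs:
echo 31536000 > /proc/sys/net/ipv4/route/secret_interval
```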

Lee



* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 12:58         ` [PATCH, RFC] RCU : OOM avoidance and lower latency Andi Kleen
  2006-01-06 13:09           ` Eric Dumazet
@ 2006-01-06 19:26           ` Lee Revell
  2006-01-06 22:18             ` Andi Kleen
  1 sibling, 1 reply; 34+ messages in thread
From: Lee Revell @ 2006-01-06 19:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric Dumazet, Linus Torvalds, linux-kernel, David S. Miller,
	Dipankar Sarma, Paul E. McKenney, Manfred Spraul, netdev

On Fri, 2006-01-06 at 13:58 +0100, Andi Kleen wrote:
> Another CPU might be stuck in a long 
> running interrupt

Shouldn't a long running interrupt be considered a bug?

Lee



* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 17:19             ` Eric Dumazet
@ 2006-01-06 20:26               ` Paul E. McKenney
  2006-01-06 20:33                 ` David S. Miller
  2006-01-06 20:57                 ` Andi Kleen
  0 siblings, 2 replies; 34+ messages in thread
From: Paul E. McKenney @ 2006-01-06 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alan Cox, Linus Torvalds, linux-kernel, David S. Miller,
	Dipankar Sarma, Manfred Spraul, netdev

On Fri, Jan 06, 2006 at 06:19:15PM +0100, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
> >On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
> >>On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> >>>I assume that if a CPU queued 10.000 items in its RCU queue, then the 
> >>>oldest entry cannot still be in use by another CPU. This might sounds as 
> >>>a violation of RCU rules, (I'm not an RCU expert) but seems quite 
> >>>reasonable.
> >>Fixing the real problem in the routing code would be the real fix. 
> >>
> >>The underlying problem of RCU and memory usage could be solved more
> >>safely by making sure that the sleeping memory allocator path always
> >>waits until at least one RCU cleanup has occurred after it fails an
> >>allocation before it starts trying harder. That ought to also naturally
> >>throttle memory consumers more in the situation which is the right
> >>behaviour.
> >
> >A quick look at rt_garbage_collect() leads me to believe that although
> >the IP route cache does try to limit its use of memory, it does not
> >fully account for memory that it has released to RCU, but that RCU has
> >not yet freed due to a grace period not having elapsed.
> >
> >The following appears to be possible:
> >
> >1.	rt_garbage_collect() sees that there are too many entries,
> >	and sets "goal" to the number to free up, based on a
> >	computed "equilibrium" value.
> >
> >2.	The number of entries is (correctly) decremented only when
> >	the corresponding RCU callback is invoked, which actually
> >	frees the entry.
> >
> >3.	Between the time that rt_garbage_collect() is invoked the
> >	first time and when the RCU grace period ends, rt_garbage_collect()
> >	is invoked again.  It still sees too many entries (since
> >	RCU has not yet freed the ones released by the earlier
> >	invocation in step (1) above), so frees a bunch more.
> >
> >4.	Packets routed now miss the route cache, because the corresponding
> >	entries are waiting for a grace period, slowing the system down.
> >	Therefore, even more entries are freed to make room for new
> >	entries corresponding to the new packets.
> >
> >If my (likely quite naive) reading of the IP route cache code is correct,
> >it would be possible to end up in a steady state with most of the entries
> >always being in RCU rather than in the route cache.
> >
> >Eric, could this be what is happening to your system?
> >
> >If it is, one straightforward fix would be to keep a count of the number
> >of route-cache entries waiting on RCU, and for rt_garbage_collect()
> >to subtract this number of entries from its goal.  Does this make sense?
> >
> 
> Hi Paul
> 
> Thanks for reviewing route code :)
> 
> As I said, the problem comes from 'route flush cache', that is periodically 
> done by rt_run_flush(), triggered by rt_flush_timer.
> 
> The 10% of LOWMEM ram that was used by route-cache entries are pushed into 
> rcu queues (with call_rcu_bh()) and network continue to receive
> packets from *many* sources that want their route-cache entry.

Hello, Eric,

The rt_run_flush() function could indeed be suffering from the same
problem.  Dipankar's recent patch should help RCU grace periods proceed
more quickly; does that help?

If not, it may be worthwhile to limit the number of times that
rt_run_flush() runs per RCU grace period.

						Thanx, Paul


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 20:26               ` Paul E. McKenney
@ 2006-01-06 20:33                 ` David S. Miller
  2006-01-06 20:57                 ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: David S. Miller @ 2006-01-06 20:33 UTC (permalink / raw)
  To: paulmck; +Cc: dada1, alan, torvalds, linux-kernel, dipankar, manfred, netdev

From: "Paul E. McKenney" <paulmck@us.ibm.com>
Date: Fri, 6 Jan 2006 12:26:26 -0800

> If not, it may be worthwhile to limit the number of times that
> rt_run_flush() runs per RCU grace period.

This is mixing two sets of requirements.

rt_run_flush() runs periodically in order to regenerate the hash
function secret key.  Now, for that specific case it might actually be
possible to rehash instead of flush, but the locking is a little bit
tricky :-)  Also, I think we're regenerating the secret key a little
too often; we'd probably get enough security with less frequent
regeneration.

I'll look into this and your other ideas later today hopefully.
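The rehash-instead-of-flush idea mentioned above can be sketched in userspace: when the secret changes, walk the chains and move every entry to its new bucket instead of dropping them all. The toy hash function and the single-table layout are assumptions for illustration; as noted, the real locking is the tricky part and is not modelled here.

```c
#include <stddef.h>

#define BUCKETS 8

struct rt { unsigned key; struct rt *next; };

static struct rt *table[BUCKETS];
static unsigned secret;

static unsigned hash(unsigned key) { return (key ^ secret) & (BUCKETS - 1); }

static void insert(struct rt *e)
{
    unsigned h = hash(e->key);
    e->next = table[h];
    table[h] = e;
}

/* Rehash in place after changing the secret: every entry survives,
 * unlike rt_run_flush(), which frees all the chains. */
static void rehash(unsigned new_secret)
{
    struct rt *all = NULL, *e, *next;

    for (int i = 0; i < BUCKETS; i++) {      /* unlink everything */
        for (e = table[i]; e; e = next) {
            next = e->next;
            e->next = all;
            all = e;
        }
        table[i] = NULL;
    }
    secret = new_secret;
    for (e = all; e; e = next) {             /* reinsert under new key */
        next = e->next;
        insert(e);
    }
}

static int count(void)
{
    int n = 0;
    for (int i = 0; i < BUCKETS; i++)
        for (struct rt *e = table[i]; e; e = e->next)
            n++;
    return n;
}
```

The point of the exercise: rekeying need not discard the cache, it only needs to relocate entries, so the RCU queues never see the flood that a full flush produces.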


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 20:26               ` Paul E. McKenney
  2006-01-06 20:33                 ` David S. Miller
@ 2006-01-06 20:57                 ` Andi Kleen
  2006-01-07  0:17                   ` David S. Miller
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2006-01-06 20:57 UTC (permalink / raw)
  To: paulmck
  Cc: Eric Dumazet, Alan Cox, Linus Torvalds, linux-kernel,
	David S. Miller, Dipankar Sarma, Manfred Spraul, netdev

On Friday 06 January 2006 21:26, Paul E. McKenney wrote:

> If not, it may be worthwhile to limit the number of times that
> rt_run_flush() runs per RCU grace period.

The problem is that without rt_run_flush() new routes and route-attribute
changes don't get used by the stack. If RCU takes a long time and routes
keep changing, this could be a big issue.

As an admin I would certainly be annoyed if the network stack
ignored my new route for some unbounded time.

Perhaps a better way would be to just exclude dst entries in the RCU state
from the normal accounting, and assume that if the system really does run
short of memory because of this, the resulting pressure would trigger
quiescent states more quickly, freeing the memory again.

-Andi


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 19:26           ` Lee Revell
@ 2006-01-06 22:18             ` Andi Kleen
  0 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2006-01-06 22:18 UTC (permalink / raw)
  To: Lee Revell
  Cc: Eric Dumazet, Linus Torvalds, linux-kernel, David S. Miller,
	Dipankar Sarma, Paul E. McKenney, Manfred Spraul, netdev

On Friday 06 January 2006 20:26, Lee Revell wrote:
> On Fri, 2006-01-06 at 13:58 +0100, Andi Kleen wrote:
> > Another CPU might be stuck in a long 
> > running interrupt
> 
> Shouldn't a long running interrupt be considered a bug?

In normal operation yes, but there can always be exceptional
circumstances where it's unavoidable (e.g. during error handling),
and in the name of defensive programming the rest of the system ought
to tolerate it.

-Andi


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-06 20:57                 ` Andi Kleen
@ 2006-01-07  0:17                   ` David S. Miller
  2006-01-07  1:09                     ` Andi Kleen
  2006-01-07  8:30                     ` Eric Dumazet
  0 siblings, 2 replies; 34+ messages in thread
From: David S. Miller @ 2006-01-07  0:17 UTC (permalink / raw)
  To: ak
  Cc: paulmck, dada1, alan, torvalds, linux-kernel, dipankar, manfred, netdev

From: Andi Kleen <ak@suse.de>
Date: Fri, 6 Jan 2006 21:57:41 +0100

> Perhaps a better way would be to just exclude dst entries in RCU state
> from the normal accounting and assume that if the system
> really runs short of memory because of this the results would
> trigger quiescent states more quickly, freeing the memory again.

That's one idea...

Eric, how important do you honestly think the per-hashchain spinlocks
are?  That's the big barrier to making rt_secret_rebuild() a simple
rehash instead of flushing the whole table as it does now.

The lock is only grabbed for updates, and access to these locks is
random and as such probably non-local when taken anyway.  Back before
we used RCU for reads, this array-of-spinlocks thing made a lot more
sense.

I mean something like this patch:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f701a13..f9436c7 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -204,36 +204,8 @@ __u8 ip_tos2prio[16] = {
 struct rt_hash_bucket {
 	struct rtable	*chain;
 };
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-/*
- * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
- * The size of this table is a power of two and depends on the number of CPUS.
- */
-#if NR_CPUS >= 32
-#define RT_HASH_LOCK_SZ	4096
-#elif NR_CPUS >= 16
-#define RT_HASH_LOCK_SZ	2048
-#elif NR_CPUS >= 8
-#define RT_HASH_LOCK_SZ	1024
-#elif NR_CPUS >= 4
-#define RT_HASH_LOCK_SZ	512
-#else
-#define RT_HASH_LOCK_SZ	256
-#endif
 
-static spinlock_t	*rt_hash_locks;
-# define rt_hash_lock_addr(slot) &rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)]
-# define rt_hash_lock_init()	{ \
-		int i; \
-		rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ, GFP_KERNEL); \
-		if (!rt_hash_locks) panic("IP: failed to allocate rt_hash_locks\n"); \
-		for (i = 0; i < RT_HASH_LOCK_SZ; i++) \
-			spin_lock_init(&rt_hash_locks[i]); \
-		}
-#else
-# define rt_hash_lock_addr(slot) NULL
-# define rt_hash_lock_init()
-#endif
+static DEFINE_SPINLOCK(rt_hash_lock);
 
 static struct rt_hash_bucket 	*rt_hash_table;
 static unsigned			rt_hash_mask;
@@ -627,7 +599,7 @@ static void rt_check_expire(unsigned lon
 
 		if (*rthp == 0)
 			continue;
-		spin_lock(rt_hash_lock_addr(i));
+		spin_lock(&rt_hash_lock);
 		while ((rth = *rthp) != NULL) {
 			if (rth->u.dst.expires) {
 				/* Entry is expired even if it is in use */
@@ -660,7 +632,7 @@ static void rt_check_expire(unsigned lon
  			rt_free(rth);
 #endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
 		}
-		spin_unlock(rt_hash_lock_addr(i));
+		spin_unlock(&rt_hash_lock);
 
 		/* Fallback loop breaker. */
 		if (time_after(jiffies, now))
@@ -683,11 +655,11 @@ static void rt_run_flush(unsigned long d
 	get_random_bytes(&rt_hash_rnd, 4);
 
 	for (i = rt_hash_mask; i >= 0; i--) {
-		spin_lock_bh(rt_hash_lock_addr(i));
+		spin_lock_bh(&rt_hash_lock);
 		rth = rt_hash_table[i].chain;
 		if (rth)
 			rt_hash_table[i].chain = NULL;
-		spin_unlock_bh(rt_hash_lock_addr(i));
+		spin_unlock_bh(&rt_hash_lock);
 
 		for (; rth; rth = next) {
 			next = rth->u.rt_next;
@@ -820,7 +792,7 @@ static int rt_garbage_collect(void)
 
 			k = (k + 1) & rt_hash_mask;
 			rthp = &rt_hash_table[k].chain;
-			spin_lock_bh(rt_hash_lock_addr(k));
+			spin_lock_bh(&rt_hash_lock);
 			while ((rth = *rthp) != NULL) {
 				if (!rt_may_expire(rth, tmo, expire)) {
 					tmo >>= 1;
@@ -852,7 +824,7 @@ static int rt_garbage_collect(void)
 				goal--;
 #endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
 			}
-			spin_unlock_bh(rt_hash_lock_addr(k));
+			spin_unlock_bh(&rt_hash_lock);
 			if (goal <= 0)
 				break;
 		}
@@ -922,7 +894,7 @@ restart:
 
 	rthp = &rt_hash_table[hash].chain;
 
-	spin_lock_bh(rt_hash_lock_addr(hash));
+	spin_lock_bh(&rt_hash_lock);
 	while ((rth = *rthp) != NULL) {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED
 		if (!(rth->u.dst.flags & DST_BALANCED) &&
@@ -948,7 +920,7 @@ restart:
 			rth->u.dst.__use++;
 			dst_hold(&rth->u.dst);
 			rth->u.dst.lastuse = now;
-			spin_unlock_bh(rt_hash_lock_addr(hash));
+			spin_unlock_bh(&rt_hash_lock);
 
 			rt_drop(rt);
 			*rp = rth;
@@ -989,7 +961,7 @@ restart:
 	if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
 		int err = arp_bind_neighbour(&rt->u.dst);
 		if (err) {
-			spin_unlock_bh(rt_hash_lock_addr(hash));
+			spin_unlock_bh(&rt_hash_lock);
 
 			if (err != -ENOBUFS) {
 				rt_drop(rt);
@@ -1030,7 +1002,7 @@ restart:
 	}
 #endif
 	rt_hash_table[hash].chain = rt;
-	spin_unlock_bh(rt_hash_lock_addr(hash));
+	spin_unlock_bh(&rt_hash_lock);
 	*rp = rt;
 	return 0;
 }
@@ -1098,7 +1070,7 @@ static void rt_del(unsigned hash, struct
 {
 	struct rtable **rthp;
 
-	spin_lock_bh(rt_hash_lock_addr(hash));
+	spin_lock_bh(&rt_hash_lock);
 	ip_rt_put(rt);
 	for (rthp = &rt_hash_table[hash].chain; *rthp;
 	     rthp = &(*rthp)->u.rt_next)
@@ -1107,7 +1079,7 @@ static void rt_del(unsigned hash, struct
 			rt_free(rt);
 			break;
 		}
-	spin_unlock_bh(rt_hash_lock_addr(hash));
+	spin_unlock_bh(&rt_hash_lock);
 }
 
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
@@ -3155,7 +3127,6 @@ int __init ip_rt_init(void)
 					&rt_hash_mask,
 					0);
 	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
-	rt_hash_lock_init();
 
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  0:17                   ` David S. Miller
@ 2006-01-07  1:09                     ` Andi Kleen
  2006-01-07  7:10                       ` David S. Miller
  2006-01-07  7:34                       ` Eric Dumazet
  2006-01-07  8:30                     ` Eric Dumazet
  1 sibling, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2006-01-07  1:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: paulmck, dada1, alan, torvalds, linux-kernel, dipankar, manfred, netdev

On Saturday 07 January 2006 01:17, David S. Miller wrote:

> 
> I mean something like this patch:

Looks like a good idea to me.

I always disliked the per-chain spinlocks, even for other hash tables like
the TCP/UDP demux: it would be much nicer to use a much smaller, separately
hashed lock table and save cache. In this case the special case of a
one-entry lock hash table makes sense.

-Andi


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  1:09                     ` Andi Kleen
@ 2006-01-07  7:10                       ` David S. Miller
  2006-01-07  7:34                       ` Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: David S. Miller @ 2006-01-07  7:10 UTC (permalink / raw)
  To: ak
  Cc: paulmck, dada1, alan, torvalds, linux-kernel, dipankar, manfred, netdev

From: Andi Kleen <ak@suse.de>
Date: Sat, 7 Jan 2006 02:09:01 +0100

> I always disliked the per chain spinlocks even for other hash tables like
> TCP/UDP multiplex - it would be much nicer to use a much smaller separately 
> hashed lock table and save cache. In this case the special case of using
> a one entry only lock hash table makes sense.

I used to think they were a great technique.  But in each case I
thought they could be applied, better schemes have come along.
In the case of the page cache we went to a per-address-space tree,
and here in the routing cache we went to RCU.

There are RCU patches around for the TCP hashes and I'd like to
put those in at some point as well.  In fact, they'd be even
more far reaching since Arnaldo abstracted away the socket
hashing stuff into an inet_hashtables subsystem.


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  1:09                     ` Andi Kleen
  2006-01-07  7:10                       ` David S. Miller
@ 2006-01-07  7:34                       ` Eric Dumazet
  2006-01-07  7:44                         ` David S. Miller
  1 sibling, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2006-01-07  7:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David S. Miller, paulmck, alan, torvalds, linux-kernel, dipankar,
	manfred, netdev

Andi Kleen a écrit :
> 
> I always disliked the per chain spinlocks even for other hash tables like
> TCP/UDP multiplex - it would be much nicer to use a much smaller separately 
> hashed lock table and save cache. In this case the special case of using
> a one entry only lock hash table makes sense.
> 

I agree; I use a hashed spinlock array in my local tree for TCP, mainly to
reduce the hash table size by a factor of two.

Eric


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  7:34                       ` Eric Dumazet
@ 2006-01-07  7:44                         ` David S. Miller
  2006-01-07  7:53                           ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: David S. Miller @ 2006-01-07  7:44 UTC (permalink / raw)
  To: dada1
  Cc: ak, paulmck, alan, torvalds, linux-kernel, dipankar, manfred, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 07 Jan 2006 08:34:35 +0100

> I agree, I do use a hashed spinlock array on my local tree for TCP,
> mainly to reduce the hash table size by a 2 factor.

So what do you think about going to a single spinlock for the
routing cache?


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  7:44                         ` David S. Miller
@ 2006-01-07  7:53                           ` Eric Dumazet
  2006-01-07  8:36                             ` David S. Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2006-01-07  7:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, paulmck, alan, torvalds, linux-kernel, dipankar, manfred, netdev

David S. Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 07 Jan 2006 08:34:35 +0100
> 
>> I agree, I do use a hashed spinlock array on my local tree for TCP,
>> mainly to reduce the hash table size by a 2 factor.
> 
> So what do you think about going to a single spinlock for the
> routing cache?

I have no problem with this, since the biggest server I have is 4-way, but are
you sure big machines won't suffer from this single spinlock?

Also, I don't understand what you want to do after this single-spinlock patch.
How is it supposed to help the 'ip route flush cache' problem?

In my case, I have about 600,000 dst entries:

# grep ip_dst /proc/slabinfo
ip_dst_cache      616250 622440    320   12    1 : tunables   54   27    8 : slabdata  51870  51870      0
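The slabinfo line above can be turned into a rough memory figure: ~622k allocated objects of 320 bytes each is close to 200 MB, consistent with the "10% of LOWMEM" figure mentioned earlier for a box with roughly 2 GB of lowmem. A quick check (numbers taken straight from the output; the 4096-byte page size and 1 page per slab are from the slabinfo fields):

```c
/* Fields from the slabinfo line: <active> <num_objs> <objsize> ...
 * slabdata <active_slabs> <num_slabs> ... (1 page per slab here). */
static long object_bytes(long num_objs, long objsize)
{
    return num_objs * objsize;
}

static long slab_bytes(long num_slabs, long pages_per_slab, long page_size)
{
    return num_slabs * pages_per_slab * page_size;
}
```

622440 objects * 320 bytes is about 199 MB of dst entries, with a total slab footprint of 51870 pages, about 212 MB.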


Eric


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  0:17                   ` David S. Miller
  2006-01-07  1:09                     ` Andi Kleen
@ 2006-01-07  8:30                     ` Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2006-01-07  8:30 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, paulmck, alan, torvalds, linux-kernel, dipankar, manfred, netdev

David S. Miller wrote:
> 
> Eric, how important do you honestly think the per-hashchain spinlocks
> are?  That's the big barrier from making rt_secret_rebuild() a simple
> rehash instead of flushing the whole table as it does now.
> 

No problem for me in going to a single spinlock.
I did the hashed-spinlock patch in order to reduce the size of the route hash
table without hurting big NUMA machines. If you think a single spinlock is OK,
that's even better!

> The lock is only grabbed for updates, and the access to these locks is
> random and as such probably non-local when taken anyways.  Back before
> we used RCU for reads, this array-of-spinlock thing made a lot more
> sense.
> 
> I mean something like this patch:
> 

> +static DEFINE_SPINLOCK(rt_hash_lock);


Just one point: this should be cache-line aligned, and should occupy one full
cache line to avoid false sharing. (If a CPU takes the lock, there is no need
to invalidate *rt_hash_table for all the other CPUs.)

Eric


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  7:53                           ` Eric Dumazet
@ 2006-01-07  8:36                             ` David S. Miller
  2006-01-07 20:30                               ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: David S. Miller @ 2006-01-07  8:36 UTC (permalink / raw)
  To: dada1
  Cc: ak, paulmck, alan, torvalds, linux-kernel, dipankar, manfred, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 07 Jan 2006 08:53:52 +0100

> I have no problem with this, since the biggest server I have is
> 4-way, but are you sure big machines won't suffer from this single
> spinlock?

It is the main question.

> Also, I don't understand what you want to do after this single
> spinlock patch.  How is it supposed to help the 'ip route flush
> cache' problem?  In my case, I have about 600,000 dst entries:

I don't claim to have a solution to this problem currently.

Doing RCU and going through the whole DST GC machinery is overkill for
an active system.  So, perhaps a very simple solution will do:

1) In rt_run_flush(), do not rt_free(); instead, collect all active
   routing cache entries onto a global list and start a timer to
   fire in 10 seconds (or some sysctl-configurable amount).

2) When a new routing cache entry is needed, first check the global
   list built in #1 above; failing that, do dst_alloc() as is done
   currently.

3) If the timer expires, rt_free() any entries remaining on the global list.

The missing trick is how to ensure RCU semantics when reallocating
from the global list.

The idea is that an active system will immediately repopulate itself
with all of these entries just flushed from the table.

RCU really doesn't handle this kind of problem very well.  It truly
excels when work is generated in process context, not interrupt
context.


* Re: [PATCH, RFC] RCU : OOM avoidance and lower latency
  2006-01-07  8:36                             ` David S. Miller
@ 2006-01-07 20:30                               ` Paul E. McKenney
  0 siblings, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2006-01-07 20:30 UTC (permalink / raw)
  To: David S. Miller
  Cc: dada1, ak, alan, torvalds, linux-kernel, dipankar, manfred, netdev

On Sat, Jan 07, 2006 at 12:36:25AM -0800, David S. Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 07 Jan 2006 08:53:52 +0100
> 
> > I have no problem with this, since the biggest server I have is
> > 4-way, but are you sure big machines won't suffer from this single
> > spinlock?
> 
> It is the main question.
> 
> > Also, I don't understand what you want to do after this single
> > spinlock patch.  How is it supposed to help the 'ip route flush
> > cache' problem?  In my case, I have about 600,000 dst entries:
> 
> I don't claim to have a solution to this problem currently.
> 
> Doing RCU and going through the whole DST GC machinery is overkill for
> an active system.  So, perhaps a very simple solution will do:
> 
> 1) In rt_run_flush(), do not rt_free(); instead, collect all active
>    routing cache entries onto a global list and start a timer to
>    fire in 10 seconds (or some sysctl-configurable amount).
> 
> 2) When a new routing cache entry is needed, first check the global
>    list built in #1 above; failing that, do dst_alloc() as is done
>    currently.
> 
> 3) If the timer expires, rt_free() any entries remaining on the global list.
> 
> The missing trick is how to ensure RCU semantics when reallocating
> from the global list.

The straightforward ways of doing this require a per-entry lock in
addition to the dst_entry reference count -- lots of read-side overhead.

More complex approaches use a generation number that is incremented
when adding to or removing from the global list.  When the generation
number overflows, unconditionally rt_free() it rather than adding
to the global list again.  Then there needs to be some clever code
on the read side to detect the case when the generation number
changes while acquiring a reference.  And memory barriers.  Also
lots of read-side overhead.  Also, it is now -always- necessary to
acquire a reference on the read-side.

> The idea is that an active system will immediately repopulate itself
> with all of these entries just flushed from the table.
> 
> RCU really doesn't handle this kind of problem very well.  It truly
> excels when work is generated in process context, not interrupt
> context.

Sounds like a challenge to me.  ;-)

Well, one possible way to attack Eric's workload might be the following:

o	Size the hash table to strike the appropriate balance between
	read-side search overhead and memory consumption.  Call the
	number of hash-chain headers N.

o	Create a hashed array of locks sized to allow the update to
	proceed sufficiently quickly.  Call the number of locks M,
	probably a power of two.  This means that M CPUs can be doing
	the update in parallel.

o	Create an array of M^2 list headers (call it xfer[][]), but
	since this is only needed during an update, it can be allocated
	and deallocated if need be.  (Me, with my big-server experience,
	would probably just create the array, since M is not likely to
	be too large.  But your mileage may vary.  And you really only
	need M*(M-1) list headers, but that makes the index calculation
	a bit more annoying.)

o	Use a two-phase update.  In the first phase, each updating
	CPU acquires the corresponding lock and removes entries from
	the corresponding partition of the hash table.  If the new
	location of a given entry falls into the same partition, it
	is added back to the appropriate hash chain of that partition.
	Otherwise, add the entry to xfer[dst][src], where "src" and 
	"dst" are indexes of the corresponding partitions.

o	When all CPUs finish removing entries from their partition,
	they check into a barrier.  Once all have checked in, they
	can start the second phase of the update.

o	In the second phase, each CPU removes the entries from the
	xfer array that are destined for its partition and adds them
	to the hash chain that they are destined for.

Some commentary and variations, in the hope that this inspires someone
to come up with an even better idea:

o	Unless M is at least three, there is no performance gain
	over a single global lock with a single CPU doing the update,
	since each element must now undergo four list operations rather
	than just two.

o	The xfer[][] array must have each entry cache-aligned, or
	you lose big on cacheline effects.  Note that it is -not-
	sufficient to simply align the rows or the columns, since
	each CPU has its own column when inserting and its own
	row when removing from xfer[][].

o	And the data-skew effects are less severe if this procedure
	runs from process context.  A spinning barrier must be used
	otherwise.  But note that the per-partition locks could remain
	spinlocks; only the barrier need involve sleeping (in case
	that helps; I am getting a bit ahead of my understanding of
	this part of the kernel).

	So I half-agree with Dave -- this works better if the bulk
	update is done in process context, however, updates involving
	single entries being individually added and removed from the
	hash table can be done from interrupt context.

o	One (more complex!) way to reduce the effect of data skew is to
	partition the array based on the number of elements in each
	partition rather than by the number of chains, but this would
	require a second barrier, completed before dropping the locks,
	in order to avoid messing up concurrent single-element updates.

	And this only handles the phase-1 data-skew effects. It is
	possible to handle phase-2 data skew as well, but it takes an
	up-front full-table scan and so it is not clear how it could
	possibly be a performance win.

o	Note that routing slows down significantly while an update is
	in progress, because on average only half of the entries are
	in the hash table during the update.  I have no idea whether
	or not this is acceptable.  I am assuming that the same
	mechanism that prevents two concurrent route-cache misses
	from inserting two entries from the same route would also
	prevent adding a new entry when one was already in the
	xfer array.

Thoughts?

							Thanx, Paul


end of thread, other threads:[~2006-01-07 20:29 UTC | newest]

Thread overview: 34+ messages
     [not found] <20060105235845.967478000@sorel.sous-sol.org>
2006-01-05 21:47 ` [PATCH 0/6] -stable review Chris Wright
2006-01-06  0:45   ` [PATCH 1/6] drivers/net/sungem.c: gem_remove_one mustnt be __devexit Chris Wright
2006-01-06  0:45   ` [PATCH 2/6] ieee80211_crypt_tkip depends on NET_RADIO Chris Wright
2006-01-06  0:45   ` [PATCH 3/6] Insanity avoidance in /proc (CVE-2005-4605) Chris Wright
2006-01-06  0:45   ` [PATCH 4/6] sysctl: dont overflow the user-supplied buffer with 0 Chris Wright
2006-01-06  1:30     ` Linus Torvalds
2006-01-06  3:40       ` Chris Wright
2006-01-06 10:17       ` [PATCH, RFC] RCU : OOM avoidance and lower latency Eric Dumazet
2006-01-06 12:52         ` [PATCH, RFC] RCU : OOM avoidance and lower latency (Version 2), HOTPLUG_CPU fix Eric Dumazet
2006-01-06 12:58         ` [PATCH, RFC] RCU : OOM avoidance and lower latency Andi Kleen
2006-01-06 13:09           ` Eric Dumazet
2006-01-06 19:26           ` Lee Revell
2006-01-06 22:18             ` Andi Kleen
2006-01-06 13:37         ` Alan Cox
2006-01-06 14:00           ` Eric Dumazet
2006-01-06 14:45             ` Alan Cox
2006-01-06 16:47           ` Paul E. McKenney
2006-01-06 17:19             ` Eric Dumazet
2006-01-06 20:26               ` Paul E. McKenney
2006-01-06 20:33                 ` David S. Miller
2006-01-06 20:57                 ` Andi Kleen
2006-01-07  0:17                   ` David S. Miller
2006-01-07  1:09                     ` Andi Kleen
2006-01-07  7:10                       ` David S. Miller
2006-01-07  7:34                       ` Eric Dumazet
2006-01-07  7:44                         ` David S. Miller
2006-01-07  7:53                           ` Eric Dumazet
2006-01-07  8:36                             ` David S. Miller
2006-01-07 20:30                               ` Paul E. McKenney
2006-01-07  8:30                     ` Eric Dumazet
2006-01-06 19:24         ` Lee Revell
2006-01-06  0:46   ` [PATCH 5/6] UFS: inode->i_sem is not released in error path Chris Wright
2006-01-06  0:46   ` [PATCH 6/6] [ATYFB]: Fix onboard video on SPARC Blade 100 for 2.6.{13,14,15} Chris Wright
2006-01-06  0:53   ` [PATCH 0/6] -stable review Chris Wright
