linux-kernel.vger.kernel.org archive mirror
* Re: must-fix list for 2.6.0
       [not found] <20030429155731.07811707.akpm@digeo.com.suse.lists.linux.kernel>
@ 2003-04-30  1:36 ` Andi Kleen
  2003-04-30 18:09   ` Pavel Machek
  0 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2003-04-30  1:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


I found a new bad class of bugs (slowly working on fixing them, also
present in 2.4) 

Machine Check handlers use printk in an NMI like (ignoring cli) situation.
This can deadlock on the console or low level character driver (serial, vga) 
locks. Not all MCEs are fatal (e.g. corrected ECC errors) and the kernel
should be safely able to continue.

Need to buffer the printk in an atomic fashion (e.g. in a ring buffer managed
with cmpxchg) and cause a self IPI that triggers an interrupt after
the next sti. This is easy with x86/APIC mode, but difficult with PIC
(the 8259 supports it in theory, but it's not clear that all clones in various
chipsets do; also changing the programming may be risky). Fallback: pick it 
up with the next timer interrupt by adding a check there.
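
A minimal userspace sketch of the buffering idea, assuming C11 atomics stand
in for cmpxchg. The names mce_log(), mce_pop(), MCE_SLOTS and MCE_MSG_LEN are
hypothetical, not existing kernel interfaces. The handler side only claims a
slot with a compare-and-swap loop and never takes a lock; a later interrupt
(the self IPI or the timer tick) would pop the messages and hand them to
printk.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define MCE_SLOTS   8	/* hypothetical ring size */
#define MCE_MSG_LEN 64	/* hypothetical fixed message size */

struct mce_slot {
	atomic_bool ready;		/* set once text[] is fully written */
	char text[MCE_MSG_LEN];
};

static struct mce_slot mce_ring[MCE_SLOTS];
static atomic_uint mce_head;	/* next slot a producer may claim */
static atomic_uint mce_tail;	/* next slot the consumer will drain */

/* Producer: safe in an NMI-like context, no locks, only a cmpxchg loop. */
static bool mce_log(const char *msg)
{
	unsigned int head = atomic_load(&mce_head);
	struct mce_slot *slot;

	do {
		if (head - atomic_load(&mce_tail) >= MCE_SLOTS)
			return false;	/* ring full: drop rather than spin */
	} while (!atomic_compare_exchange_weak(&mce_head, &head, head + 1));

	slot = &mce_ring[head % MCE_SLOTS];
	strncpy(slot->text, msg, MCE_MSG_LEN - 1);
	slot->text[MCE_MSG_LEN - 1] = '\0';
	atomic_store_explicit(&slot->ready, true, memory_order_release);
	return true;
}

/* Consumer: runs later from a normal interrupt; out holds MCE_MSG_LEN bytes. */
static bool mce_pop(char *out)
{
	unsigned int tail = atomic_load(&mce_tail);
	struct mce_slot *slot = &mce_ring[tail % MCE_SLOTS];

	if (!atomic_load_explicit(&slot->ready, memory_order_acquire))
		return false;	/* empty, or the producer is still writing */

	memcpy(out, slot->text, MCE_MSG_LEN);
	atomic_store(&slot->ready, false);
	atomic_store(&mce_tail, tail + 1);	/* slot may now be reclaimed */
	return true;
}
```

This sketch drops messages when the ring is full, since waiting inside the
handler would bring back the deadlock risk described above.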

New entries for the x86-64 list
(actually I'm not sure they are all x86-64 specific, just that the
bug has been seen there)

- 32bit core dumps do not dump 32bit SSE data currently. They should.
- AT_GID/AT_UID ELF environment vector contains crap currently
This breaks debugging of the shared linker for suid programs because
ld.so always thinks it is suid/not called by root and ignores environment
variables.
- NIS/ypbind breaks with an abort() in glibc. Only happens on 2.5, 2.4 
is fine.
- need /proc/kcore access for kernel mappings that are outside vmalloc
(in particular the kernel and the modules are special mappings on x86-64;
other architectures have the same problem) 
Best would be to put them in the vmalloc mappings list, but that requires
some more fixes in other code that uses it. Also /proc/kcore seems to have
some 64bit signedness bugs (patch for 2.4 exists) 

Generic item: 

- need to share the ioctl 32bit emulation handlers between ports. 
Pavel has a patch, but he's running into difficulties with merging it.

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30  1:36 ` must-fix list for 2.6.0 Andi Kleen
@ 2003-04-30 18:09   ` Pavel Machek
  2003-04-30 18:15     ` Andi Kleen
  0 siblings, 1 reply; 47+ messages in thread
From: Pavel Machek @ 2003-04-30 18:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel

Hi!

> Generic item: 
> 
> - need to share the ioctl 32bit emulation handlers between ports. 
> Pavel has a patch, but he's running into difficulties with merging it.

It's in now.
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 18:09   ` Pavel Machek
@ 2003-04-30 18:15     ` Andi Kleen
  2003-04-30 19:11       ` Pavel Machek
  0 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2003-04-30 18:15 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andi Kleen, Andrew Morton, linux-kernel

On Wed, Apr 30, 2003 at 08:09:22PM +0200, Pavel Machek wrote:
> Hi!
> 
> > Generic item: 
> > 
> > - need to share the ioctl 32bit emulation handlers between ports. 
> > Pavel has a patch, but he's running into difficulties with merging it.
> 
> It's in now.

Yes and nothing compiles anymore because linux/compat_ioctl.h is missing.

And really the table merge is not enough - all the functions need to 
be shared too.

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 18:15     ` Andi Kleen
@ 2003-04-30 19:11       ` Pavel Machek
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2003-04-30 19:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel

Hi!

> > > Generic item: 
> > > 
> > > - need to share the ioctl 32bit emulation handlers between ports. 
> > > Pavel has a patch, but he's running into difficulties with merging it.
> > 
> > It's in now.
> 
> Yes and nothing compiles anymore because linux/compat_ioctl.h is
> missing.

Oops, sorry. Patch is on its way to Linus.

> And really the table merge is not enough - all the functions need to 
> be shared too.

Yes, I know. And it is going to be quite a big task, but table merge
is a good first step.
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-02 16:28         ` Carl-Daniel Hailfinger
@ 2003-05-02 21:14           ` David S. Miller
  0 siblings, 0 replies; 47+ messages in thread
From: David S. Miller @ 2003-05-02 21:14 UTC (permalink / raw)
  To: c-d.hailfinger.kernel.2003; +Cc: fw, linux-kernel, kuznet, hadi, robert.olsson

   From: Carl-Daniel Hailfinger <c-d.hailfinger.kernel.2003@gmx.net>
   Date: Fri, 02 May 2003 18:28:56 +0200

   David S. Miller wrote:
   
   >>Shall I post the exploit?
   > 
   > You can't expect us to act on anything based upon vague references
   > to "dst cache DoS" and things like that.
   
   http://marc.theaimsgroup.com/?l=linux-kernel&m=104956079213417

Yes, that thing.  We're working on a fix.

And, please, don't be silly wrt. exploit posting.  One exists
publicly for some time, just google for juno-z.101f.c  All the
script kiddies know where this thing is and what it does.  So if you
have something better, just post it.

Current patch looks something like this (there will be changes):

--- net/ipv4/route.c.~1~	Thu May  1 06:18:00 2003
+++ net/ipv4/route.c	Thu May  1 10:30:13 2003
@@ -194,19 +194,46 @@ struct rt_hash_bucket {
 static struct rt_hash_bucket 	*rt_hash_table;
 static unsigned			rt_hash_mask;
 static int			rt_hash_log;
+static unsigned int		rt_hash_rnd;
 
 struct rt_cache_stat rt_cache_stat[NR_CPUS];
 
 static int rt_intern_hash(unsigned hash, struct rtable *rth,
 				struct rtable **res);
 
-static __inline__ unsigned rt_hash_code(u32 daddr, u32 saddr, u8 tos)
+/* This bit mixing code is by Bob Jenkins.  He has a great set of documents
+ * about hash function analysis at:
+ *
+ * http://burtleburtle.net/bob/hash/
+ */
+
+#define __mix(a, b, c) \
+{ \
+  a -= b; a -= c; a ^= (c>>13); \
+  b -= c; b -= a; b ^= (a<<8); \
+  c -= a; c -= b; c ^= (b>>13); \
+  a -= b; a -= c; a ^= (c>>12);  \
+  b -= c; b -= a; b ^= (a<<16); \
+  c -= a; c -= b; c ^= (b>>5); \
+  a -= b; a -= c; a ^= (c>>3);  \
+  b -= c; b -= a; b ^= (a<<10); \
+  c -= a; c -= b; c ^= (b>>15); \
+}
+
+static unsigned int rt_hash_code(u32 daddr, u32 saddr, u8 tos)
 {
-	unsigned hash = ((daddr & 0xF0F0F0F0) >> 4) |
-			((daddr & 0x0F0F0F0F) << 4);
-	hash ^= saddr ^ tos;
-	hash ^= (hash >> 16);
-	return (hash ^ (hash >> 8)) & rt_hash_mask;
+	u32 a, b, c;
+
+	a = b = 0x9e3779b9;
+	c = rt_hash_rnd;
+
+	a += daddr;
+	b += saddr;
+	c += (u32) tos;
+
+	__mix(a, b, c);
+
+	return (c & rt_hash_mask);
 }
 
 static int rt_cache_get_info(char *buffer, char **start, off_t offset,
@@ -2461,6 +2488,9 @@ static int ip_rt_acct_read(char *buffer,
 void __init ip_rt_init(void)
 {
 	int i, order, goal;
+
+	rt_hash_rnd = (int) ((num_physpages ^ (num_physpages>>8)) ^
+			     (jiffies ^ (jiffies >> 7)));
 
 #ifdef CONFIG_NET_CLS_ROUTE
 	for (order = 0;
--- net/ipv4/netfilter/ip_conntrack_core.c.~1~	Thu May  1 09:24:17 2003
+++ net/ipv4/netfilter/ip_conntrack_core.c	Thu May  1 16:48:57 2003
@@ -104,20 +104,45 @@ ip_conntrack_put(struct ip_conntrack *ct
 	nf_conntrack_put(&ct->infos[0]);
 }
 
-static inline u_int32_t
+/* This bit mixing code is by Bob Jenkins.  He has a great set of documents
+ * about hash function analysis at:
+ *
+ * http://burtleburtle.net/bob/hash/
+ */
+
+#define __mix(a, b, c) \
+{ \
+  a -= b; a -= c; a ^= (c>>13); \
+  b -= c; b -= a; b ^= (a<<8); \
+  c -= a; c -= b; c ^= (b>>13); \
+  a -= b; a -= c; a ^= (c>>12);  \
+  b -= c; b -= a; b ^= (a<<16); \
+  c -= a; c -= b; c ^= (b>>5); \
+  a -= b; a -= c; a ^= (c>>3);  \
+  b -= c; b -= a; b ^= (a<<10); \
+  c -= a; c -= b; c ^= (b>>15); \
+}
+
+static int ip_conntrack_hash_rnd;
+
+static u_int32_t
 hash_conntrack(const struct ip_conntrack_tuple *tuple)
 {
+	unsigned int a, b, c;
+
 #if 0
 	dump_tuple(tuple);
 #endif
-	/* ntohl because more differences in low bits. */
-	/* To ensure that halves of the same connection don't hash
-	   clash, we add the source per-proto again. */
-	return (ntohl(tuple->src.ip + tuple->dst.ip
-		     + tuple->src.u.all + tuple->dst.u.all
-		     + tuple->dst.protonum)
-		+ ntohs(tuple->src.u.all))
-		% ip_conntrack_htable_size;
+	a = b = 0x9e3779b9;
+	c = ip_conntrack_hash_rnd;
+
+	a += tuple->src.ip;
+	b += (tuple->dst.ip ^ tuple->dst.protonum);
+	c += (tuple->src.u.all | (tuple->dst.u.all << 16));
+
+	__mix(a, b, c);
+
+	return (c % ip_conntrack_htable_size);
 }
 
 inline int
@@ -1411,6 +1436,9 @@ int __init ip_conntrack_init(void)
 {
 	unsigned int i;
 	int ret;
+
+	ip_conntrack_hash_rnd = (int) ((num_physpages ^ (num_physpages>>8)) ^
+				       (jiffies ^ (jiffies >> 7)));
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 256 buckets.  >= 1GB machines have 8192 buckets. */
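
As a rough userspace illustration (not part of the patch), the sketch below
contrasts the removed rt_hash_code() with the seeded Jenkins-mix version. The
names old_rt_hash(), new_rt_hash() and crafted_old_collisions(), and the
explicit mask and rnd parameters, are assumptions made so the code can run
outside the kernel.

```c
#include <stdint.h>

/* Bob Jenkins' mix step from the patch, written as a function. */
static void jenkins_mix(uint32_t *pa, uint32_t *pb, uint32_t *pc)
{
	uint32_t a = *pa, b = *pb, c = *pc;

	a -= b; a -= c; a ^= (c >> 13);
	b -= c; b -= a; b ^= (a << 8);
	c -= a; c -= b; c ^= (b >> 13);
	a -= b; a -= c; a ^= (c >> 12);
	b -= c; b -= a; b ^= (a << 16);
	c -= a; c -= b; c ^= (b >> 5);
	a -= b; a -= c; a ^= (c >> 3);
	b -= c; b -= a; b ^= (a << 10);
	c -= a; c -= b; c ^= (b >> 15);

	*pa = a; *pb = b; *pc = c;
}

/* The removed hash: every input is visible to the sender. */
static uint32_t old_rt_hash(uint32_t daddr, uint32_t saddr, uint8_t tos,
			    uint32_t mask)
{
	uint32_t hash = ((daddr & 0xF0F0F0F0u) >> 4) |
			((daddr & 0x0F0F0F0Fu) << 4);

	hash ^= saddr ^ tos;
	hash ^= (hash >> 16);
	return (hash ^ (hash >> 8)) & mask;
}

/* The replacement: rnd plays the role of rt_hash_rnd. */
static uint32_t new_rt_hash(uint32_t daddr, uint32_t saddr, uint8_t tos,
			    uint32_t rnd, uint32_t mask)
{
	uint32_t a, b, c;

	a = b = 0x9e3779b9u;
	c = rnd;
	a += daddr;
	b += saddr;
	c += (uint32_t) tos;
	jenkins_mix(&a, &b, &c);
	return c & mask;
}

/* Count crafted source addresses that fall into bucket 0 of the old hash. */
static unsigned int crafted_old_collisions(uint32_t daddr, uint32_t mask)
{
	uint32_t swapped = ((daddr & 0xF0F0F0F0u) >> 4) |
			   ((daddr & 0x0F0F0F0Fu) << 4);
	unsigned int k, hits = 0;

	for (k = 1; k < 256; k++) {
		/* x folds to zero in the low 16 bits of the old hash */
		uint32_t x = (k << 24) | (k << 8);

		if (old_rt_hash(daddr, swapped ^ x, 0, mask) == 0)
			hits++;
	}
	return hits;
}
```

With a 16-bit mask, all 255 crafted source addresses land in bucket 0 of the
old hash, so one chain can be made arbitrarily long. With the new hash the
bucket also depends on the seed, which a remote sender is not expected to
know.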

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 12:05       ` David S. Miller
@ 2003-05-02 16:28         ` Carl-Daniel Hailfinger
  2003-05-02 21:14           ` David S. Miller
  0 siblings, 1 reply; 47+ messages in thread
From: Carl-Daniel Hailfinger @ 2003-05-02 16:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: Florian Weimer, linux-kernel, kuznet

David S. Miller wrote:

> On Thu, 2003-05-01 at 04:27, Florian Weimer wrote:
> 
>>"David S. Miller" <davem@redhat.com> writes:
>>
>>>On Tue, 2003-04-29 at 21:55, Florian Weimer wrote:
>>>
>>>>Andrew Morton <akpm@digeo.com> writes:
>>>>
>>>>
>>>>>net/
>>>>>----
>>>>
>>>>What about the dst cache DoS attack?
>>>
>>>Thanks for the lack of detailed description of the problem.
>>>Without it nobody can help you.
>>
>>Shall I post the exploit?
> 
> You can't expect us to act on anything based upon vague references
> to "dst cache DoS" and things like that.

http://marc.theaimsgroup.com/?l=linux-kernel&m=104956079213417


Regards,
Carl-Daniel


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 18:47     ` Christoph Hellwig
@ 2003-05-02  1:57       ` Andreas Boman
  0 siblings, 0 replies; 47+ messages in thread
From: Andreas Boman @ 2003-05-02  1:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel

On Thu, 2003-05-01 at 13:47, Christoph Hellwig wrote:
> On Thu, May 01, 2003 at 11:42:29AM -0700, Andrew Morton wrote:
> > I had two concerns with smalldevfs:
> > 
> > - It's dropping a semaphore (i_sem?) during its synchronous userspace
> >   callout.  That was for deadlock avoidance and may have introduced a race.
> 
> That's a design bug carried over from the old devfs and needs fixing by
> changing the way userspace notification works.
> 
> > - The new userspace doesn't support the compatibility names.  Just some
> >   config file, or a tarball or a dang shell script full of `ln -s'
> >   calls would fix that up, I think.
> 
> Well, that's easily fixable.  Does someone actually have a copy of the
> devfs_helper tarball around?  Adam's seems to have vanished and his
> ftp server is down, too.
> 
I think this is the latest tarball Adam had, though I could be mistaken.
http://users.eiwaz.com/~aboman/files/misc/devfs_helper-0.2.tar.gz

	Andreas



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 18:42   ` Andrew Morton
@ 2003-05-01 18:47     ` Christoph Hellwig
  2003-05-02  1:57       ` Andreas Boman
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2003-05-01 18:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, May 01, 2003 at 11:42:29AM -0700, Andrew Morton wrote:
> I had two concerns with smalldevfs:
> 
> - It's dropping a semaphore (i_sem?) during its synchronous userspace
>   callout.  That was for deadlock avoidance and may have introduced a race.

That's a design bug carried over from the old devfs and needs fixing by
changing the way userspace notification works.

> - The new userspace doesn't support the compatibility names.  Just some
>   config file, or a tarball or a dang shell script full of `ln -s'
>   calls would fix that up, I think.

Well, that's easily fixable.  Does someone actually have a copy of the
devfs_helper tarball around?  Adam's seems to have vanished and his
ftp server is down, too.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 14:33 ` Christoph Hellwig
@ 2003-05-01 18:42   ` Andrew Morton
  2003-05-01 18:47     ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2003-05-01 18:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:
>
> drivers/scsi/
> 
>  - large parts of the locking are hosed or not existent
>       o shost->my_devices isn't locked down at all
>       o the host list is locked but not refcounted, mess can
>          happen when the spinlock is dropped
>       o there are lots of members of struct Scsi_Host/scsi_device/scsi_cmnd
>         with very unclear locking, many of them probably want to become
> 	atomic_t's or bitmaps (for the 1bit bitfields).
>       o there's lots of volatile abuse in the scsi code that needs to
>         be thought about.
>       o there's some global variables incremented without any locks

Thanks.

> fs/devfs/
> 
>  - there's a fundamental lookup vs devfsd race that's only fixable
>    by introducing a lookup vs devfs deadlock.  I can't see how this
>    is fixable without getting rid of the current devfsd design.
>    Mandrake seems to have a workaround for this so this is at least
>    not triggered so easily, but that's not what I'd consider a fix..

Look.  Please.  If you have the time, let's just put it out of its misery.

I had two concerns with smalldevfs:

- It's dropping a semaphore (i_sem?) during its synchronous userspace
  callout.  That was for deadlock avoidance and may have introduced a race.

- The new userspace doesn't support the compatibility names.  Just some
  config file, or a tarball or a dang shell script full of `ln -s'
  calls would fix that up, I think.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:41       ` Robert Love
@ 2003-05-01 16:50         ` Hubertus Franke
  0 siblings, 0 replies; 47+ messages in thread
From: Hubertus Franke @ 2003-05-01 16:50 UTC (permalink / raw)
  To: Robert Love, Rick Lindsley
  Cc: Andrew Morton, Maciej Soltysiak, linux-kernel, frankeh

In good/bad I simply tried to identify what behavior will be observed. 
Let's put OpenOffice to rest as sooner or later folks will get to use the 
corrected version. 

What are the semantics of yield?    (again) 
Two things need to be considered. 
(a) continuation of execution 
(b) length of time slice at continuation 

In the OLD version (a) would be run after all other tasks and (b) default 
timeslice reset. 
In the new version (a) would be on the same level and (b) would be no 
changes to the current timeslice 

To me both seem wrong. If you do the OLD version, one clearly limits the 
forward progress of the task, namely by revoking its current timeslice and 
deferring continuation for some time. 
The new version does exactly the opposite. Defer execution for a short 
time, but don't revoke any timeslice.

So it boils down to the common use of sched_yield(). 
(i) create some interactivity and allow all others to run based on some 
app knowledge. 
(ii) yielding based on locking and the potential lock hold time. 

If (i) then the old version seems better. 
If (ii) then it depends on the lock hold time and arrival rate and the 
optimism that one wants to assume. 


The conservative method is to move to expired, however the lock might have 
been reacquired. 
The aggressive method is to move the task one slot down in the active 
queue.  The NEW method is somewhat in between but on the aggressive side

Dropping the effective priority seems a reasonable medium ground, because 
moving to the expired list is simply a bit worse than dropping the effective 
priority to the lowest level. So why not drop the 
effective priority based on sched_yield invocation frequency or recency. 
This is a gradual step towards the "harsh" solution to move to the expired, 
but at the same time avoids potential cpu hogging. 
I am afraid there are simply contradictory situations and it will be 
difficult to serve both perfectly. 
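
One way to read the gradual approach, sketched as a toy model: each
sched_yield() call lowers the task's effective priority by one level, and
only a task already at the lowest level is moved to the expired list. The
struct, the constant TOY_LOWEST_PRIO and the function toy_yield() are
hypothetical and are not kernel code. The per-call drop is the simplest
variant; a frequency or recency based drop would need extra bookkeeping.
Real-time tasks keep their priority, as noted in the quoted mail below.

```c
#include <stdbool.h>

/* Hypothetical scale: a larger number means a lower priority. */
#define TOY_LOWEST_PRIO 39

struct toy_task {
	int eff_prio;	/* effective (dynamic) priority */
	bool realtime;	/* real-time tasks must keep their priority */
	bool expired;	/* true once moved to the expired list */
};

/* Model of one sched_yield() call under the gradual policy. */
static void toy_yield(struct toy_task *t)
{
	if (t->realtime)
		return;		/* only requeued, priority untouched */

	if (t->eff_prio < TOY_LOWEST_PRIO) {
		t->eff_prio++;	/* mild penalty: one level down */
		return;
	}

	t->expired = true;	/* repeat offender: the "harsh" case */
}
```

A task that yields once loses a single level, while a task spinning on
sched_yield() reaches the expired list after a bounded number of calls.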

Hubertus Franke,   IBM Research
email: frankeh@us.ibm.com



On Wednesday 30 April 2003 19:41, Robert Love wrote:
> On Wed, 2003-04-30 at 19:11, Rick Lindsley wrote:
> >    OLD: when sched_yield() is called the task moves to expired,
> > 	every other task in the active queue will run first before the
> > 	yielding task will run again.
>
> I really think this is the right way.
>
> >    NEW: move the yielding task to the end of its current priority level,
> > 	but keeps it active not expired.
>
> This takes us back to the problem we saw in earlier sched_yield()
> implementations.  A group of yielding threads just round-robin between
> themselves, yielding over and over.  Worse, even a single task alone in
> a priority level will show up as a CPU hog if it keeps calling
> sched_yield() in a loop.
>
> It goes on.  Assume we have two runnable tasks, one that does whatever
> it wants (hopefully something useful), and the other which does:
>
> 	while(1)
> 		sched_yield();
>
> With the current sched_yield(), the second will receive much less
> processor time than the first (nearly none vs. most of the processor).
> With the sched_yield() mentioned above, they will receive identical
> amounts of processor time.  That does not seem sane to me.
>
> I think it is important that sched_yield() give processor time to all
> tasks, and not just between multiple yielding tasks.
>
> The current implementation does this.  If an application (*cough* Open
> Office *cough*) calls sched_yield() over and over, what does it expect?
>
> Now that we have futexes, sched_yield() no longer needs to be used as a
> poor replacement for blocking, and it can have sane semantics, such as
> _really_ yielding the processor.
>
> > 	What else could be done?
> > 	(a) drop the effective priority of the yielding task by a percentile,
> > 	    but don't reduce the time slice!
>
> This works, too.  We used to do this..
>
> There are a couple bits that need to be added, though, to deal with
> threads that call sched_yield() over and over (which are the ones where
> we have problems).  We need to drop the task a priority level every time
> it calls sched_yield().  Eventually it will reach the lowest priority
> (or some earlier threshold we want to check for) and then we need to put
> it on the expired list, like the current behavior.
>
> So for the big offenders, I think this ends up being the same, no?
>
> Also, this approach does not work for real-time tasks, for whom we must
> not change their priority... so we end up just requeuing them, too.
>
> Just my thoughts...
>
> 	Robert Love


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  6:27             ` Christoph Hellwig
  2003-05-01  5:49               ` Shawn
@ 2003-05-01 15:44               ` Robert Love
  1 sibling, 0 replies; 47+ messages in thread
From: Robert Love @ 2003-05-01 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, viro, ricklind, solt, linux-kernel, frankeh

On Thu, 2003-05-01 at 02:27, Christoph Hellwig wrote:

> No, they're doing it themselves.  The RedHat OO package has a patch to
> fix this mess (and two dozen other patches to work around OO braindamage..)

Right, Open Office is its own problem.

But LinuxThreads uses sched_yield() to do synchronization (yuck), since
it lacked something like futexes at the time.

	Robert Love


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-29 22:57 Andrew Morton
                   ` (3 preceding siblings ...)
  2003-04-30 10:16 ` Maciej Soltysiak
@ 2003-05-01 14:33 ` Christoph Hellwig
  2003-05-01 18:42   ` Andrew Morton
  4 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2003-05-01 14:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

drivers/scsi/

 - large parts of the locking are hosed or not existent
      o shost->my_devices isn't locked down at all
      o the host list is locked but not refcounted, mess can
         happen when the spinlock is dropped
      o there are lots of members of struct Scsi_Host/scsi_device/scsi_cmnd
        with very unclear locking, many of them probably want to become
	atomic_t's or bitmaps (for the 1bit bitfields).
      o there's lots of volatile abuse in the scsi code that needs to
        be thought about.
      o there's some global variables incremented without any locks

fs/devfs/

 - there's a fundamental lookup vs devfsd race that's only fixable
   by introducing a lookup vs devfs deadlock.  I can't see how this
   is fixable without getting rid of the current devfsd design.
   Mandrake seems to have a workaround for this so this is at least
   not triggered so easily, but that's not what I'd consider a fix..

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 11:27     ` Florian Weimer
  2003-05-01 12:03       ` David S. Miller
@ 2003-05-01 12:05       ` David S. Miller
  2003-05-02 16:28         ` Carl-Daniel Hailfinger
  1 sibling, 1 reply; 47+ messages in thread
From: David S. Miller @ 2003-05-01 12:05 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-kernel, kuznet

On Thu, 2003-05-01 at 04:27, Florian Weimer wrote:
> "David S. Miller" <davem@redhat.com> writes:
> 
> > On Tue, 2003-04-29 at 21:55, Florian Weimer wrote:
> >> Andrew Morton <akpm@digeo.com> writes:
> >> 
> >> > net/
> >> > ----
> >> 
> >> What about the dst cache DoS attack?
> >
> > Thanks for the lack of detailed description of the problem.
> > Without it nobody can help you.
> 
> Shall I post the exploit?

Don't let me stop you.

You can't expect us to act on anything based upon vague references
to "dst cache DoS" and things like that.

I also would appreciate it if you'd actually at least add the
networking maintainers to the CC: list when asking/discussing
such problems.  Bringing it up on places like linux-net and
netdev@oss.sgi.com would be a good idea too.

Random blather on linux-kernel tends to get ignored.

-- 
David S. Miller <davem@redhat.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 11:27     ` Florian Weimer
@ 2003-05-01 12:03       ` David S. Miller
  2003-05-01 12:05       ` David S. Miller
  1 sibling, 0 replies; 47+ messages in thread
From: David S. Miller @ 2003-05-01 12:03 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-kernel

On Thu, 2003-05-01 at 04:27, Florian Weimer wrote:
> "David S. Miller" <davem@redhat.com> writes:
> 
> > On Tue, 2003-04-29 at 21:55, Florian Weimer wrote:
> >> Andrew Morton <akpm@digeo.com> writes:
> >> 
> >> > net/
> >> > ----
> >> 
> >> What about the dst cache DoS attack?
> >
> > Thanks for the lack of detailed description of the problem.
> > Without it nobody can help you.
> 
> Shall I post the exploit?


-- 
David S. Miller <davem@redhat.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  5:40               ` Dave Hansen
@ 2003-05-01 11:57                 ` Bill Huey
  0 siblings, 0 replies; 47+ messages in thread
From: Bill Huey @ 2003-05-01 11:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Gerrit Huizenga, Robert Love, viro, Andrew Morton, Rick Lindsley,
	solt, linux-kernel, frankeh, Bill Huey (Hui)

On Wed, Apr 30, 2003 at 10:40:22PM -0700, Dave Hansen wrote:
> Gerrit Huizenga wrote:
> > Which affects JVM in most cases.  NPTL based JVMs will possibly
> > obviate that problem.  My guess is that in the JVM case, they have
> > a bad locking model (er, a simpler 2-tier locking model instead of
> > a more correct and complex 3-tier locking model) for their threading
> > operations.  As a result, they use either sched_yield() or used
> > to use pause() to relinquish the processor so the world could change
> > and they could acquire the locks they wanted.
> 
> The JVM's extensive use of sched_yield(), plus the HT scheduler causes
> some pretty undesirable behaviour in SPECjbb(tm) (see disclaimer).  It
> starves some pieces of the benchmark so badly, that the benchmark
> results are invalid.  We also start to get tons of idle time as the load
> goes up.

Have the Blackdown folks fix that. The Solaris Threads implementation
suppresses the actual call to a yield in the HotSpot VM if it gets too
many of them bunched together in short period of time. It's really a problem
not with the JVM itself, but the Linux implementation of their threading
glue logic... Make'm fix it. :)

I've heard that a number of folks in Blackdown want to try out the new
threading model, so this might be a good opportunity to do that... add
special thread suspension support, etc...

:)

bill


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 11:24   ` David S. Miller
@ 2003-05-01 11:27     ` Florian Weimer
  2003-05-01 12:03       ` David S. Miller
  2003-05-01 12:05       ` David S. Miller
  0 siblings, 2 replies; 47+ messages in thread
From: Florian Weimer @ 2003-05-01 11:27 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

"David S. Miller" <davem@redhat.com> writes:

> On Tue, 2003-04-29 at 21:55, Florian Weimer wrote:
>> Andrew Morton <akpm@digeo.com> writes:
>> 
>> > net/
>> > ----
>> 
>> What about the dst cache DoS attack?
>
> Thanks for the lack of detailed description of the problem.
> Without it nobody can help you.

Shall I post the exploit?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30  4:55 ` Florian Weimer
  2003-04-30 10:18   ` Maciej Soltysiak
@ 2003-05-01 11:24   ` David S. Miller
  2003-05-01 11:27     ` Florian Weimer
  1 sibling, 1 reply; 47+ messages in thread
From: David S. Miller @ 2003-05-01 11:24 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-kernel

On Tue, 2003-04-29 at 21:55, Florian Weimer wrote:
> Andrew Morton <akpm@digeo.com> writes:
> 
> > net/
> > ----
> 
> What about the dst cache DoS attack?

Thanks for the lack of detailed description of the problem.
Without it nobody can help you.

-- 
David S. Miller <davem@redhat.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  5:49               ` Shawn
@ 2003-05-01 10:14                 ` Christoph Hellwig
  2003-05-01  9:47                   ` Alan Cox
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2003-05-01 10:14 UTC (permalink / raw)
  To: Shawn; +Cc: Andrew Morton, viro, ricklind, solt, linux-kernel, frankeh

On Thu, May 01, 2003 at 01:49:56AM -0400, Shawn wrote:
> This is a very useful conversation for the OO guys themselves to hear
> about. Anyone care to make them aware of the will of the kernel gods?

If redhat fixed it I assume they sent it upstream.  OTOH if you read
the RH bashing on the OO lists it's another question whether it was
applied.. :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01 10:14                 ` Christoph Hellwig
@ 2003-05-01  9:47                   ` Alan Cox
  0 siblings, 0 replies; 47+ messages in thread
From: Alan Cox @ 2003-05-01  9:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shawn, Andrew Morton, viro, ricklind, solt,
	Linux Kernel Mailing List, frankeh

On Iau, 2003-05-01 at 11:14, Christoph Hellwig wrote:
> On Thu, May 01, 2003 at 01:49:56AM -0400, Shawn wrote:
> > This is a very useful conversation for the OO guys themselves to hear
> > about. Anyone care to make them aware of the will of the kernel gods?
> 
> If redhat fixed it I assume they sent it upstream.  OTOH if you read
> the RH bashing on the OO lists it's another question whether it was
> applied.. :)

OpenOffice can't even get its website working with ECN even though
nowadays it's an IETF standard. That probably means they can't even
exchange email with some of the kernel developers let alone we hope
they are listening.

I am sure given reason they can be persuaded to fix their sched_yield 
stuff however.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  8:42           ` Andrew Morton
@ 2003-05-01  8:47             ` Arjan van de Ven
  0 siblings, 0 replies; 47+ messages in thread
From: Arjan van de Ven @ 2003-05-01  8:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: arjanv, ricklind, solt, linux-kernel, frankeh

On Thu, May 01, 2003 at 01:42:12AM -0700, Andrew Morton wrote:
> Arjan van de Ven <arjanv@redhat.com> wrote:
> >
> > Nuking a kernel feature
> > (basically making sched_yield() more posix compliant) for ONE
> > broken-since-fixed app doesn't sound like a good plan to me.
> 
> You're promising there are no others?

I'm saying that about half the others will expect the new (posix) behavior
and half will expect the old linux behavior of yielding only 1 spot.
Whatever you do you can't win for everything; and the vast majority of the
apps out there will not even call this function themselves. Ever.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  8:36         ` Arjan van de Ven
@ 2003-05-01  8:42           ` Andrew Morton
  2003-05-01  8:47             ` Arjan van de Ven
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2003-05-01  8:42 UTC (permalink / raw)
  To: arjanv; +Cc: ricklind, solt, linux-kernel, frankeh

Arjan van de Ven <arjanv@redhat.com> wrote:
>
> Nuking a kernel feature
> (basically making sched_yield() more posix compliant) for ONE
> broken-since-fixed app doesn't sound like a good plan to me.

You're promising there are no others?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:21       ` Andrew Morton
  2003-04-30 23:47         ` viro
  2003-04-30 23:53         ` Robert Love
@ 2003-05-01  8:36         ` Arjan van de Ven
  2003-05-01  8:42           ` Andrew Morton
  2 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2003-05-01  8:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rick Lindsley, solt, linux-kernel, frankeh

On Thu, 2003-05-01 at 01:21, Andrew Morton wrote:
> Rick Lindsley <ricklind@us.ibm.com> wrote:
> >
> > 	Why is this bad?
> > 	(a) if it does busy looping through sched_yield it will eat cycles which
> > 	    might not have happened
> 
> Things like OpenOffice _do_ busy loop on sched_yield().  It appears with
> that patch, OO will sit there chewing ~1% of CPU.  Not great, but not bad
> either..
> 
> A few kernels ago, OpenOffice would take sixty seconds to just flop down a
> menu if there was a kernel build happening at the same time.  That is just
> utterly broken, so if we're going to leave the sched.c code as-is then we
> *require* that all applications be updated to not spin on sched_yield.
> 
> There's just no question about that.  It may end up not being acceptable.

actually this is an ooffice bug and is since fixed..... newer ooffice
versions don't have this behavior anymore. Nuking a kernel feature
(basically making sched_yield() more posix compliant) for ONE
broken-since-fixed app doesn't sound like a good plan to me.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:59           ` Andrew Morton
@ 2003-05-01  6:27             ` Christoph Hellwig
  2003-05-01  5:49               ` Shawn
  2003-05-01 15:44               ` Robert Love
  0 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2003-05-01  6:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: viro, ricklind, solt, linux-kernel, frankeh

On Wed, Apr 30, 2003 at 04:59:14PM -0700, Andrew Morton wrote:
> I think it's happening down inside the old linuxthreads library.  No idea
> who, what, where or why.

No, they're doing it themselves.  The RedHat OO package has a patch to
fix this mess (and two dozen other patches to work around OO braindamage..)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  3:21     ` Mike Galbraith
@ 2003-05-01  6:26       ` Mike Galbraith
  0 siblings, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-01  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Maciej Soltysiak, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 864 bytes --]

At 05:21 AM 5/1/2003 +0200, Mike Galbraith wrote:
>At 12:11 PM 4/30/2003 -0700, Andrew Morton wrote:
>>Maciej Soltysiak <solt@dns.toxicfilms.tv> wrote:
>> >
>> >
>> > Also there is one issue, i am not sure if this may be a kernel issue,
>> > but with setiathome running in a X desktop environment all apps work fine,
>> > but when i run openoffice, openoffice responds with 5 second delay.
>>
>>That'll be the changed sched_yield() semantics.
>>
>>The below patch should fix that up, but we need to decide whether the (rather
>>unclear) advantages of the sched_yield() change outweigh the breakage which
>>it caused linuxthreads applications.

<snip>

Anyway, attached is a patchlet that works for me if anyone wants to try 
it.  I removed printk's, and whatnot but it'll have some offsets because of 
other butchery in my X-para-mental tree ;-)

         -Mike   

[-- Attachment #2: yield.diff --]
[-- Type: application/octet-stream, Size: 1541 bytes --]

--- kernel/sched.c.org	Fri Apr 25 06:24:34 2003
+++ kernel/sched.c	Thu May  1 07:38:00 2003
@@ -1265,7 +1323,7 @@
 	runqueue_t *rq;
 	prio_array_t *array;
 	struct list_head *queue;
-	int idx;
+	int idx = 0;
 
 	/*
 	 * Test if we are atomic.  Since do_exit() needs to call into
@@ -1330,7 +1388,19 @@
 		rq->expired_timestamp = 0;
 	}
 
-	idx = sched_find_first_bit(array->bitmap);
+	if (!idx || idx >= MAX_PRIO)
+		idx = sched_find_first_bit(array->bitmap);
+	else {
+		idx = find_next_bit(array->bitmap, MAX_PRIO, idx + 1);
+		if (idx >= MAX_PRIO) {
+			idx = 0;
+			spin_unlock_irq(&rq->lock);
+			reacquire_kernel_lock(current);
+			preempt_enable_no_resched();
+			goto need_resched;
+		}
+	}
+
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
 
@@ -1984,19 +2054,12 @@
 	prio_array_t *array = current->array;
 
 	/*
-	 * We implement yielding by moving the task into the expired
-	 * queue.
-	 *
-	 * (special rule: RT tasks will just roundrobin in the active
-	 *  array.)
+	 * We implement yielding by moving the task to the back of
+	 * the queue.
 	 */
-	if (likely(!rt_task(current))) {
-		dequeue_task(current, array);
-		enqueue_task(current, rq->expired);
-	} else {
-		list_del(&current->run_list);
-		list_add_tail(&current->run_list, array->queue + current->prio);
-	}
+	list_del(&current->run_list);
+	list_add_tail(&current->run_list, array->queue + current->prio);
+	set_tsk_need_resched(current);
 	/*
 	 * Since we are going to call schedule() anyway, there's
 	 * no need to preempt:

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  6:27             ` Christoph Hellwig
@ 2003-05-01  5:49               ` Shawn
  2003-05-01 10:14                 ` Christoph Hellwig
  2003-05-01 15:44               ` Robert Love
  1 sibling, 1 reply; 47+ messages in thread
From: Shawn @ 2003-05-01  5:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, viro, ricklind, solt, linux-kernel, frankeh

This is a very useful conversation for the OO guys themselves to hear
about. Anyone care to make them aware of the will of the kernel gods?

On Thu, 2003-05-01 at 02:27, Christoph Hellwig wrote:
> On Wed, Apr 30, 2003 at 04:59:14PM -0700, Andrew Morton wrote:
> > I think it's happening down inside the old linuxthreads library.  No idea
> > who, what, where or why.
> 
> No, they're doing it themselves.  The RedHat OO package has a patch to
> fix this mess (and two dozen other patches to work around OO braindamage..)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  0:51             ` Gerrit Huizenga
@ 2003-05-01  5:40               ` Dave Hansen
  2003-05-01 11:57                 ` Bill Huey
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2003-05-01  5:40 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Robert Love, viro, Andrew Morton, Rick Lindsley, solt,
	linux-kernel, frankeh

Gerrit Huizenga wrote:
> Which affects JVM in most cases.  NPTL based JVMs will possibly
> obviate that problem.  My guess is that in the JVM case, they have
> a bad locking model (er, a simpler 2-tier locking model instead of
> a more correct and complex 3-tier locking model) for their threading
> operations.  As a result, they use either sched_yield() or used
> to use pause() to relinquish the processor so the world could change
> and they could acquire the locks they wanted.

The JVM's extensive use of sched_yield(), plus the HT scheduler causes
some pretty undesirable behaviour in SPECjbb(tm) (see disclaimer).  It
starves some pieces of the benchmark so badly, that the benchmark
results are invalid.  We also start to get tons of idle time as the load
goes up.

In case anybody is curious, we're trying to share more of the data that
we collect when we run the benchmarks.  Most of it is useless, but
someone might find a gem or two.  Here are two runs, one with HT, and
the other without.  There's also a pretty busy gnuplot graph in there:
http://www.sr71.net/prof/jbb/elm3a2/

The benchmark results can be found in:
<run-name>/benchmark/SPECjbb.*

Disclaimer:
SPEC (tm) and the benchmark name SPECjbb (tm) are registered trademarks
of the Standard Performance Evaluation Corporation. The benchmarking was
conducted for research purposes only and was non-compliant with the
following deviations from the rules:

  1. It was run on hardware that does not meet the SPEC
  availability-to-the public criteria. The machine was an
  engineering sample.

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30  4:45   ` Andrew Morton
  2003-04-30  4:52     ` William Lee Irwin III
@ 2003-05-01  4:32     ` William Lee Irwin III
  1 sibling, 0 replies; 47+ messages in thread
From: William Lee Irwin III @ 2003-05-01  4:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>> With that in addition to the OOM killer locking patch I posted and
>> another to completely eliminate mm-less processes from consideration
>> 64GB ia32 (with, of course, my oversized out-of-tree patch) recovers
>> from OOM instead of deadlocking after a mass-killing with swap online.

On Tue, Apr 29, 2003 at 09:45:37PM -0700, Andrew Morton wrote:
> Wanna send patch?

out_of_memory() needs to check to be sure ZONE_NORMAL isn't exhausted.
In order to avoid potential regressions on non-highmem systems, make
the check (#ifdef-lessly) conditional on highmem support.

vs. current bk (which has the locking for oom's operational variables)

diff -prauN linux-2.5.68-9/mm/oom_kill.c oom-2.5.68-2/mm/oom_kill.c
--- linux-2.5.68-9/mm/oom_kill.c        Wed Apr 23 03:15:53 2003
+++ oom-2.5.68-2/mm/oom_kill.c  Wed Apr 30 21:06:48 2003
@@ -16,6 +16,7 @@
  */
 
 #include <linux/mm.h>
+#include <linux/bootmem.h> /* for max_pfn and max_low_pfn */
 #include <linux/sched.h>
 #include <linux/swap.h>
 #include <linux/timex.h>
@@ -217,9 +218,10 @@
 	unsigned long now, since;
 
 	/*
-	 * Enough swap space left?  Not OOM.
+	 * Enough swap space and ZONE_NORMAL left? Not OOM.
 	 */
-	if (nr_swap_pages > 0)
+	if (nr_swap_pages > 0 &&
+			(max_pfn == max_low_pfn || nr_free_buffer_pages() > 0))
 		return;
 
 	spin_lock(&oom_lock);
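
Read as a predicate, the new test says "not OOM" only when swap remains
and, on highmem configurations, lowmem is not exhausted.  A minimal
user-space sketch of that truth table follows.  The function name
enough_memory_left() and its plain-integer parameters are stand-ins
invented for illustration, not kernel interfaces; in the patch the values
come from nr_swap_pages, max_pfn, max_low_pfn and nr_free_buffer_pages().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the early-return test in out_of_memory().
 * All inputs are passed in explicitly so the logic can be checked in
 * user space; in the kernel they are globals or function results.
 */
static bool enough_memory_left(long nr_swap_pages,
                               unsigned long max_pfn,
                               unsigned long max_low_pfn,
                               unsigned long free_buffer_pages)
{
    /* The lowmem limit can never lie beyond the last page frame. */
    assert(max_low_pfn <= max_pfn);

    /* max_pfn == max_low_pfn means the machine has no highmem at all. */
    bool no_highmem = (max_pfn == max_low_pfn);

    /* Swap left AND (no highmem, or some lowmem still free): not OOM. */
    return nr_swap_pages > 0 && (no_highmem || free_buffer_pages > 0);
}
```

On a non-highmem box the second operand is always true, so the result
depends on swap alone, exactly as before the patch.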

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 19:11   ` Andrew Morton
  2003-04-30 23:11     ` Rick Lindsley
@ 2003-05-01  3:21     ` Mike Galbraith
  2003-05-01  6:26       ` Mike Galbraith
  1 sibling, 1 reply; 47+ messages in thread
From: Mike Galbraith @ 2003-05-01  3:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Maciej Soltysiak, linux-kernel

At 12:11 PM 4/30/2003 -0700, Andrew Morton wrote:
>Maciej Soltysiak <solt@dns.toxicfilms.tv> wrote:
> >
> >
> > Also there is one issue, i am not sure if this may be a kernel issue,
> > but with setiathome running in a X desktop environment all apps work fine,
> > but when i run openoffice, openoffice responds with 5 second delay.
>
>That'll be the changed sched_yield() semantics.
>
>The below patch should fix that up, but we need to decide whether the (rather
>unclear) advantages of the sched_yield() change outweigh the breakage which
>it caused linuxthreads applications.
>
>
>diff -puN kernel/sched.c~sched_yield-hack kernel/sched.c
>--- 25/kernel/sched.c~sched_yield-hack  2003-04-30 12:08:51.000000000 -0700
>+++ 25-akpm/kernel/sched.c      2003-04-30 12:09:11.000000000 -0700
>@@ -1992,7 +1992,7 @@ asmlinkage long sys_sched_yield(void)
>         */
>         if (likely(!rt_task(current))) {
>                 dequeue_task(current, array);
>-               enqueue_task(current, rq->expired);
>+               enqueue_task(current, array);
>         } else {
>                 list_del(&current->run_list);
>                 list_add_tail(&current->run_list, array->queue + 
> current->prio);
>
>_

That won't work, because the scheduler will keep re-selecting the yielding 
task if it's interactive.  I tried this yesterday.  (besides, don't you 
need to set_tsk_need_resched(current) there?)  An easy way to see it not 
work as expected, is to change CHILD_PENALTY to 99, and add 
current->sleep_avg=MAX_SLEEP_AVG  to sched_init() before 
wake_up_forked_process().  Your next boot will 100% guaranteed hang while 
starting ksoftirqd until the parent gets expired.  You'll read "POSIX 
conformance testing by UNIFIX" until then :)

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-05-01  0:09           ` Robert Love
@ 2003-05-01  0:51             ` Gerrit Huizenga
  2003-05-01  5:40               ` Dave Hansen
  0 siblings, 1 reply; 47+ messages in thread
From: Gerrit Huizenga @ 2003-05-01  0:51 UTC (permalink / raw)
  To: Robert Love
  Cc: viro, Andrew Morton, Rick Lindsley, solt, linux-kernel, frankeh

On 30 Apr 2003 20:09:13 EDT, Robert Love wrote:
> On Wed, 2003-04-30 at 19:47, viro@parcelfarce.linux.theplanet.co.uk
> wrote:
> 
> > Excuse me, but WTF do they spin on the sched_yield() in the first place?
> > _That_ sounds like utterly broken...
> 
> I agree it is broken, but it was considered a method of implementing
> user-space locking for a long time..
> 
> The problem is in LinuxThreads mostly, I guess, according to Andrew.

Which affects JVM in most cases.  NPTL based JVMs will possibly
obviate that problem.  My guess is that in the JVM case, they have
a bad locking model (er, a simpler 2-tier locking model instead of
a more correct and complex 3-tier locking model) for their threading
operations.  As a result, they use either sched_yield() or used
to use pause() to relinquish the processor so the world could change
and they could acquire the locks they wanted.

Sounds stupid, but that was the most obvious linuxthreads implementation.
Futexes are also likely to help here, btw...
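
For readers who have not seen the pattern under discussion: a lock that
"spins on sched_yield()" looks roughly like the sketch below.  This is an
illustration only, not code taken from LinuxThreads or any JVM; the
yield_lock type and its function names are made up here, and it uses C11
atomics for brevity.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space lock; initialise with { ATOMIC_FLAG_INIT }. */
struct yield_lock {
    atomic_flag held;
};

/* One attempt: returns true if the caller now owns the lock. */
static bool yield_lock_try(struct yield_lock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* The criticised pattern: never block, just yield and try again. */
static void yield_lock_acquire(struct yield_lock *l)
{
    while (!yield_lock_try(l))
        sched_yield();
}

static void yield_lock_release(struct yield_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Nothing here ever sleeps in the kernel: a waiter stays runnable and keeps
re-testing the flag, so how soon it gets the lock depends entirely on where
sched_yield() puts it in the run queue.  A futex-based lock would instead
sleep until the holder wakes it.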

Of course, all of this is my own hearsay, so if anyone has better
details, feel free to add them.

gerrit

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:47         ` viro
  2003-04-30 23:59           ` Andrew Morton
@ 2003-05-01  0:09           ` Robert Love
  2003-05-01  0:51             ` Gerrit Huizenga
  1 sibling, 1 reply; 47+ messages in thread
From: Robert Love @ 2003-05-01  0:09 UTC (permalink / raw)
  To: viro; +Cc: Andrew Morton, Rick Lindsley, solt, linux-kernel, frankeh

On Wed, 2003-04-30 at 19:47, viro@parcelfarce.linux.theplanet.co.uk
wrote:

> Excuse me, but WTF do they spin on the sched_yield() in the first place?
> _That_ sounds like utterly broken...

I agree it is broken, but it was considered a method of implementing
user-space locking for a long time..

The problem is in LinuxThreads mostly, I guess, according to Andrew.

But the big offender we hear about ten times a day is Open Office, which
calls sched_yield() after a lot of GUI operations, seemingly in the name
of interactivity.  It is busted and Red Hat shipped a yield-less Open
Office in RH9 which works fine.

	Robert Love


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:47         ` viro
@ 2003-04-30 23:59           ` Andrew Morton
  2003-05-01  6:27             ` Christoph Hellwig
  2003-05-01  0:09           ` Robert Love
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2003-04-30 23:59 UTC (permalink / raw)
  To: viro; +Cc: ricklind, solt, linux-kernel, frankeh

viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> On Wed, Apr 30, 2003 at 04:21:08PM -0700, Andrew Morton wrote:
> > menu if there was a kernel build happening at the same time.  That is just
> > utterly broken, so if we're going to leave the sched.c code as-is then we
> > *require* that all applications be updated to not spin on sched_yield.
> 
> Excuse me, but WTF do they spin on the sched_yield() in the first place?
> _That_ sounds like utterly broken...

I think it's happening down inside the old linuxthreads library.  No idea
who, what, where or why.

There are quite a few places in the kernel which do it, too.  Usually when
waiting for memory to come free.  These are being gradually removed, in
favour of blk_congestion_wait() calls.

That leaves behind the very performance-critical sched_yield() in ext3
transaction batching.  That was designed to allow other processes to join a
transaction before the calling one closes the transaction.  With the new
yield() it was causing horrid starvation and was lamely replaced with a
schedule().  It needs to be resurrected for real, but I'm not sure how. 
Probably just a sleep(0.01).


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:21       ` Andrew Morton
  2003-04-30 23:47         ` viro
@ 2003-04-30 23:53         ` Robert Love
  2003-05-01  8:36         ` Arjan van de Ven
  2 siblings, 0 replies; 47+ messages in thread
From: Robert Love @ 2003-04-30 23:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rick Lindsley, solt, linux-kernel, frankeh

On Wed, 2003-04-30 at 19:21, Andrew Morton wrote:

> A few kernels ago, OpenOffice would take sixty seconds to just flop down a
> menu if there was a kernel build happening at the same time.  That is just
> utterly broken, so if we're going to leave the sched.c code as-is then we
> *require* that all applications be updated to not spin on sched_yield.

Just as a note (I know it's not an excuse), Red Hat 9 has Open Office
with the dumb sched_yield() calls removed.  It runs quite nice.

> Has anyone looked at what Andrea did in -aa?  I assume some suitable
> compromise was achieved there.

Well, his base O(1) scheduler does not have 2.5's sched_yield()... but
he has a patch (I guess that he wrote) on top which changes the
semantics a bit.  It looks like he drops the task one priority level
each call, but if it is ever to-be-moved to a queue all by its lonesome,
the task is put on the expired array instead.

Also, he has a check at the start that, if it is in a queue all by
itself (even before it is moved) the call just returns.

Not sure what all these changes add up to...

	Robert Love


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:21       ` Andrew Morton
@ 2003-04-30 23:47         ` viro
  2003-04-30 23:59           ` Andrew Morton
  2003-05-01  0:09           ` Robert Love
  2003-04-30 23:53         ` Robert Love
  2003-05-01  8:36         ` Arjan van de Ven
  2 siblings, 2 replies; 47+ messages in thread
From: viro @ 2003-04-30 23:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rick Lindsley, solt, linux-kernel, frankeh

On Wed, Apr 30, 2003 at 04:21:08PM -0700, Andrew Morton wrote:
> menu if there was a kernel build happening at the same time.  That is just
> utterly broken, so if we're going to leave the sched.c code as-is then we
> *require* that all applications be updated to not spin on sched_yield.

Excuse me, but WTF do they spin on the sched_yield() in the first place?
_That_ sounds like utterly broken...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:11     ` Rick Lindsley
  2003-04-30 23:21       ` Andrew Morton
@ 2003-04-30 23:41       ` Robert Love
  2003-05-01 16:50         ` Hubertus Franke
  1 sibling, 1 reply; 47+ messages in thread
From: Robert Love @ 2003-04-30 23:41 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: Andrew Morton, Maciej Soltysiak, linux-kernel, frankeh

On Wed, 2003-04-30 at 19:11, Rick Lindsley wrote:

>    OLD: when sched_yield() is called the task moves to expired,
> 	every other task in the active queue will run first before the
> 	yielding task will run again.

I really think this is the right way.

>    NEW: move the yielding task to the end of its current priority level,
> 	but keeps it active not expired.

This takes us back to the problem we saw in earlier sched_yield()
implementations.  A group of yielding threads just round-robin between
themselves, yielding over and over.  Worse, even a single task alone in
a priority level will show up as a CPU hog if it keeps calling
sched_yield() in a loop.

It goes on.  Assume we have two runnable tasks, one that does whatever
it wants (hopefully something useful), and the other which does:

	while(1)
		sched_yield();

With the current sched_yield(), the second will receive much less
processor time than the first (nearly none vs. most of the processor). 
With the sched_yield() mentioned above, they will receive identical
amounts of processor time.  That does not seem sane to me.
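
The two-task case can be checked with a toy model.  The sketch below is a
user-space simulation, not kernel code: one priority level, an active and
an expired queue, a fixed 100-tick timeslice, task 0 yielding after every
tick and task 1 CPU-bound.  All names (simulate, YIELD_TO_EXPIRED,
YIELD_TO_TAIL) and the assumption that a yield costs one whole tick are
invented for illustration.

```c
#include <assert.h>
#include <string.h>

enum yield_policy { YIELD_TO_EXPIRED, YIELD_TO_TAIL };

#define NTASKS    2
#define TIMESLICE 100   /* ticks per timeslice (arbitrary assumption) */

struct queue {
    int task[NTASKS];
    int len;
};

static void push_tail(struct queue *q, int t)
{
    assert(q->len < NTASKS);
    q->task[q->len++] = t;
}

static void pop_head(struct queue *q)
{
    assert(q->len > 0);
    q->len--;
    memmove(q->task, q->task + 1, q->len * sizeof(q->task[0]));
}

/* Run for 'ticks' ticks; cpu[i] receives the ticks consumed by task i. */
static void simulate(enum yield_policy policy, long ticks, long cpu[NTASKS])
{
    struct queue q[2] = { { { 0, 1 }, NTASKS }, { { 0, 0 }, 0 } };
    struct queue *active = &q[0], *expired = &q[1];
    int slice[NTASKS] = { TIMESLICE, TIMESLICE };

    cpu[0] = cpu[1] = 0;
    while (ticks-- > 0) {
        if (active->len == 0) {         /* array switch */
            struct queue *tmp = active;
            active = expired;
            expired = tmp;
        }
        int t = active->task[0];        /* head of the active queue runs */
        cpu[t]++;
        if (--slice[t] == 0) {          /* timeslice used up: expire */
            slice[t] = TIMESLICE;
            pop_head(active);
            push_tail(expired, t);
        } else if (t == 0) {            /* task 0 calls sched_yield() */
            pop_head(active);
            push_tail(policy == YIELD_TO_EXPIRED ? expired : active, t);
        }                               /* task 1 simply keeps the CPU */
    }
}
```

Under these assumptions simulate(YIELD_TO_EXPIRED, 10100, cpu) leaves
cpu[0] == 100 and cpu[1] == 10000, while simulate(YIELD_TO_TAIL, 10000, cpu)
gives 5000 ticks to each task.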

I think it is important that sched_yield() give processor time to all
tasks, and not just between multiple yielding tasks.

The current implementation does this.  If an application (*cough* Open
Office *cough*) calls sched_yield() over and over, what does it expect?

Now that we have futexes, sched_yield() no longer needs to be used as a
poor replacement for blocking, and it can have sane semantics, such as
_really_ yielding the processor.

> 	What else could be done?
> 	(a) drop the effective priority of the yielding task by a percentile,
> 	    but don't reduce the time slice!

This works, too.  We used to do this..

There are a couple bits that need to be added, though, to deal with
threads that call sched_yield() over and over (which are the ones where
we have problems).  We need to drop the task a priority level every time
it calls sched_yield().  Eventually it will reach the lowest priority
(or some earlier threshold we want to check for) and then we need to put
it on the expired list, like the current behavior.

So for the big offenders, I think this ends up being the same, no?

Also, this approach does not work for real-time tasks, for whom we must
not change their priority... so we end up just requeuing them, too.
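
Put as a decision rule, the scheme described above might look like the
sketch below.  It is a stand-alone illustration, not a patch:
decide_yield() and the yield_action enum are made-up names, and the
constants assume the 2.5 layout of 100 real-time levels below 40 normal
ones (MAX_RT_PRIO 100, MAX_PRIO 140), where a larger number means a lower
priority.

```c
#include <assert.h>

#define MAX_RT_PRIO 100     /* assumed: prio < 100 is a real-time task */
#define MAX_PRIO    140     /* assumed: 139 is the lowest priority     */

enum yield_action {
    YIELD_REQUEUE,      /* RT task: tail of its own priority queue   */
    YIELD_DROP_LEVEL,   /* normal task: requeue one priority lower   */
    YIELD_EXPIRE        /* already at the bottom: move to expired    */
};

/* Decide what one sched_yield() call does to a task at priority 'prio'. */
static enum yield_action decide_yield(int prio, int *new_prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);

    if (prio < MAX_RT_PRIO) {       /* must not touch RT priorities */
        *new_prio = prio;
        return YIELD_REQUEUE;
    }
    if (prio == MAX_PRIO - 1) {     /* nowhere lower to go */
        *new_prio = prio;
        return YIELD_EXPIRE;
    }
    *new_prio = prio + 1;           /* numerically higher = lower prio */
    return YIELD_DROP_LEVEL;
}
```

A task that yields in a loop walks down one level per call and ends up on
the expired array after at most 39 calls, which is why the result for the
big offenders matches the current behaviour.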

Just my thoughts...

	Robert Love


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 23:11     ` Rick Lindsley
@ 2003-04-30 23:21       ` Andrew Morton
  2003-04-30 23:47         ` viro
                           ` (2 more replies)
  2003-04-30 23:41       ` Robert Love
  1 sibling, 3 replies; 47+ messages in thread
From: Andrew Morton @ 2003-04-30 23:21 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: solt, linux-kernel, frankeh

Rick Lindsley <ricklind@us.ibm.com> wrote:
>
> 	Why is this bad?
> 	(a) if it does busy looping through sched_yield it will eat cycles which
> 	    might not have happened

Things like OpenOffice _do_ busy loop on sched_yield().  It appears with
that patch, OO will sit there chewing ~1% of CPU.  Not great, but not bad
either..

A few kernels ago, OpenOffice would take sixty seconds to just flop down a
menu if there was a kernel build happening at the same time.  That is just
utterly broken, so if we're going to leave the sched.c code as-is then we
*require* that all applications be updated to not spin on sched_yield.

There's just no question about that.  It may end up not being acceptable.

Has anyone looked at what Andrea did in -aa?  I assume some suitable
compromise was achieved there.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 19:11   ` Andrew Morton
@ 2003-04-30 23:11     ` Rick Lindsley
  2003-04-30 23:21       ` Andrew Morton
  2003-04-30 23:41       ` Robert Love
  2003-05-01  3:21     ` Mike Galbraith
  1 sibling, 2 replies; 47+ messages in thread
From: Rick Lindsley @ 2003-04-30 23:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Maciej Soltysiak, linux-kernel, frankeh

    The below patch should fix that up, but we need to decide whether
    the (rather unclear) advantages of the sched_yield() change outweigh
    the breakage which it caused linuxthreads applications.

Exactly right; we've gone back and forth on this a few times.  What fixes
one seems to break the other.  Hubertus Franke (frankeh@us.ibm.com)
has been trying to reply with this succinct summary of
advantages/disadvantages but is having some sort of DNS issues right now so
I'll post it for him:

   This goes back to the semantics of sched_yield().

   OLD: when sched_yield() is called the task moves to expired,
	every other task in the active queue will run first before the
	yielding task will run again.

   NEW: move the yielding task to the end of its current priority level,
	but keeps it active not expired.

	Why is this good?
	(a) the task will not lose its timeslice length, because moving it to
	    expired effectively does that.
	(b) it keeps the task responsive

	Why is this bad?
	(a) if it does busy looping through sched_yield it will eat cycles which
	    might not have happened

	What else could be done?
	(a) drop the effective priority of the yielding task by a percentile,
	    but don't reduce the time slice!

   Hubertus Franke
   email: frankeh@us.ibm.com
   (w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30 10:16 ` Maciej Soltysiak
@ 2003-04-30 19:11   ` Andrew Morton
  2003-04-30 23:11     ` Rick Lindsley
  2003-05-01  3:21     ` Mike Galbraith
  0 siblings, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2003-04-30 19:11 UTC (permalink / raw)
  To: Maciej Soltysiak; +Cc: linux-kernel

Maciej Soltysiak <solt@dns.toxicfilms.tv> wrote:
>
> > kernel/
> > -------
> >
> > - O(1) scheduler starvation, poor behaviour seems unresolved.
> >
> >   Jens: "I've been running 2.5.67-mm3 on my workstation for two days, and
> >   it still doesn't feel as good as 2.4.  It's not a disaster like some
> >   revisions ago, but it still has occasional CPU "stalls" where it feels
> >   like a process waits for half a second or so for CPU time.  That is very
> >   noticeable."
> Well, i had similar problems with 2.5 stalling, but now that i disabled
> preemptible kernel, it is better now. Are there no complaints about preempt?

I have not heard of any, apart from yours.

> Also there is one issue, i am not sure if this may be a kernel issue,
> but with setiathome running in a X desktop environment all apps work fine,
> but when i run openoffice, openoffice responds with 5 second delay.

That'll be the changed sched_yield() semantics.

The below patch should fix that up, but we need to decide whether the (rather
unclear) advantages of the sched_yield() change outweigh the breakage which
it caused linuxthreads applications.


diff -puN kernel/sched.c~sched_yield-hack kernel/sched.c
--- 25/kernel/sched.c~sched_yield-hack	2003-04-30 12:08:51.000000000 -0700
+++ 25-akpm/kernel/sched.c	2003-04-30 12:09:11.000000000 -0700
@@ -1992,7 +1992,7 @@ asmlinkage long sys_sched_yield(void)
 	 */
 	if (likely(!rt_task(current))) {
 		dequeue_task(current, array);
-		enqueue_task(current, rq->expired);
+		enqueue_task(current, array);
 	} else {
 		list_del(&current->run_list);
 		list_add_tail(&current->run_list, array->queue + current->prio);

_


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30  4:55 ` Florian Weimer
@ 2003-04-30 10:18   ` Maciej Soltysiak
  2003-05-01 11:24   ` David S. Miller
  1 sibling, 0 replies; 47+ messages in thread
From: Maciej Soltysiak @ 2003-04-30 10:18 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-kernel

> > net/
> > ----
>
> What about the dst cache DoS attack?
What about IPv6 SYN attacks?
tcp_ipv6.c: static int tcp_v6_conn_request(...)
...
        /*
         *      There are no SYN attacks on IPv6, yet...
         */

Nmap is ip6 enabled and is getting its razor sharpened.

Regards,
Maciej


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-29 22:57 Andrew Morton
                   ` (2 preceding siblings ...)
  2003-04-30  8:30 ` Benjamin Herrenschmidt
@ 2003-04-30 10:16 ` Maciej Soltysiak
  2003-04-30 19:11   ` Andrew Morton
  2003-05-01 14:33 ` Christoph Hellwig
  4 siblings, 1 reply; 47+ messages in thread
From: Maciej Soltysiak @ 2003-04-30 10:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

> kernel/
> -------
>
> - O(1) scheduler starvation, poor behaviour seems unresolved.
>
>   Jens: "I've been running 2.5.67-mm3 on my workstation for two days, and
>   it still doesn't feel as good as 2.4.  It's not a disaster like some
>   revisions ago, but it still has occasional CPU "stalls" where it feels
>   like a process waits for half a second or so for CPU time.  That is very
>   noticeable."
Well, i had similar problems with 2.5 stalling, but now that i disabled
preemptible kernel, it is better now. Are there no complaints about preempt?

Also there is one issue, i am not sure if this may be a kernel issue,
but with setiathome running in a X desktop environment all apps work fine,
but when i run openoffice, openoffice responds with 5 second delay.
I remember somebody noticing a problem with evolution that was related to
a kernel problem. Maybe i could help sorting it out?

Regards,
Maciej


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-29 22:57 Andrew Morton
  2003-04-29 23:22 ` John Bradford
  2003-04-30  3:19 ` William Lee Irwin III
@ 2003-04-30  8:30 ` Benjamin Herrenschmidt
  2003-04-30 10:16 ` Maciej Soltysiak
  2003-05-01 14:33 ` Christoph Hellwig
  4 siblings, 0 replies; 47+ messages in thread
From: Benjamin Herrenschmidt @ 2003-04-30  8:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel mailing list

On Wed, 2003-04-30 at 00:57, Andrew Morton wrote:

> - IDE suspend/resume without races (Ben is looking at this a little)

I have something that works not too badly for PPC already but that needs
some cleanup, to be tested/adapted to Pat's new work (especially tested
against his swsusp, and we shall still verify if it fits x86 needs)



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
       [not found] <20030429231009$1e6b@gated-at.bofh.it>
@ 2003-04-30  4:55 ` Florian Weimer
  2003-04-30 10:18   ` Maciej Soltysiak
  2003-05-01 11:24   ` David S. Miller
  0 siblings, 2 replies; 47+ messages in thread
From: Florian Weimer @ 2003-04-30  4:55 UTC (permalink / raw)
  To: linux-kernel

Andrew Morton <akpm@digeo.com> writes:

> net/
> ----

What about the dst cache DoS attack?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: must-fix list for 2.6.0
  2003-04-30  4:45   ` Andrew Morton
@ 2003-04-30  4:52     ` William Lee Irwin III
  2003-05-01  4:32     ` William Lee Irwin III
  1 sibling, 0 replies; 47+ messages in thread
From: William Lee Irwin III @ 2003-04-30  4:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>> I didn't notice anything specific here about sys_remap_file_pages() vs.
>> truncate() (sans objrmap); did a fix fly by that I didn't notice,
>> or was it less of an issue than I thought it was?

On Tue, Apr 29, 2003 at 09:45:37PM -0700, Andrew Morton wrote:
> That's just a bug.  We can either go through and unmap all the pages via
> their rmap chains, or mark the vma as nonlinear and just anonymise the pages
> and to heck with the SIGBUS.  I'm not particularly fussed either way
> really...

Okay, I'll just fill in if no one else appears to do the busywork.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> Also, the OOM killer fails to check lowmem; basically it just needs
>> -       if (nr_swap_pages > 0)
>> +       if (nr_swap_pages > 0 && nr_free_buffer_pages() > 0)
>> With that in addition to the OOM killer locking patch I posted and
>> another to completely eliminate mm-less processes from consideration
>> 64GB ia32 (with, of course, my oversized out-of-tree patch) recovers
>> from OOM instead of deadlocking after a mass-killing with swap online.

On Tue, Apr 29, 2003 at 09:45:37PM -0700, Andrew Morton wrote:
> Wanna send patch?

Absolutely; I'll arrange a more organized presentation around Thursday
(yes, I'm among the last-minute OLS people -- it couldn't be helped).


William Lee Irwin III <wli@holomorphy.com> wrote:
>> I'd be interested in more detailed descriptions of the user-level no
>> overcommit, dcache/icache, and truncated ext3 page issues after Thursday.

On Tue, Apr 29, 2003 at 09:45:37PM -0700, Andrew Morton wrote:
> The arithmetic in vm_enough_memory() is woefully inaccurate.  If you have no
> swap and then build up a lot of icache/dcache, vm_enough_memory()
> underestimates the amount of reclaimable memory by a lot and big mallocs
> fail.  If the i/dcache has internal fragmentation it gets even worse.
> I had a brief poke at that a while ago and decided it was basically hopeless.
> I suspect that assuming "all slab pages are reclaimable" would be the best
> fix here.

It sounds like some thought may be necessary if the above approach is
to be improved upon.


On Tue, Apr 29, 2003 at 09:45:37PM -0700, Andrew Morton wrote:
> The ext3 truncate pages are those pages which are on the LRU and have
> buffers, but that's _all_ they have.  They are instantly reclaimable and are
> basically free memory.  Only nobody knows that yet, so vm_enough_memory()
> gets it wrong.  The fix would be to nail these pages more aggressively in
> journal_unmap_buffer(), or to account for them and include that accounting in
> vm_enough_memory().  I'd prefer to just free the dang pages in
> journal_unmap_buffer().

Noted.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> The latter sounds easy to address. It actually sounds like a 2.4.x
>> compatibility fix.

On Tue, Apr 29, 2003 at 09:45:37PM -0700, Andrew Morton wrote:
> davem thinks we shouldn't need it, and I've seen no bug reports that indicate
> that we _do_ need it, but Andi says we do.
> Certainly something needs to be done in that area - a ppc64 box with 16G of
> memory (all ZONE_DMA) cruises along with just 1M of memory free.

Okay, I'll classify that as a back-burner issue.

Thanks.


-- wli


* Re: must-fix list for 2.6.0
  2003-04-30  3:19 ` William Lee Irwin III
@ 2003-04-30  4:45   ` Andrew Morton
  2003-04-30  4:52     ` William Lee Irwin III
  2003-05-01  4:32     ` William Lee Irwin III
  0 siblings, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2003-04-30  4:45 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Tue, Apr 29, 2003 at 03:57:31PM -0700, Andrew Morton wrote:
> > mm/
> > ---
> > - Overcommit accounting gets wrong answers
> >   - underestimates reclaimable slab, gives bogus failures when
> >     dcache&icache are large.
> >   - gets confused by reclaimable-but-not-freed truncated ext3 pages. 
> >     Lame fix exists in -mm.
> > - Proper user level no overcommit also requires a root margin adding
> 
> I didn't notice anything specific here about sys_remap_file_pages() vs.
> truncate() (sans objrmap); did a fix fly by that I didn't notice,
> or was it less of an issue than I thought it was?

That's just a bug.  We can either go through and unmap all the pages via
their rmap chains, or mark the vma as nonlinear and just anonymise the pages
and to heck with the SIGBUS.  I'm not particularly fussed either way
really...


> Also, the OOM killer fails to check lowmem; basically it just needs
> -       if (nr_swap_pages > 0)
> +       if (nr_swap_pages > 0 && nr_free_buffer_pages() > 0)
> 
> With that in addition to the OOM killer locking patch I posted and
> another to completely eliminate mm-less processes from consideration,
> 64GB ia32 (with, of course, my oversized out-of-tree patch) recovers
> from OOM instead of deadlocking after a mass-killing with swap online.

Wanna send patch?

> I'd be interested in more detailed descriptions of the user-level no
> overcommit, dcache/icache, and truncated ext3 page issues after Thursday.

The arithmetic in vm_enough_memory() is woefully inaccurate.  If you have no
swap and then build up a lot of icache/dcache, vm_enough_memory()
underestimates the amount of reclaimable memory by a lot and big mallocs
fail.  If the i/dcache has internal fragmentation it gets even worse.

I had a brief poke at that a while ago and decided it was basically hopeless.
I suspect that assuming "all slab pages are reclaimable" would be the best
fix here.
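
To put rough numbers on that, here is a small user-space model. It is only
a sketch: the thread does not show the real arithmetic in
vm_enough_memory(), so this assumes the estimate is derived from counts of
unused objects, which is one plausible way such an underestimate arises.
The struct, the function names and the 4096-byte page size are all
hypothetical.

```c
#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for the model */

/* Hypothetical snapshot of one slab cache (e.g. dentries or inodes). */
struct slab_snapshot {
	unsigned long pages;		/* pages currently backing the cache */
	unsigned long unused_objs;	/* objects that could be freed */
	unsigned long obj_size;		/* bytes per object */
};

/* Object-based estimate: bytes of unused objects, rounded down to pages. */
unsigned long reclaimable_by_objects(const struct slab_snapshot *s)
{
	return (s->unused_objs * s->obj_size) / MODEL_PAGE_SIZE;
}

/* The "all slab pages are reclaimable" estimate suggested above. */
unsigned long reclaimable_all_slab(const struct slab_snapshot *s)
{
	return s->pages;
}
```

For a cache occupying 1000 pages that holds 10000 unused 128-byte objects,
the object-based estimate credits only 312 pages, while the page-based one
credits all 1000.  The more sparsely the pages are filled, the wider that
gap gets.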


The ext3 truncate pages are those pages which are on the LRU and have
buffers, but that's _all_ they have.  They are instantly reclaimable and are
basically free memory.  Only nobody knows that yet, so vm_enough_memory()
gets it wrong.  The fix would be to nail these pages more aggressively in
journal_unmap_buffer(), or to account for them and include that accounting in
vm_enough_memory().  I'd prefer to just free the dang pages in
journal_unmap_buffer().


> > - Readd and make /proc/sys/vm/freepages writable again so that boxes can be
> >   tuned for heavy interrupt load.
> 
> The latter sounds easy to address. It actually sounds like a 2.4.x
> compatibility fix.

davem thinks we shouldn't need it, and I've seen no bug reports that indicate
that we _do_ need it, but Andi says we do.

Certainly something needs to be done in that area - a ppc64 box with 16G of
memory (all ZONE_DMA) cruises along with just 1M of memory free.



* Re: must-fix list for 2.6.0
  2003-04-29 22:57 Andrew Morton
  2003-04-29 23:22 ` John Bradford
@ 2003-04-30  3:19 ` William Lee Irwin III
  2003-04-30  4:45   ` Andrew Morton
  2003-04-30  8:30 ` Benjamin Herrenschmidt
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 47+ messages in thread
From: William Lee Irwin III @ 2003-04-30  3:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Tue, Apr 29, 2003 at 03:57:31PM -0700, Andrew Morton wrote:
> mm/
> ---
> - Overcommit accounting gets wrong answers
>   - underestimates reclaimable slab, gives bogus failures when
>     dcache&icache are large.
>   - gets confused by reclaimable-but-not-freed truncated ext3 pages. 
>     Lame fix exists in -mm.
> - Proper user level no overcommit also requires a root margin adding

I didn't notice anything specific here about sys_remap_file_pages() vs.
truncate() (sans objrmap); did a fix fly by that I didn't notice,
or was it less of an issue than I thought it was?

Also, the OOM killer fails to check lowmem; basically it just needs
-       if (nr_swap_pages > 0)
+       if (nr_swap_pages > 0 && nr_free_buffer_pages() > 0)

With that in addition to the OOM killer locking patch I posted and
another to completely eliminate mm-less processes from consideration,
64GB ia32 (with, of course, my oversized out-of-tree patch) recovers
from OOM instead of deadlocking after a mass-killing with swap online.
Not that I'd consider 64GB ia32 a supported platform for 2.5/2.6 (it's
a design limitation IMHO); it merely "stresses the OOM killer harder"
for the purposes of this discussion. Some kind of investigation is
probably needed to determine why eliminating mm-less processes from
consideration is necessary to obtain the desired behavior.
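
To make the effect of the one-line change concrete, here is a user-space
model of the two conditions.  It assumes the branch being patched is an
early "not really out of memory, don't kill anything" exit, which the diff
fragment above does not show.  The function names are made up, and the
plain integer arguments merely stand in for nr_swap_pages and
nr_free_buffer_pages().

```c
#include <stdbool.h>

/*
 * User-space model only, not kernel code.
 * Each function returns true when the OOM killer would decline to act.
 *
 * Old check: any free swap is taken as proof that we are not OOM.
 */
bool skip_oom_kill_old(long swap_pages_free)
{
	return swap_pages_free > 0;
}

/*
 * Patched check: free swap only counts if there are also free
 * low-memory (buffer-capable) pages left.
 */
bool skip_oom_kill_new(long swap_pages_free, long lowmem_pages_free)
{
	return swap_pages_free > 0 && lowmem_pages_free > 0;
}
```

With swap online but lowmem exhausted, the old predicate still says "skip"
and nothing gets killed, whereas the patched one lets the killer run.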

I'd be interested in more detailed descriptions of the user-level no
overcommit, dcache/icache, and truncated ext3 page issues after Thursday.


On Tue, Apr 29, 2003 at 03:57:31PM -0700, Andrew Morton wrote:
> mm/
> ---
> - objrmap: concerns over page reclaim performance at high sharing levels,
>   and interoperation with nonlinear mappings is hairy.
> - Readd and make /proc/sys/vm/freepages writable again so that boxes can be
>   tuned for heavy interrupt load.

The latter sounds easy to address. It actually sounds like a 2.4.x
compatibility fix.


-- wli


* Re: must-fix list for 2.6.0
  2003-04-29 23:22 ` John Bradford
@ 2003-04-29 23:37   ` Andrew Morton
  0 siblings, 0 replies; 47+ messages in thread
From: Andrew Morton @ 2003-04-29 23:37 UTC (permalink / raw)
  To: John Bradford; +Cc: linux-kernel

John Bradford <john@grabjohn.com> wrote:
>
> Is it too early for a 2.7 outline?

Yes.  Anything which gets booted from 2.6 implicitly gets another run in
2.7, or can be backported later.  Let's keep it tight and not disappear
down ratholes.


* Re: must-fix list for 2.6.0
  2003-04-29 22:57 Andrew Morton
@ 2003-04-29 23:22 ` John Bradford
  2003-04-29 23:37   ` Andrew Morton
  2003-04-30  3:19 ` William Lee Irwin III
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 47+ messages in thread
From: John Bradford @ 2003-04-29 23:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

> I shall be maintaining this list so we can understand where we are with
> respect to 2.6 readiness.  And so we can look at features and say "no". 
> And so we can look at bugs and say "not gating 2.6.0".

Is it too early for a 2.7 outline?  It might help to keep things that
are not ready out of 2.6 :-)

John.


* must-fix list for 2.6.0
@ 2003-04-29 22:57 Andrew Morton
  2003-04-29 23:22 ` John Bradford
                   ` (4 more replies)
  0 siblings, 5 replies; 47+ messages in thread
From: Andrew Morton @ 2003-04-29 22:57 UTC (permalink / raw)
  To: linux-kernel


Below is a first cut at tracking the major work items which should be
completed for a 2.6 release.

When considering these items it would be useful to have a clear idea of
what a 2.6.0 release is actually _for_.  Obviously, 2.6.0 doesn't mean
"it's finished, ship it".

I'd propose that 2.6.0 means that users can migrate from 2.4.x with a good
expectation that everything which they were using in 2.4 will continue to
work, and that the kernel doesn't crash, doesn't munch their data and
doesn't run like a dog.  Other definitions are welcome.


I shall be maintaining this list so we can understand where we are with
respect to 2.6 readiness.  And so we can look at features and say "no". 
And so we can look at bugs and say "not gating 2.6.0".

Things we should not track here are:

- Regular old bugs.  Please use bugzilla.

- Wishlist items.  This list is not a route for getting commitment for
  inclusion of $FAVEFEATURE.  In fact it's probably a good way of getting the
  feature shot down ;)

- Driver problems.  Most important drivers mostly work OK now.  Please use
  bugzilla.

Things which we should track here are significantly-sized outstanding
development activities which resolve big bugs or which address missing
features & speedups.

I've organised it into three main sections:

a) must-fix bugs which require significant amounts of work/restructuring
   to fix.

b) late features and speedups.

c) Important driver bugs.  This wasn't supposed to be here, but various
   contributors sent me a lot of details, and it would be sad to lose them.



The list is already very long, and very incomplete.  Additions (and
removals!!!) are sought.   Thanks.

And thanks to the various contributors who helped pull this together.



Must-fix bugs
=============

drivers/char/
-------------

- TTY locking is broken (see FIXME in do_tty_hangup())

  "One bug that was found is that the dropping of lock_kernel from do_exit
   caused races in the exit tty cleanup.  There was a patch for that, but I'm
   not sure it was merged."


drivers/block/
--------------

- RAID0 dies on strangely aligned BIOs

  - Need to hoist BIO-split code out of device mapper, use that.

 (neilb)

 1/ RAID5 should work fine.  It accepts any sort of bio and always
    submits a 1-page bio to the underlying device, and if my
    understanding is correct, every device must be able to handle a
    single page bio, no matter what the alignment (which is why raid0
    has a problem - it doesn't). 

 2/ RAID1 works pretty well.  The only improvement needed is to define
    a merge_bvec_fn function which passes the question down to lower
    layers.  This should be easy except for the small fact that it is
    impossible :-)  There is no enforced pairing between calls to
    merge_bvec_fn and submit_bh, so it is possible that a hot spare
    with different restrictions could get swapped in between the one
    and the other and could confuse things.  I suspect that can be
    worked around somehow though...

       Someone sent me a patch that is sorely needed - it allows you
       to simply call blk_queue_stack() (or something like that), and it will
       get your stacked limits set appropriately.

 3/ I just realised that raid0 is easier than I had previously
    thought.  We don't need the completely functional bio splitting
    that dm has.  We only need to be able to split a bio that has just
    one page as the use of merge_bvec_fn will ensure that we never get
    a larger bio that we cannot handle.  And splitting a bio with only
    one page is a lot easier.  I now have code in my tree that
    implements this quite cleanly and will probably post a patch
    during the week.
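
    The single-page split in 3/ comes down to a little arithmetic, which
    can be sketched in user space as below.  This is not the code from
    neilb's tree.  The function name is hypothetical, and it assumes
    power-of-two chunk sizes and a request no longer than one chunk, so
    that at most one boundary is crossed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Split a request of `len` sectors starting at `sector` at the next
 * chunk boundary.  `chunk_sectors` must be a non-zero power of two.
 * *first receives the sectors that stay in the current chunk,
 * *second the sectors that spill into the next one (0 if none).
 */
void split_at_chunk(uint64_t sector, unsigned int len,
		    unsigned int chunk_sectors,
		    unsigned int *first, unsigned int *second)
{
	unsigned int room;

	/* Precondition: non-zero power of two. */
	assert(chunk_sectors != 0 &&
	       (chunk_sectors & (chunk_sectors - 1)) == 0);

	/* Sectors left before the end of the chunk holding `sector`. */
	room = chunk_sectors - (unsigned int)(sector & (chunk_sectors - 1));

	if (len <= room) {
		*first = len;
		*second = 0;
	} else {
		*first = room;
		*second = len - room;
	}
}
```

    With 128-sector chunks, an 8-sector request starting at sector 126
    splits into 2 + 6, while the same request at sector 120 fits in one
    chunk and is not split.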

- ideraid hasn't been ported to 2.5 at all yet.

- CD burning.  There are still a few quirks to solve wrt SG_IO and ide-cd.

  Jens: The basic hang has been solved (double fault in ide-cd), there still
  seems to be some cases that don't work too well.  Don't really have a
  handle on those :/

- IDE tcq. Either kill it or fix it. Not a "big todo", as such.

drivers/video/
--------------

- Lots of drivers don't compile, others do but don't work.

fs/
---

- NFS client gets an OOM deadlock.

  - Some fixes exist in -mm.  Seem to mostly work.

- NFS client runs very slowly consuming 100% CPU under heavy writeout.

  - Unsubtle fix exists in -mm.  (Looks like it's fixed anyway).

- ext3 data=journal mode is bust.

- ext3/htree doesn't play right with NFS server.  90% fixed in -mm.

- AIO/direct-IO writes can race with truncate and wreck filesystems.

  - Easy fix is to only allow the feature for S_ISBLK files.

- davej: NFS seems to have a really bad time for some people.  (Including
  myself on one testbox).  The common factor seems to be a high spec client
  torturing an underpowered NFS server with lots of IO.  (fsx/fsstress etc
  show this up).  Lots of "NFS server cheating" messages get dumped, and a
  whole lot of bogus packets start appearing.  They look severely corrupted,
  (they even crashed ethereal once 8-)


kernel/
-------

- O(1) scheduler starvation, poor behaviour seems unresolved.

  Jens: "I've been running 2.5.67-mm3 on my workstation for two days, and
  it still doesn't feel as good as 2.4.  It's not a disaster like some
  revisions ago, but it still has occasional CPU "stalls" where it feels
  like a process waits for half a second or so for CPU time.  That is very
  noticeable."

   Also see Mike Galbraith's work.

- Alan: 32bit uid support is *still* broken for process accounting.

  (Test case?)

mm/
---

- Overcommit accounting gets wrong answers

  - underestimates reclaimable slab, gives bogus failures when
    dcache&icache are large.

  - gets confused by reclaimable-but-not-freed truncated ext3 pages. 
    Lame fix exists in -mm.

- Proper user level no overcommit also requires a root margin adding

modules
-------

  (Rusty)

- The .modinfo patch needs to go in.  It's trivial, but it's the major
  missing functionality vs. 2.4.  Keeps bouncing off Linus.

- __module_get(): "I know I have a refcount already and I don't care
  if they're doing rmmod --wait, gimme.".  Keeps bouncing off Linus.

- Per-cpu support inside modules (have patch, in testing).

- driver class code is getting redone.  I have this now working, and will
  send it out in a few days.

net/
----

  (davem)

- UDP apps can in theory deadlock, because the ip_append_data path can end
  up sleeping while the socket lock is held.

  It is OK to sleep with the socket lock held, normally.  But in this case
  the sleep happens while waiting for socket memory/space to become
  available, if another context needs to take the socket lock to free up the
  space we could hang.

  I sent a rough patch on how to fix this to Alexey, and he is analyzing
  the situation.  I expect a final fix from him next week or so.

- Semantics for IPSEC during operations such as TCP connect suck currently.

  When we first try to connect to a destination, we may need to ask the
  IPSEC key management daemon to resolve the IPSEC routes for us.  For the
  purposes of what the kernel needs to do, you can think of it like ARP.  We
  can't send the packet out properly until we resolve the path.

  What happens now for IPSEC is basically this:

  O_NONBLOCK: returns -EAGAIN over and over until route is resolved

  !O_NONBLOCK: Sleeps until route is resolved

  These semantics are total crap.  The solution, which Alexey is working
  on, is to allow incomplete routes to exist.  These "incomplete" routes
  merely put the packet onto a "resolution queue", and once the key manager
  does its thing we finish the output of the packet.  This is precisely how
  ARP works.

  I don't know when Alexey will be done with this.

- There are those mysterious TCP hangs of established state sockets. 
  Someone has to get a good log in order for us to effectively debug this.
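
  The "incomplete route" idea from the IPSEC item above can be modelled in
  a few lines of user-space C.  This is only a sketch of the queue-then-flush
  behaviour described there, not Alexey's design.  Every name in it is
  hypothetical, and a real implementation would need locking, timeouts and
  proper packet buffers.

```c
#include <stddef.h>

#define PENDING_MAX 8	/* arbitrary queue depth for the sketch */

/* Hypothetical route entry; none of these names come from the kernel. */
struct route {
	int resolved;				/* 0 until the key manager answers */
	const char *pending[PENDING_MAX];	/* packets awaiting resolution */
	size_t npending;
	size_t nsent;				/* packets handed to the device */
};

/* Stand-in for the real transmit path: just counts the packet. */
static void transmit(struct route *rt, const char *pkt)
{
	(void)pkt;
	rt->nsent++;
}

/*
 * Output neither sleeps nor returns -EAGAIN: on an incomplete route
 * the packet is parked on the resolution queue.  Returns 0 on success,
 * -1 if the queue is full and the packet is dropped.
 */
int route_output(struct route *rt, const char *pkt)
{
	if (rt->resolved) {
		transmit(rt, pkt);
		return 0;
	}
	if (rt->npending == PENDING_MAX)
		return -1;
	rt->pending[rt->npending++] = pkt;
	return 0;
}

/* Called once the key manager has done its work: flush in order. */
void route_resolved(struct route *rt)
{
	size_t i;

	rt->resolved = 1;
	for (i = 0; i < rt->npending; i++)
		transmit(rt, rt->pending[i]);
	rt->npending = 0;
}
```

  The caller sees the same behaviour whether or not O_NONBLOCK is set:
  the first packets are accepted immediately and go out once resolution
  completes, as with ARP.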



net/*/netfilter/
----------------

  (Rusty)

- Handle non-linear skbs everywhere.  This is going in via Dave now.

- Rework conntrack hashing.

- Module relationship bogosity fix (trivial, have patch).


global
------

- Lots of 2.4 fixes including some security are not in 2.5

- There are about 60 or 70 security related checks that need doing
  (copy_user etc) from Stanford tools

- A couple of hundred real looking bugzilla bugs




Not-ready features and speedups
===============================


drivers/block/
--------------

- Framework for selecting IO schedulers.  This is the main one really. 
  Once this is in place we can drop in new schedulers any old time, no risk.

- Dynamic disk request allocation.  Patch exists.

- Runtime-selectable disk scheduler framework.

- Anticipatory scheduler.  Working OK now, still has problems with seeky
  OLTP-style loads.

- CFQ scheduler.  Seems to work but Jens planning significant rework.

- The feral.com qlogic driver: needs work.


fs/
---

- reiserfs_file_write() speedup.  There are concerns that some applications
  do the wrong thing with large stat.st_blksize.

- ext3 lock_kernel() removal: that part works OK and is mergeable.  But
  we'll also need to make lock_journal() a spinlock, and that's deep surgery.

- 32bit quota needs a lot more testing but may work now

- Integrate Chris Mason's 2.4 reiserfs ordered data and data journaling
  patches.  They make reiserfs a lot safer.

- (Trond:) Yes: I'm still working on an atomic "open()", i.e.  one
           where we short-circuit the usual VFS path_walk() + lookup() +
           permission() + create() + ....  bullsh*t...

           I have several reasons for wanting to do this (all of
           them related to NFS of course, but much of the reasoning applies
           to *all* networked file systems).

   1) The above sequence is simply not atomic on *any* networked
      filesystem.

   2) It introduces a sh*tload of completely unnecessary RPC calls (why
      do a 'permission' RPC call when the server is in *any* case going to
      tell you whether or not this operation is allowed.  Why do a
      'lookup()' when the 'create()' call can be made to tell you whether or
      not a file already exists).

   3) It is incompatible with some operations: the current create()
      doesn't pass an 'EXCLUSIVE' flag down to the filesystems.

   4) (NFS specific?) open() has very different cache consistency
      requirements when compared to most other VFS operations.

   I'd very much like for something like Peter Braam's 'lookup with
   intent' or (better yet) for a proper dentry->open() to be integrated with
   path_walk()/open_namei().  I'm still working on the latter (Peter has
   already completed the lookup with intent stuff).


kernel/
-------

  (Rusty)

- Zippel's Reference count simplification.  Tricky code, but cuts about 120
  lines from module.c.  Patch exists, needs stressing.

- /proc/kallsyms.  What most people really wanted from /proc/ksyms.  Patch
  exists.

- Fix module-failed-init races by starting module "disabled".  Patch
  exists, requires some subsystems (ie.  add_partition) to explicitly say
  "make module live now".  Without patch we are no worse off than 2.4 etc. 

- Integrate userspace irq balancing daemon.

mm/
---

- objrmap: concerns over page reclaim performance at high sharing levels,
  and interoperation with nonlinear mappings is hairy.

- Readd and make /proc/sys/vm/freepages writable again so that boxes can be
  tuned for heavy interrupt load.

net/
----

  (davem)

- Real serious use of IPSEC is hampered by lack of MPLS support.  MPLS is a
  switching technology that works by switching based upon fixed length labels
  prepended to packets.  Many people use this and IPSEC to implement VPNs
  over public networks; it is also used for things like traffic engineering.

  A good reference site is:

	http://www.mplsrc.com/

  Anyways, an existing (crappy) implementation exists.  I've almost
  completed a rewrite, I should have something in the tree next week.

- Sometimes we generate IP fragments when it truly isn't necessary.

  The way IP fragmentation is specified, each fragment must be a multiple of
  8 bytes in length.  So suppose the device has an MTU that is not 0 modulo
  8; even ethernet falls into this class: 1500 == (8 * 187) + 4

  Our IP fragmenting engine can fragment on packets that are sized within
  the last modulo 8 bytes of the MTU.  This happens in obscure cases, but it
  does happen.

  I've proposed a fix to Alexey, whereby very late in the output path we
  check the packet, if we fragmented but the data length would fit into the
  MTU we unfragment the packet.

  This is low priority, because technically it creates suboptimal behavior
  rather than mis-operation.

- IPV4 output engine changes for IPSEC need to be moved over to IPV6.

  IPV6 ipsec works but gravely suboptimally in some cases.  It is also for
  this reason that the zerocopy UDP stuff isn't functional on the ipv6 side.

  The USAGI project (www.linux-ipv6.org) is working with Alexey on this
  work.
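
  The arithmetic behind the unnecessary-fragments item above can be worked
  through in a few lines.  This is a sketch, not the stack's code.  It
  assumes the decision to fragment compares the datagram against a
  per-fragment limit whose payload has been rounded down to a multiple of
  8, which the thread does not spell out.  The function names and the
  24-byte header (20 bytes plus 4 bytes of IP options) are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Largest on-wire size of a non-final fragment: the payload after the
 * IP header must be a multiple of 8 bytes, so round it down.
 */
unsigned int max_fragment_len(unsigned int mtu, unsigned int hdr_len)
{
	assert(mtu >= hdr_len);
	return ((mtu - hdr_len) & ~7u) + hdr_len;
}

/* Late check from the proposed fix: does the whole datagram fit? */
bool fits_unfragmented(unsigned int total_len, unsigned int mtu)
{
	return total_len <= mtu;
}
```

  With a 1500-byte MTU and a 24-byte header the per-fragment limit is
  1496, so a 1498-byte datagram exceeds it and would be split, even
  though it fits in the MTU whole.  That is the case the late
  "unfragment" check is meant to catch.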

net/*/netfilter/
----------------

- Lots of misc. cleanups, which are happening slowly.

- davem: Netfilter needs to stop linearizing packets as much as possible.

  Zerocopy output packets are basically undone by netfilter because all of
  it assumed it was working with linear socket buffers.

  Rusty is fixing this piece by piece.  He is nearly done with this work. 

power management
----------------

  (Pat) There is some preliminary work at bk://ldm.bkbits.net/linux-2.5-power,
  though I'm currently in the process of reworking it.  

  It includes: 

- New device power management core code, both for individual devices, 
  and for global state transitions. 

- A generic user interface for triggering system power state transitions.

- Arch-independent code for performing state transitions, that calls 
  platform-specific methods along the way. 

- A better suspend-to-disk mechanism than swsusp. 

  There are various other details to be worked out, which are the real fun
  part.  And of course, driver support, but that is something that can happen
  at any time.  

  (Alan)

- PCI locking

- Frame buffer restore codepaths (that requires some deep PCI magic)

- XFree86 hooks

- AGP restoration

- DRI restoration

- IDE suspend/resume without races (Ben is looking at this a little)

- How to deal with devices that babble (some stuff we have to global IRQ
  off to save, and global IRQ on -after- we recover with APM)

- Pat's swsusp rework?

arch/i386/
----------

- Andi: i386 sub architectures for common boxes (in particular bigsmp and
  summit) need to be runtime probed options, not compile time.  Vendors
  cannot ship a separate kernel rpm for all these cases.  (patch is in -mm, works
  OK).

- Also PC9800 merge needs finishing to the point we want for 2.6 (not all).

- ES7000 wants merging (now we are all happy with it).  That shouldn't be a
  big problem.

global
------

- 64-bit dev_t.  Seems almost ready, but it's not really known how much
  work is still to do.  Patches exist in -mm but with the recent rise of the
  neo-viro I'm not sure where things are at.

- We need a kernel side API for reporting error events to userspace (could
  be async to 2.6 itself)

  (Prototype core based on netlink exists)

- Kai: Introduce a sane, easy and standard way to build external modules

- Kai: Allow separate src/objdir





drivers
=======

- Alan: PCI random reordering from 2.4 to 2.5 isn't understood yet (might be
  fixed now?)

- Alan: We have multiple drivers walking the pci device lists and also
  using things like pci_find_device in unsafe ways with no refcounting.  I
  think we have to make pci_find_device etc refcount somewhere and add
  pci_device_put as was done with networking.

- Lots of network drivers don't even build

- Alan: PCI hotplug is unsafe (locking is totally screwed)

- Ditto cardbus

- Alan: Cardbus/PCMCIA requires all Russell's stuff is merged to do
  multiheader right and so on
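
  The refcounting Alan asks for in the pci_find_device item above follows
  the usual get/put pattern, sketched here in user space.  The names
  dev_find_get() and dev_put() are hypothetical and are not the PCI API.
  The sketch is also single-threaded, so a real version would need atomic
  counts or a lock around the list walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device record; real PCI structures are far richer. */
struct dev {
	unsigned int id;
	int refcount;
};

/*
 * Look up a device by id and take a reference before returning it,
 * so it cannot vanish while the caller is still using it.
 * Returns NULL if no device matches.
 */
struct dev *dev_find_get(struct dev *table, size_t n, unsigned int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i].id == id) {
			table[i].refcount++;
			return &table[i];
		}
	}
	return NULL;
}

/* Drop a reference; returns 1 when the last reference went away. */
int dev_put(struct dev *d)
{
	assert(d->refcount > 0);
	return --d->refcount == 0;
}
```

  Every successful lookup must be paired with a put.  The unsafe pattern
  described above is the lookup without the reference, which leaves the
  caller holding a pointer that hotplug can free underneath it.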

drivers/acpi/
-------------

- davej: ACPI has a number of failures right now.  There are a number of
  entries in bugzilla which could all be the same bug.  It manifests as a
  "network card doesn't receive packets" problem; booting with 'acpi=off
  noapic' fixes it.

- davej: There's also another nasty 'doesn't boot' bug which quite a few
  people (myself included) are seeing on some boxes (especially laptops).

drivers/block/
--------------

- Alan: Partition handling is hosed for DM users.  (I have some partly
  debugged patches in the -ac tree, but Andries objects to them and I think
  his "user knows magic options" hack is unacceptable too.  Mostly this is
  figuring out the right answer)

- Floppy is almost unusably buggy still

drivers/char/
-------------

- Alan: Multiple serious bugs in the DRI drivers (most now with patches
  thankfully).  "The badness I know about is almost entirely IRQ mishandling.
   DRI failing to mask PCI irqs on exit paths."

- Various suspect things in AGP.

drivers/ide/
------------

  (Alan)

- IDE requires bio walking

- IDE PIO has occasional unexplained PIO disk eating reports

- IDE has multiple zillions of races/hangs in 2.5 still

- IDE eats disks with HPT372N on 2.5.x

- IDE scsi needs rewriting

- IDE needs significant reworking to handle Simplex right

- IDE hotplug handling for 2.5 is completely broken still

drivers/isdn/
-------------

  (Kai, rmk)

- isdn_tty locking is completely broken (cli() and friends)

- fix lots of remaining bugs in the isdn link layer / hisax protocol layer
  / hisax subdrivers, so that at least 99% of the users have a usable ISDN
  subsystem

- fix other drivers

- lots more cleanups, adaptation to recent APIs etc

- fixup tty-based ISDN drivers which provide TIOCM* ioctls (see my recent
  3-set patch for serial stuff)

  Alternatively, we could re-introduce the fallback to driver ioctl parsing
  for these if not enough drivers get updated.

- fixup the usb-serial core and drivers to provide support for this
  patch.

drivers/net/
------------

- davej: Either Wireless network drivers or PCMCIA broke somewhen.  A
  configuration that worked fine under 2.4 doesn't receive any packets.  Need
  to look into this more to make sure I don't have any misconfiguration that
  just 'happened to work' under 2.4


drivers/scsi/
-------------

- Half of SCSI doesn't compile

arch/i386/
----------

- 2.5.x won't boot on some 440GX

- 2.5.x doesn't handle VIA APIC right yet - don't know why

- ACPI needs the relax patches merging to work on lots of laptops

- ECC driver questions are not yet sorted (DaveJ is working on this)

arch/x86_64/
------------

  (Andi)

- time handling is broken. Need to move up 2.4 time.c code.

- memory corruption with IOMMU pci_free_consistent - often causes crashes
  at shutdown.  This is rather mysterious, the code is basically identical to
  2.4 which works fine.  Can only be seen on systems with >4GB of memory or
  with iommu=force

- Another report of a crash at shutdown on Simics with no iommu when all
  memory was used.  Could be related to the one above.

- change_page_attr corrupts memory/crashes. Breaks some AGP users.

- NMI watchdog seems to tick too fast

- some fixes from 2.4 still need to be merged

- not very well tested. probably more bugs lurking.




end of thread, other threads:[~2003-05-02 22:08 UTC | newest]

Thread overview: 47+ messages
     [not found] <20030429155731.07811707.akpm@digeo.com.suse.lists.linux.kernel>
2003-04-30  1:36 ` must-fix list for 2.6.0 Andi Kleen
2003-04-30 18:09   ` Pavel Machek
2003-04-30 18:15     ` Andi Kleen
2003-04-30 19:11       ` Pavel Machek
     [not found] <20030429231009$1e6b@gated-at.bofh.it>
2003-04-30  4:55 ` Florian Weimer
2003-04-30 10:18   ` Maciej Soltysiak
2003-05-01 11:24   ` David S. Miller
2003-05-01 11:27     ` Florian Weimer
2003-05-01 12:03       ` David S. Miller
2003-05-01 12:05       ` David S. Miller
2003-05-02 16:28         ` Carl-Daniel Hailfinger
2003-05-02 21:14           ` David S. Miller
2003-04-29 22:57 Andrew Morton
2003-04-29 23:22 ` John Bradford
2003-04-29 23:37   ` Andrew Morton
2003-04-30  3:19 ` William Lee Irwin III
2003-04-30  4:45   ` Andrew Morton
2003-04-30  4:52     ` William Lee Irwin III
2003-05-01  4:32     ` William Lee Irwin III
2003-04-30  8:30 ` Benjamin Herrenschmidt
2003-04-30 10:16 ` Maciej Soltysiak
2003-04-30 19:11   ` Andrew Morton
2003-04-30 23:11     ` Rick Lindsley
2003-04-30 23:21       ` Andrew Morton
2003-04-30 23:47         ` viro
2003-04-30 23:59           ` Andrew Morton
2003-05-01  6:27             ` Christoph Hellwig
2003-05-01  5:49               ` Shawn
2003-05-01 10:14                 ` Christoph Hellwig
2003-05-01  9:47                   ` Alan Cox
2003-05-01 15:44               ` Robert Love
2003-05-01  0:09           ` Robert Love
2003-05-01  0:51             ` Gerrit Huizenga
2003-05-01  5:40               ` Dave Hansen
2003-05-01 11:57                 ` Bill Huey
2003-04-30 23:53         ` Robert Love
2003-05-01  8:36         ` Arjan van de Ven
2003-05-01  8:42           ` Andrew Morton
2003-05-01  8:47             ` Arjan van de Ven
2003-04-30 23:41       ` Robert Love
2003-05-01 16:50         ` Hubertus Franke
2003-05-01  3:21     ` Mike Galbraith
2003-05-01  6:26       ` Mike Galbraith
2003-05-01 14:33 ` Christoph Hellwig
2003-05-01 18:42   ` Andrew Morton
2003-05-01 18:47     ` Christoph Hellwig
2003-05-02  1:57       ` Andreas Boman
