linux-kernel.vger.kernel.org archive mirror
* RE: 2.6 upgrade overall failure report
@ 2005-05-02 15:31 Moore, Eric Dean
  0 siblings, 0 replies; 6+ messages in thread
From: Moore, Eric Dean @ 2005-05-02 15:31 UTC (permalink / raw)
  To: Hubert Tonneau, Andrew Morton; +Cc: linux-kernel, Jeff Garzik, David S. Miller

On Friday, April 29, 2005 3:57 AM, Hubert Tonneau wrote:
> Andrew Morton wrote:
> >
> > >  . I reported a year ago that SCSI fusion was unable to properly recover from
> > >    tiny errors under 2.6 as opposed to 2.4 ... and got hit by the same problem
> > >    6 months later
> > 
> > Please send a full report to Eric Moore and cc linux-scsi@vger.kernel.org
> 
> Already done.
> My initial report to Eric Moore is dated May 15 2004 (2.6.6), so you can
> understand why I'm a bit afraid of being hit by the same bug more than 6 months
> and 3 stable kernel releases later.
> After the second report dated January 26 2005 (2.6.9),
> he sent me fusion 3.1.19 saying it solves the problem.
> As far as I could check, no official kernel is running 3.1.19.
> 2.6.12 will jump to 3.1.20, so I can assume it's solved;
> although I have no way to verify it, since the bug will happen only in case of
> something going wrong on the SCSI chain, which tends to happen only
> once every several months, and not on all servers.


There have been improvements made in the error handling area in
the mpt fusion driver.  Basically the timers were removed.
This was done in the 3.01.19 driver, which was posted around
early February 2005.  That support is still there in all driver
versions posted since then.

Eric


 


* Re: 2.6 upgrade overall failure report
  2005-04-29  9:17 ` Andrew Morton
@ 2005-04-29 15:56   ` David S. Miller
  0 siblings, 0 replies; 6+ messages in thread
From: David S. Miller @ 2005-04-29 15:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hubert.tonneau, linux-kernel, Eric.Moore, jgarzik

On Fri, 29 Apr 2005 02:17:56 -0700
Andrew Morton <akpm@osdl.org> wrote:

> >  . There is still a memory leak problem (probably in tigon3 driver since others
> >    reported so on kernel mailing list, and tigon3 is not geek hardware since
> >    most low-end servers nowadays use either tigon3 or pro1000)
> 
> Please send a report to David Miller and Jeff Garzik and cc netdev@oss.sgi.com

This is the first I've ever heard of any such leak; more likely
the leak is in the networking code somewhere.

> >  . Since 2.6.10, the TCP task does not work anymore with OSX (2 Mbps instead
> >    of 60 Mbps on a 100 Mbps wire)
> 
> Please send a full report to David Miller and cc netdev@oss.sgi.com.
> 
> Also please describe a simple way of reproducing this - I'll see if it
> happens here.

It only happens with OS-X and it has to do with how they handle the fast
path of TCP input.  It's a known problem but no satisfactory solution
exists yet.  When the fast path in OS-X TCP input is hit, they always
delay ACKs by a full 500ms; there isn't much Linux can do about broken
behavior like that.
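
As a rough, back-of-the-envelope illustration of why a 500ms delayed ACK is so costly, the sketch below assumes a deliberately simple model: the sender transmits one window of data at line rate and then stalls until the delayed ACK arrives. The function name, the 64 KB window, and the stall-on-every-window behavior are assumptions made for illustration; they do not describe the actual Linux or OS-X TCP code paths.

```python
def stalled_throughput_bps(window_bytes, link_bps, ack_delay_s):
    """Estimate throughput when every window of data waits for a delayed ACK.

    Hypothetical model: time per window = time to put the window on the
    wire + the receiver's ACK delay. Propagation delay is ignored.
    """
    window_bits = window_bytes * 8
    transmit_time_s = window_bits / link_bps
    return window_bits / (transmit_time_s + ack_delay_s)


if __name__ == "__main__":
    # Assumed values: 64 KB window, 100 Mbps wire, 500 ms ACK delay.
    rate = stalled_throughput_bps(65535, 100e6, 0.5)
    print(f"{rate / 1e6:.2f} Mbps")
```

Under these assumptions the estimate comes out at about 1 Mbps on a 100 Mbps wire. The real stall pattern is probably less regular than this model, so the figure only shows the order of magnitude of the slowdown.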

We are thinking of possible workarounds, but this bug is very low priority
since it is really a MAC OS-X issue.

Anyways, I'm in Chicago until Monday so won't be able to look into anything
in detail until then.


* Re: 2.6 upgrade overall failure report
@ 2005-04-29  9:57 Hubert Tonneau
  0 siblings, 0 replies; 6+ messages in thread
From: Hubert Tonneau @ 2005-04-29  9:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Jeff Garzik, David S. Miller, Moore, Eric Dean

Andrew Morton wrote:
>
> >  . I reported a year ago that SCSI fusion was unable to properly recover from
> >    tiny errors under 2.6 as opposed to 2.4 ... and got hit by the same problem
> >    6 months later
> 
> Please send a full report to Eric Moore and cc linux-scsi@vger.kernel.org

Already done.
My initial report to Eric Moore is dated May 15 2004 (2.6.6), so you can
understand why I'm a bit afraid of being hit by the same bug more than 6 months
and 3 stable kernel releases later.
After the second report dated January 26 2005 (2.6.9),
he sent me fusion 3.1.19 saying it solves the problem.
As far as I could check, no official kernel is running 3.1.19.
2.6.12 will jump to 3.1.20, so I can assume it's solved;
although I have no way to verify it, since the bug will happen only in case of
something going wrong on the SCSI chain, which tends to happen only
once every several months, and not on all servers.

> >  . There is still a memory leak problem (probably in tigon3 driver since others
> >    reported so on kernel mailing list, and tigon3 is not geek hardware since
> >    most low-end servers nowadays use either tigon3 or pro1000)
> 
> Please send a report to David Miller and Jeff Garzik and cc netdev@oss.sgi.com

I must apologize for this one because it's solved in 2.6.11.
When I posted to kernel mailing list about it (January 28 2005) I could not
track the memory leak problem down to tigon3 driver. Only posts from others
led me to the conclusion, and after migrating from 2.6.10 to 2.4, I forgot to
retest with 2.6.11.
My fault.

> >  . Since 2.6.10, the TCP task does not work anymore with OSX (2 Mbps instead
> >    of 60 Mbps on a 100 Mbps wire)
> 
> Please send a full report to David Miller and cc netdev@oss.sgi.com.
> 
> Also please describe a simple way of reproducing this - I'll see if it
> happens here.

They are very much aware of the problem (initial post dated January 5 2005)
and tracked it to OSX's surprising handling of delayed ACKs. So, as a
developer who fairly often has to deal with interfacing to closed systems, I can
understand how hard their task is, since they have to work around the problem
instead of OSX being fixed.

If you want to try to reproduce it, you need a gigabit connected PC (Intel
pro1000 in my case, but as far as I could understand it, it is not related
to pro1000 offload capabilities) pushing data (TCP connection such as
libsmbclient) through a gigabit switch to a 100 Mbps connected Mac OSX 10.3.
In short, you need a situation where flow control from Linux to OSX is needed.
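
A minimal sketch of a sender that could stand in for libsmbclient in such a test: it pushes data over a single TCP connection for a fixed time and reports the achieved rate. The function names, chunk size, and the loopback self-check are assumptions for illustration. In a real test the sink would run on the 100 Mbps OSX host and the sender on the gigabit Linux PC.

```python
import socket
import threading
import time


def push_data(host, port, seconds=5.0, chunk_size=64 * 1024):
    """Send zero-filled chunks over one TCP connection; return the rate in Mbps.

    The rate is measured at the sender, so it reflects how fast the kernel
    accepts data. Over a run of several seconds, flow control keeps this
    close to what the receiver actually takes.
    """
    payload = bytes(chunk_size)
    sent = 0
    with socket.create_connection((host, port)) as sock:
        start = time.monotonic()
        deadline = start + seconds
        while time.monotonic() < deadline:
            sock.sendall(payload)
            sent += len(payload)
        elapsed = time.monotonic() - start
    return sent * 8 / elapsed / 1e6


def run_sink(listener):
    """Accept one connection and discard everything received on it."""
    conn, _ = listener.accept()
    with conn:
        while conn.recv(1 << 16):
            pass


if __name__ == "__main__":
    # Loopback self-check only; it does not reproduce the OSX slowdown.
    listener = socket.create_server(("127.0.0.1", 0))
    threading.Thread(target=run_sink, args=(listener,), daemon=True).start()
    port = listener.getsockname()[1]
    print(f"{push_data('127.0.0.1', port, seconds=1.0):.1f} Mbps")
```

Comparing the reported rate against the 60 Mbps and 2 Mbps figures above, on otherwise identical hardware with different kernels, would show whether the slowdown is present.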



* Re: 2.6 upgrade overall failure report
  2005-04-16 16:20 Hubert Tonneau
  2005-04-16 17:59 ` Alejandro Bonilla
@ 2005-04-29  9:17 ` Andrew Morton
  2005-04-29 15:56   ` David S. Miller
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2005-04-29  9:17 UTC (permalink / raw)
  To: Hubert Tonneau
  Cc: linux-kernel, Moore, Eric Dean, Jeff Garzik, David S. Miller

Hubert Tonneau <hubert.tonneau@fullpliant.org> wrote:
>
> Right from the beginning, the core 2.6 kernel was rock solid for me, so I had
>  no crash to complain about, but ...
> 
>  . I reported a year ago that SCSI fusion was unable to properly recover from
>    tiny errors under 2.6 as opposed to 2.4 ... and got hit by the same problem
>    6 months later

Please send a full report to Eric Moore and cc linux-scsi@vger.kernel.org

>  . There is still a memory leak problem (probably in tigon3 driver since others
>    reported so on kernel mailing list, and tigon3 is not geek hardware since
>    most low-end servers nowadays use either tigon3 or pro1000)

Please send a report to David Miller and Jeff Garzik and cc netdev@oss.sgi.com

>  . There have been USB storage issues, although they are now solved

OK.

>  . Since 2.6.10, the TCP task does not work anymore with OSX (2 Mbps instead
>    of 60 Mbps on a 100 Mbps wire)

Please send a full report to David Miller and cc netdev@oss.sgi.com.

Also please describe a simple way of reproducing this - I'll see if it
happens here.

> the 2.6 development model

Is dependent upon the quality and promptness of reports from testers such
as yourself, as well as the testers' preparedness to respond to the
developers' questions.

Thanks.



* Re: 2.6 upgrade overall failure report
  2005-04-16 16:20 Hubert Tonneau
@ 2005-04-16 17:59 ` Alejandro Bonilla
  2005-04-29  9:17 ` Andrew Morton
  1 sibling, 0 replies; 6+ messages in thread
From: Alejandro Bonilla @ 2005-04-16 17:59 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: linux-kernel

I usually never complain, or give negative motivation, but this is a 
reality.

>Now, what's wrong with that?
>Well, the fact is that new hardware is only supported by the latest kernel,
>so in the end, you have to upgrade, and so you get more and more complexity
>whether you like it or not.
>As an example, for servers, 2.4 is still fine, but laptops already require 2.6.
>  
>
If it weren't that my wifi card only works in 2.6, and for the
speedstep support for my laptop, I would be using 2.4 kernels.

>As a result, the complexity versus stability compromise is less and less
>suited for most real life uses.
>
>Now the problem with the kernel complexity is:
>. an ultimate implementation requires much more testing than a simple good one
>  (TCP example)
>. it makes life harder for device driver writers (tigon3 or fusion example)
>
I have returned laptops to get them exchanged for ones that have an
e100/e1000 instead of the tigon3. It's a shame that manufacturers still use
this chip on servers and laptops.


* 2.6 upgrade overall failure report
@ 2005-04-16 16:20 Hubert Tonneau
  2005-04-16 17:59 ` Alejandro Bonilla
  2005-04-29  9:17 ` Andrew Morton
  0 siblings, 2 replies; 6+ messages in thread
From: Hubert Tonneau @ 2005-04-16 16:20 UTC (permalink / raw)
  To: linux-kernel

I started to move production servers to kernel 2.6 a year ago, but the
strange situation is that one year later, most of them are back to 2.4.
This did not happen with the 2.0 -> 2.2 or 2.2 -> 2.4 upgrades.

Here are the factual technical reasons:

Right from the beginning, the core 2.6 kernel was rock solid for me, so I had
no crash to complain about, but ...

. I reported a year ago that SCSI fusion was unable to properly recover from
  tiny errors under 2.6 as opposed to 2.4 ... and got hit by the same problem
  6 months later

. There is still a memory leak problem (probably in tigon3 driver since others
  reported so on kernel mailing list, and tigon3 is not geek hardware since
  most low-end servers nowadays use either tigon3 or pro1000)

. There have been USB storage issues, although they are now solved

. Since 2.6.10, the TCP task does not work anymore with OSX (2 Mbps instead
  of 60 Mbps on a 100 Mbps wire)

Each time I get a problem with a kind of hardware, I move to what I find the
most stable for it, until I'm sure the problem has been solved in an up-to-date
kernel, and the current result is that I have more and more servers back on
2.4.


So, I could conclude with many others that the 2.6 development model is
worse than the old one, but I don't think so. I think the problem to solve
is handling the complexity, and it's a new issue, and neither the old
development model nor the new one is suited to it at the moment because the
change is more fundamental: you now have to deal with complexity versus
stability.

In the real world, there are very few situations where top performance on
the kernel side will change the situation. Most real life tasks are either
easily handled by modern hardware, so even a naive 2.0 kernel would be fine,
or too complex for the hardware, so the kernel cannot fill the gap; only new
hardware will do.
On the other hand, salespeople are very much interested in new features
or better benchmarks because it's what will make their job easier.
So, as a result of its success, the Linux kernel is probably focusing a bit
more than necessary on high performance. This is even more true since most
high profile developers are now being given high end machines (Linus a
PowerPC, kernel.org a quad opteron, etc).

Now, what's wrong with that?
Well, the fact is that new hardware is only supported by the latest kernel,
so in the end, you have to upgrade, and so you get more and more complexity
whether you like it or not.
As an example, for servers, 2.4 is still fine, but laptops already require 2.6.
As a result, the complexity versus stability compromise is less and less
suited for most real life uses.

Now the problem with kernel complexity is:
. an ultimate implementation requires much more testing than a simple good one
  (TCP example)
. it makes life harder for device driver writers (tigon3 or fusion example)

So, back to the Linux kernel development model, what is now flawed is to assume
that the last development kernel will be a good candidate for the next
stable one. It was true as long as the overall complexity was low;
it's not any more.

The second bad attitude is not requiring a new device driver to be included
in the stable kernel (I'm assuming current stable is 2.4) before it enters the
development kernel, because that makes upgrading mandatory sooner.

So, basically, we need two trees, one conservative focusing on clean simple
implementation (single kernel lock such as in 2.0, good basic algorithms,
no more), and one focusing on top performance,
with each driver being written first for the simple tree, then ported to the
advanced one.

Now, we can't say, ok there are Linux alternatives that are more conservative,
so would fit my conservative kernel definition, because we also need concerted
design between the two so that porting from conservative to top performance be
as simple as possible, and even more important, so that running on conservative
or top performance be transparent for applications.

That's the second point where the current model starts to fall short:
we need planned changes in the userland interface (/proc, ifconfig, etc)
in both kernels. Freezing the stable kernel's view from userland is
probably not a good idea either, because it will make it unusable at some
point: applications that were upgraded will no longer run fine on it,
and upgrading applications is mandatory because of security issues.
That is without mentioning improvements that can also happen in the core
conservative kernel, because finding the simplest implementation is not easy,
so takes time.

So, we end with:
. two lines (conservative and high performance),
. a set of patches in each line pending in the unstable queue
. a set of patches in each line pending in the API change queue

Now you understand why I post right now. If you are designing a patch
handling tool, then it's time to think about how to handle the whole new picture
to get back in sync with users.

Now, on the other hand, if we remain with the current development model,
my bet on what will happen is this:
For years, I've read straightforward messages such as 'Linux is more reliable
than Windows', but the question is: which Windows?
If you talk about commodity hardware, this is true because Microsoft will never
publish information such as: don't use this hardware, their driver is poor
and it will bring the whole kernel down (which in fact they can't publish since
they do not necessarily have access to the source code).
On the other hand, you can buy a reliable Windows system provided you pay
10 times the price of commodity hardware, because then the hardware provider
will have gone through serious auditing and testing of all pieces.
Now, the big problem is that, to some extent, Linux's success, with its
corollary of high profile Linux developers being given high end machines
and Linux mainly focusing on top performance, means that this might come true
in the Linux world also fairly soon.
Open development made it possible to fairly easily select the right hardware
in the 2.0 days, and get a top stable box, but with 2.6, complexity is
something you can't avoid, so the TCP issue, as an example, is something you won't
easily get rid of, and complex locking is something that makes many pieces
silently switch from absolutely stable to mostly stable, so that more
testing is needed.

