* Boot failure since 3.3-rc?
@ 2012-04-21 20:45 Sune Mølgaard
  2012-04-21 21:11 ` Yinghai Lu
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sune Mølgaard @ 2012-04-21 20:45 UTC (permalink / raw)
  To: linux-kernel

Hiya,

My old AMD Duron system (i386 with 2G RAM) has been unable to boot 
recent kernels, and I have bisected it down to:

commit 321bf4ed5ff5f7c62ef59f33b7eec5b154391f0a
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Mon Jan 30 13:57:12 2012 -0800

     drivers/base/memory.c: fix memory_dev_init() long delay

     One system with 2048g ram, reported soft lockup on recent kernel.

[snip the trace of the bug that this should fix]

     Finally it takes about 55s to create 16400 memory entries.

     Root cause: for x86_64, 2048g (with 2g hole at [2g,4g), and TOP2 will be
     2050g), will have 16400 memory block.

     find_memory_block/subsys_find_device_by_id will be expensive with that
     many entries.

     Actually, we don't need to find that memory block for BOOT path.

     Skip that finding make it get back to normal.

     [   34.466696] cpu_dev_init done
     [   35.290080] memory_dev_init done

     Also solved the delay with topology_init when sections_per_block is not
     1.

     Signed-off-by: Yinghai Lu <yinghai@kernel.org>
     Cc: Kay Sievers <kay.sievers@vrfy.org>
     Cc: Nathan Fontenot <nfont@austin.ibm.com>
     Cc: Robin Holt <holt@sgi.com>
     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

:040000 040000 95174f8192a2303d9e59e5f8523a58780b828e3e 2de57c6dc44872ac11766616f1cf05d6070b60de M	drivers
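
As a side note on the figures in the commit message above, the count of 16400 memory blocks can be reproduced with simple arithmetic. This is only a sketch: it assumes a 128 MiB memory block (the usual x86_64 section size, not stated in the commit message) and counts every block-sized slot below the 2050g top of memory, hole included. The function name is illustrative.

```python
GIB = 1 << 30
MIB = 1 << 20

# Assumption: one memory block covers 128 MiB (typical x86_64 section size).
BLOCK_SIZE = 128 * MIB


def memory_block_count(top_of_memory, block_size=BLOCK_SIZE):
    """Number of block-sized slots needed to span [0, top_of_memory)."""
    return -(-top_of_memory // block_size)  # ceiling division


# 2048g of RAM plus the 2g hole at [2g,4g) puts the top of memory at 2050g.
blocks = memory_block_count(2050 * GIB)
print("memory blocks:", blocks)

# The commit message reports about 55 s to create all entries.
print("approx. ms per entry:", 55.0 / blocks * 1000)
```

The same arithmetic gives only 16 blocks for a 2G machine like the one in this report.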

Will be happy to test patches, but compilation is obviously slow on this 
machine, so some delay might occur.

Best regards,

Sune Mølgaard

-- 
To err is human--and to blame it on a computer is even more so.
- Orben - Current Comedy



* Re: Boot failure since 3.3-rc?
  2012-04-21 20:45 Boot failure since 3.3-rc? Sune Mølgaard
@ 2012-04-21 21:11 ` Yinghai Lu
  2012-04-21 21:22   ` Sune Mølgaard
  2012-05-03 17:03 ` Sune Mølgaard
  2012-05-19  0:54 ` Sune Mølgaard
  2 siblings, 1 reply; 8+ messages in thread
From: Yinghai Lu @ 2012-04-21 21:11 UTC (permalink / raw)
  To: Sune Mølgaard; +Cc: linux-kernel

On Sat, Apr 21, 2012 at 1:45 PM, Sune Mølgaard <sune@molgaard.org> wrote:
> Hiya,
>
> My old AMD Duron system (i386 with 2G RAM) has been unable to boot recent
> kernels, and I have bisected it down to:
>
> commit 321bf4ed5ff5f7c62ef59f33b7eec5b154391f0a
> Author: Yinghai Lu <yinghai@kernel.org>
> Date:   Mon Jan 30 13:57:12 2012 -0800
>
>    drivers/base/memory.c: fix memory_dev_init() long delay
>
>    One system with 2048g ram, reported soft lockup on recent kernel.
>
> [snip the trace of the bug that this should fix]
>
>    Finally it takes about 55s to create 16400 memory entries.
>
>    Root cause: for x86_64, 2048g (with 2g hole at [2g,4g), and TOP2 will be
> 2050g), will have 16400 memory block.
>
>    find_memory_block/subsys_find_device_by_id will be expensive with that
> many entries.
>
>    Actually, we don't need to find that memory block for BOOT path.
>
>    Skip that finding make it get back to normal.
>
>    [   34.466696] cpu_dev_init done
>    [   35.290080] memory_dev_init done
>
>    Also solved the delay with topology_init when sections_per_block is not
> 1.
>
>    Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>    Cc: Kay Sievers <kay.sievers@vrfy.org>
>    Cc: Nathan Fontenot <nfont@austin.ibm.com>
>    Cc: Robin Holt <holt@sgi.com>
>    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
> :040000 040000 95174f8192a2303d9e59e5f8523a58780b828e3e
> 2de57c6dc44872ac11766616f1cf05d6070b60de M      drivers
>
> Will be happy to test patches, but compilation is obviously slow on this
> machine, so some delay might occur.

So a kernel with that commit reverted works well?

Can you post the boot log with "debug ignore_loglevel initcall_debug", both
with and without that patch reverted?

Thanks

Yinghai


* Re: Boot failure since 3.3-rc?
  2012-04-21 21:11 ` Yinghai Lu
@ 2012-04-21 21:22   ` Sune Mølgaard
  2012-04-23 10:30     ` Tilman Schmidt
  0 siblings, 1 reply; 8+ messages in thread
From: Sune Mølgaard @ 2012-04-21 21:22 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: linux-kernel

Yinghai Lu wrote:

> So kernel with reverting that commit will work well?

Being new to reporting bisect results, I stupidly forgot to realise that 
this would be the final proof. Will report back when I know.

> can you post boot with "debug ignore_loglevel initcall_debug" with and
> without reverting that patch?

Need to read up on netconsole as it's a headless machine. Would 
netconsole work through a switch, or would I need to find a cross-over 
cable?

> Thanks
>
> Yinghai

Thanks for replying. Will report back when I have something to share.

Best regards,

Sune

-- 
Being powerful is like being a lady. If you have to tell people you
are, you aren't.
- Margaret Thatcher




* Re: Boot failure since 3.3-rc?
  2012-04-21 21:22   ` Sune Mølgaard
@ 2012-04-23 10:30     ` Tilman Schmidt
  2012-05-05  8:54       ` Sune Mølgaard
  0 siblings, 1 reply; 8+ messages in thread
From: Tilman Schmidt @ 2012-04-23 10:30 UTC (permalink / raw)
  To: Sune Mølgaard; +Cc: linux-kernel


On 21.04.2012 23:22, Sune Mølgaard wrote:
> Need to read up on netconsole as it's a headless machine. Would
> netconsole work through a switch, or would I need to find a cross-over
> cable?

Netconsole works fine through a switch.
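
For reference, a minimal sketch of how the kernel command line for such a headless test boot might be assembled, combining the debug options requested earlier in the thread with a netconsole parameter. It assumes the netconsole syntax from the kernel documentation (src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac); the addresses, the interface name and the helper name are hypothetical placeholders, not values from this thread.

```python
def netconsole_param(src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac=""):
    """Build a netconsole= boot parameter.

    An empty tgt_mac leaves the target MAC unset, so the kernel falls back
    to its default (broadcast).
    """
    return "netconsole={}@{}/{},{}@{}/{}".format(
        src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac
    )


# Hypothetical addresses from the documentation range 192.0.2.0/24.
param = netconsole_param(6665, "192.0.2.10", "eth0", 6666, "192.0.2.20")

# Debug options requested earlier in the thread, plus netconsole.
cmdline = " ".join(["debug", "ignore_loglevel", "initcall_debug", param])
print(cmdline)
```

The messages are plain UDP datagrams, so any UDP listener on the target port of the receiving host should be able to capture them.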

-- 
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse side)




* Re: Boot failure since 3.3-rc?
  2012-04-21 20:45 Boot failure since 3.3-rc? Sune Mølgaard
  2012-04-21 21:11 ` Yinghai Lu
@ 2012-05-03 17:03 ` Sune Mølgaard
  2012-05-11 15:46   ` Sune Mølgaard
  2012-05-19  0:54 ` Sune Mølgaard
  2 siblings, 1 reply; 8+ messages in thread
From: Sune Mølgaard @ 2012-05-03 17:03 UTC (permalink / raw)
  To: linux-kernel

Incidentally, I had to swap a wifi card, and bisecting now leads to a 
different bad commit(?)

This is what it says is the culprit now (I wonder if I should bisect 
again, and attempt booting maybe 3 or 4 times each time):

f94edacf998516ac9d849f7bc6949a703977a7f3 is the first bad commit
commit f94edacf998516ac9d849f7bc6949a703977a7f3
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Feb 17 21:48:54 2012 -0800

     i387: move TS_USEDFPU flag from thread_info to task_struct

     This moves the bit that indicates whether a thread has ownership of the
     FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
     (called 'has_fpu') in task_struct->thread.has_fpu.

     This fixes two independent bugs at the same time:

      - changing 'thread_info->status' from the scheduler causes nasty
        problems for the other users of that variable, since it is defined to
        be thread-synchronous (that's what the "TS_" part of the naming was
        supposed to indicate).

        So perfectly valid code could (and did) do

     	ti->status |= TS_RESTORE_SIGMASK;

        and the compiler was free to do that as separate load, or and store
        instructions.  Which can cause problems with preemption, since a task
        switch could happen in between, and change the TS_USEDFPU bit. The
        change to TS_USEDFPU would be overwritten by the final store.

        In practice, this seldom happened, though, because the 'status' field
        was seldom used more than once, so gcc would generally tend to
        generate code that used a read-modify-write instruction and thus
        happened to avoid this problem - RMW instructions are naturally low
        fat and preemption-safe.

      - On x86-32, the current_thread_info() pointer would, during interrupts
        and softirqs, point to a *copy* of the real thread_info, because
        x86-32 uses %esp to calculate the thread_info address, and thus the
        separate irq (and softirq) stacks would cause these kinds of odd
        thread_info copy aliases.

        This is normally not a problem, since interrupts aren't supposed to
        look at thread information anyway (what thread is running at
        interrupt time really isn't very well-defined), but it confused the
        heck out of irq_fpu_usable() and the code that tried to squirrel
        away the FPU state.

        (It also caused untold confusion for us poor kernel developers).

     It also turns out that using 'task_struct' is actually much more natural
     for most of the call sites that care about the FPU state, since they
     tend to work with the task struct for other reasons anyway (ie
     scheduling).  And the FPU data that we are going to save/restore is
     found there too.

     Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to
     the %esp issue.

     Cc: Arjan van de Ven <arjan@linux.intel.com>
     Reported-and-tested-by: Raphael Prevost <raphael@buro.asia>
     Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
     Tested-by: Peter Anvin <hpa@zytor.com>
     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 19548f49884c9745ecb3970321ff41b244d79b97 ec8b1a02dd7ef354f1be4c68767e4353819dd5fa M	arch

For obvious reasons, this commit cannot be easily reverted, but help is 
much appreciated!
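
The first bug described in the commit message above (a split load/or/store losing a concurrent TS_USEDFPU update) can be illustrated with a small single-threaded simulation. This is a sketch only, not kernel code: the flag values, the dictionary standing in for memory, and the function names are illustrative assumptions, and preemption is modelled as a callback invoked at a chosen point.

```python
# Illustrative flag values; only their distinctness matters here.
TS_USEDFPU = 0x1
TS_RESTORE_SIGMASK = 0x8


def scheduler_sets_usedfpu(mem):
    """Stand-in for a task switch that updates the status word."""
    mem["status"] |= TS_USEDFPU


def set_sigmask_split(mem, preempt=None):
    """'status |= flag' compiled as separate load, or and store."""
    reg = mem["status"]            # load
    if preempt is not None:
        preempt(mem)               # task switch lands between load and store
    reg |= TS_RESTORE_SIGMASK      # or
    mem["status"] = reg            # store overwrites the scheduler's change


def set_sigmask_rmw(mem, preempt=None):
    """Same update modelled as one indivisible read-modify-write."""
    mem["status"] |= TS_RESTORE_SIGMASK
    if preempt is not None:
        preempt(mem)               # can only land before or after, never inside


split = {"status": 0}
set_sigmask_split(split, scheduler_sets_usedfpu)

rmw = {"status": 0}
set_sigmask_rmw(rmw, scheduler_sets_usedfpu)

print("split sequence kept TS_USEDFPU:", bool(split["status"] & TS_USEDFPU))
print("single RMW kept TS_USEDFPU:", bool(rmw["status"] & TS_USEDFPU))
```

In the split variant the final store writes back a value computed from the stale load, which is the overwrite the commit message describes.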

/sune

-- 
Unix is not an 'a-ha' experience, it is more of a 'holy-shit' experience.
- Colin McFadyen




* Re: Boot failure since 3.3-rc?
  2012-04-23 10:30     ` Tilman Schmidt
@ 2012-05-05  8:54       ` Sune Mølgaard
  0 siblings, 0 replies; 8+ messages in thread
From: Sune Mølgaard @ 2012-05-05  8:54 UTC (permalink / raw)
  To: Tilman Schmidt; +Cc: linux-kernel

Tilman Schmidt wrote:
> On 21.04.2012 23:22, Sune Mølgaard wrote:
>> Need to read up on netconsole as it's a headless machine. Would
>> netconsole work through a switch, or would I need to find a cross-over
>> cable?
>
> Netconsole works fine through a switch.
>

Thank you. If, as per my last post, the offending commit is instead 
about i387, my guess would be that the kernel hangs immediately, way 
before init'ing the NIC.

The machine is rather old, but if it has sufficient NVRAM, is there a 
way to log failures (perhaps via grub) to there, and then retrieve them 
when booting a known good kernel?

Best,

Sune

-- 
Border relations between Canada and Mexico have never been better.
- G. W. Bush




* Re: Boot failure since 3.3-rc?
  2012-05-03 17:03 ` Sune Mølgaard
@ 2012-05-11 15:46   ` Sune Mølgaard
  0 siblings, 0 replies; 8+ messages in thread
From: Sune Mølgaard @ 2012-05-11 15:46 UTC (permalink / raw)
  To: linux-kernel

Sune Mølgaard wrote:
> Incidentally, I had to swap a wifi card, and bisecting now leads to a
> different bad commit(?)
>
> This is what it says is the culprit now (I wonder if I should bisect
> again, and attempt booting maybe 3 or 4 times each time):
>
> f94edacf998516ac9d849f7bc6949a703977a7f3 is the first bad commit
> commit f94edacf998516ac9d849f7bc6949a703977a7f3

Would anyone happen to know if this has been backported to the 3.0-series?

Just tried booting the latest Ubuntu 11.10 kernel (based on 3.0), which
also failed. That should, naturally, be logged with the Ubuntu guys (and 
it will be), but until then, if someone can positively say that the 
above patch was backported, it might lend credence to the assumption 
that it is indeed the culprit.

I have, btw., ordered a small display to hook up to the machine in order 
to see where it fails.

Will report back...

Best regards,

Sune Mølgaard

-- 
First things first, but not necessarily in that order.
- Doctor Who




* Re: Boot failure since 3.3-rc?
  2012-04-21 20:45 Boot failure since 3.3-rc? Sune Mølgaard
  2012-04-21 21:11 ` Yinghai Lu
  2012-05-03 17:03 ` Sune Mølgaard
@ 2012-05-19  0:54 ` Sune Mølgaard
  2 siblings, 0 replies; 8+ messages in thread
From: Sune Mølgaard @ 2012-05-19  0:54 UTC (permalink / raw)
  To: linux-kernel

Problem solved.

At some point between 3.3-rc3 and 3.3-rc4, something has made the uuids 
in /dev/disk/by-uuid/ different from the uuids that are embedded in md 
raid member v 0.90 metadata.

Failure at my end was caused by my adding (mdN) entries to
/boot/grub/device.map at some time in the past, mapping them to the
entries in /dev/disk/by-uuid/.

Changing said entries to point to symlinks in /dev/disk/by-id solved 
this for me.
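
For anyone wanting to make the same change, a minimal sketch for listing the symlinks under /dev/disk/by-id together with the device nodes they resolve to, so that suitable entries can be picked for device.map. The helper name is hypothetical, and which by-id names exist depends entirely on the system's hardware and udev rules.

```python
import os

BY_ID = "/dev/disk/by-id"


def resolve_symlinks(directory):
    """Map each symlink name in `directory` to its fully resolved target."""
    mapping = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            mapping[name] = os.path.realpath(path)
    return mapping


# Only attempt the listing where the directory actually exists.
if os.path.isdir(BY_ID):
    for name, target in resolve_symlinks(BY_ID).items():
        print(name, "->", target)
```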

The problem was compounded by the use of an initrd. As one will readily
surmise post hoc, a kernel with no other problems will boot if the
corresponding initrd image was built under a kernel with the old
behaviour wrt. /dev/disk/by-uuid mappings, thus prompting one to "git
bisect good". The next kernel, built while booted into that kernel and
with its initrd image built under it, will then fail if it exhibits the
new behaviour, leading to an off-by-one in the binary search of git
bisect.
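
The effect can be sketched with a toy model. It is one possible reading of the paragraph above, not a reconstruction of the actual bisect run: commits are numbered 0..15, a test boot succeeds only if the initrd was built under a kernel with the old behaviour, a successful boot makes the tested kernel the running one, and a failed boot sends the tester back to a known-good fallback kernel. All names and numbers are hypothetical.

```python
def bisect_first_bad(n_commits, boots):
    """Plain binary search: commit 0 is known good, the last is known bad."""
    good, bad = 0, n_commits - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots(mid):
            good = mid
        else:
            bad = mid
    return bad


class StaleInitrdTester:
    """Boot verdict depends on the kernel the initrd was built under."""

    def __init__(self, first_new, fallback=0):
        self.first_new = first_new  # first commit with the new behaviour
        self.fallback = fallback    # known-good kernel used after a failed boot
        self.running = fallback     # kernel currently booted

    def __call__(self, commit):
        # The initrd for `commit` is built under the running kernel, so the
        # verdict reflects that kernel rather than the commit under test.
        ok = self.running < self.first_new
        self.running = commit if ok else self.fallback
        return ok


N_COMMITS = 16
FIRST_NEW = 5

accurate = bisect_first_bad(N_COMMITS, lambda c: c < FIRST_NEW)
skewed = bisect_first_bad(N_COMMITS, StaleInitrdTester(FIRST_NEW))

print("accurate verdicts point at commit", accurate)
print("stale-initrd verdicts point at commit", skewed)
```

In this model each verdict describes the kernel from the previous step, so the search converges on a commit that is unrelated to the real change.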

This mail is meant jointly as a "problem solved" message to this list, 
and as explanation for posterity if someone else happens to run into 
similar problems.

Initial confusion was mainly due to the lack of an attached monitor;
acquiring one showed the error to be a failure to mount the root fs.
Even if that might have been seen via netconsole, I decided to order a
monitor before realising that some of the kernels would label my two
ethernet cards differently from what udev picked up at one time and
made permanent...

In closing: Thank you all for your input and sorry for the noise.

Best regards,

Sune Mølgaard

-- 
Nothing exists except atoms and empty space; everything else is
opinion.
- Democritus




