All of lore.kernel.org
 help / color / mirror / Atom feed
* oops at mount
@ 2013-05-30 11:17 Papp Tamas
  2013-05-30 12:32 ` Josef Bacik
  2013-05-30 20:08 ` Stefan Behrens
  0 siblings, 2 replies; 47+ messages in thread
From: Papp Tamas @ 2013-05-30 11:17 UTC (permalink / raw)
  To: linux-btrfs

hi All,

I'm new on the list.

System:
Distributor ID:	Ubuntu
Description:	Ubuntu 13.04
Release:	13.04
Codename:	raring

Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

The symptom is the same with Saucy 3.9 kernel.

ii  btrfs-tools                               0.20~git20130524~650e656-0daily13~raring1 amd64 
  Checksumming Copy on Write Filesystem utilities


I also tried btrfs-tools v0.19 before with no luck.


$ btrfsck --repair /dev/sda1
enabling repair mode
parent transid verify failed on 430612480 wanted 81016 found 81011
parent transid verify failed on 430612480 wanted 81016 found 81011
parent transid verify failed on 430612480 wanted 81016 found 81011
parent transid verify failed on 430612480 wanted 81016 found 81011
Ignoring transid failure
Checking filesystem on /dev/sda1
UUID: deed1ffb-27bb-4555-b5ce-8a3c8ee5612c
checking extents
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
checking csums
checking root refs
found 67570520064 bytes used err is 0
total csum bytes: 65168792
total tree bytes: 789745664
total fs tree bytes: 651145216
total extent tree bytes: 50372608
btree space waste bytes: 192929190
file data blocks allocated: 80764424192
  referenced 69347667968
Btrfs v0.20-rc1


If I mount, I get an oops message. The machine is not completely freezed, but I have to reboot it to 
be able to use it again.


    69.257107] btrfsck[2703]: segfault at 7ff069802710 ip 00007ff063ceecbd sp 00007fff9bb5db70 error 
4 in libc-2.17.so[7ff063c6f000+1be000]
[  480.799981] device fsid deed1ffb-27bb-4555-b5ce-8a3c8ee5612c devid 1 transid 81010 /dev/sda1
[  480.802507] btrfs: disk space caching is enabled
[  480.851534] Btrfs detected SSD devices, enabling SSD mode
[  480.863245] btrfs bad tree block start 0 413601792
[  480.863320] btrfs bad tree block start 0 413601792
[  480.863389] ------------[ cut here ]------------
[  480.863426] Kernel BUG at ffffffffa03d3b6a [verbose debug info unavailable]
[  480.863459] invalid opcode: 0000 [#1] SMP
[  480.863490] Modules linked in: ip6table_filter(F) ip6_tables(F) xt_state(F) ipt_REJECT(F) 
xt_CHECKSUM(F) iptable_mangle(F) xt_tcpudp(F) iptable_filter(F) ipt_MASQUERADE(F) iptable_nat(F) 
nf_conntrack_ipv4(F) nf_defrag_ipv4(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack(F) ip_tables(F) 
x_tables(F) bridge(F) stp(F) llc(F) pci_stub vboxpci(OF) vboxnetadp(OF) vboxnetflt(OF) vboxdrv(OF) 
rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_idt binfmt_misc(F) qcserial usb_wwan usbserial 
pata_pcmcia arc4(F) hid_generic coretemp kvm_intel iwldvm kvm mac80211 ghash_clmulni_intel(F) 
aesni_intel(F) aes_x86_64(F) xts(F) lrw(F) gf128mul(F) ablk_helper(F) cryptd(F) usbhid hid joydev(F) 
tpm_infineon hp_wmi sparse_keymap uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core 
videodev pcmcia microcode(F) btusb bluetooth psmouse(F) serio_raw(F) intel_ips btrfs(F) tpm_tis 
libcrc32c(F) zlib_deflate(F) sdhci_pci snd_hda_intel sdhci snd_hda_codec snd_hwdep(F) snd_pcm(F) 
firewire_ohci snd_page_alloc(F) firewire_core snd_seq_midi(F) snd_seq_midi_event(F) crc_itu_t(F) 
yenta_socket pcmcia_rsrc i915 pcmcia_core snd_rawmidi(F) drm_kms_helper snd_seq(F) hp_accel drm 
lis3lv02d snd_seq_device(F) input_polldev snd_timer(F) wmi iwlwifi snd(F) video(F) mac_hid cfg80211 
lpc_ich i2c_algo_bit mei e1000e(F) soundcore(F) lp(F) parport(F) ahci(F) libahci(F)
[  480.864322] CPU 3
[  480.864338] Pid: 5550, comm: mount Tainted: GF          O 3.8.0-19-generic #30-Ubuntu 
Hewlett-Packard HP EliteBook 2540p/7008
[  480.864386] RIP: 0010:[<ffffffffa03d3b6a>]  [<ffffffffa03d3b6a>] 
btrfs_recover_log_trees+0x23a/0x390 [btrfs]
[  480.864474] RSP: 0018:ffff88012ad41b40  EFLAGS: 00010282
[  480.864499] RAX: 00000000fffffffb RBX: ffff88018b91c000 RCX: 00000001801c001b
[  480.864531] RDX: 00000001801c001c RSI: 00000000801c001b RDI: ffff8801b20b3900
[  480.864563] RBP: ffff88012ad41bf0 R08: 0000000000000000 R09: 0000000000000001
[  480.864594] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88014fc0a5a0
[  480.864625] R13: ffff88011d2f0e40 R14: ffff88018b91a800 R15: ffff8801ab3ea000
[  480.864656] FS:  00007fb531818840(0000) GS:ffff8801bbcc0000(0000) knlGS:0000000000000000
[  480.864693] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  480.864718] CR2: 00000000006a5000 CR3: 000000016800b000 CR4: 00000000000007e0
[  480.864750] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  480.864781] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  480.864813] Process mount (pid: 5550, threadinfo ffff88012ad40000, task ffff880128522e80)
[  480.864847] Stack:
[  480.864860]  ffff8801b0e5ce40 ffff88012ad41b98 fffffffa00000000 ffffff84ffffffff
[  480.864905]  fffffaffffffffff 010684ffffffffff 0106000000000000 ff84000000000000
[  480.864947]  faffffffffffffff 84ffffffffffffff 0000000000000106 0000000000000000
[  480.864990] Call Trace:
[  480.865019]  [<ffffffffa03d1a10>] ? fixup_inode_link_counts+0x150/0x150 [btrfs]
[  480.865061]  [<ffffffffa0398a2c>] open_ctree+0x171c/0x1da0 [btrfs]
[  480.865095]  [<ffffffff81331461>] ? disk_name+0x61/0xc0
[  480.865126]  [<ffffffffa0371a83>] btrfs_mount+0x613/0x750 [btrfs]
[  480.865160]  [<ffffffff81197c43>] mount_fs+0x43/0x1b0
[  480.865187]  [<ffffffff811b2457>] ? alloc_vfsmnt+0xd7/0x1b0
[  480.865214]  [<ffffffff811b25e4>] vfs_kern_mount+0x74/0x110
[  480.865240]  [<ffffffff811b495f>] do_mount+0x21f/0xac0
[  480.865270]  [<ffffffff8114a46b>] ? strndup_user+0x5b/0x80
[  480.865296]  [<ffffffff811b528e>] sys_mount+0x8e/0xe0
[  480.865323]  [<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[  480.865352] Code: ef e8 bb 9e ff ff 85 c0 75 21 83 7d b8 02 0f 85 ad fe ff ff 48 8b 75 c0 4c 89 
e2 4c 89 ef e8 5e dd ff ff 85 c0 0f 84 96 fe ff ff <0f> 0b 0f 1f 40 00 4c 89 e7 e8 b8 13 fa ff 44 8b 
5d b4 45 85 db
[  480.865680] RIP  [<ffffffffa03d3b6a>] btrfs_recover_log_trees+0x23a/0x390 [btrfs]
[  480.865743]  RSP <ffff88012ad41b40>
[  481.887687] ---[ end trace 6d9b536c1234c5bc ]---



The storage is an Intel X18-M/X25-M/X25-V G2 SSD and a similar error was there a couple of weeks ago.
It's a root partition with 3 subvolumes. Now I use a secondary system on the drive.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
   4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
   5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       4
   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9718
  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3532
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       486
225 Host_Writes_32MiB       0x0030   200   200   000    Old_age   Offline      -       172848
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       1823
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       4
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1156268736
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Reserved (0x80)     Completed without error       00%      8925         -
# 2  Reserved (0x18)     Completed without error       00%      8921         -
# 3  Vendor (0xb8)       Completed without error       00%      8920         -
# 4  Reserved (0x30)     Completed without error       00%      8065         -
# 5  Vendor (0xd0)       Completed without error       00%      3530         -
# 6  Offline             Completed without error       00%        38         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever 
been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
     1        0        0  Not_testing
     2        0        0  Not_testing
     3        0        0  Not_testing
     4        0        0  Not_testing
     5        0        0  Not_testing


Is it the SSD or rather a bug?


Thanks,
tamas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 11:17 oops at mount Papp Tamas
@ 2013-05-30 12:32 ` Josef Bacik
  2013-05-30 12:55   ` Stefan Behrens
  2013-05-31 14:55   ` Papp Tamas
  2013-05-30 20:08 ` Stefan Behrens
  1 sibling, 2 replies; 47+ messages in thread
From: Josef Bacik @ 2013-05-30 12:32 UTC (permalink / raw)
  To: Papp Tamas; +Cc: linux-btrfs

On Thu, May 30, 2013 at 05:17:06AM -0600, Papp Tamas wrote:
> hi All,
> 
> I'm new on the list.
> 
> System:
> Distributor ID:	Ubuntu
> Description:	Ubuntu 13.04
> Release:	13.04
> Codename:	raring
> 
> Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> 
> The symptom is the same with Saucy 3.9 kernel.

Can you try btrfs-next

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git

if it's still not fixed please file a bug at bugzilla.kernel.org and make sure
the component is set to btrfs.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 12:32 ` Josef Bacik
@ 2013-05-30 12:55   ` Stefan Behrens
  2013-05-30 14:03     ` Chris Mason
  2013-06-03 11:56     ` Papp Tamas
  2013-05-31 14:55   ` Papp Tamas
  1 sibling, 2 replies; 47+ messages in thread
From: Stefan Behrens @ 2013-05-30 12:55 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Papp Tamas, linux-btrfs

On Thu, 30 May 2013 08:32:35 -0400, Josef Bacik wrote:
> On Thu, May 30, 2013 at 05:17:06AM -0600, Papp Tamas wrote:
>> hi All,
>>
>> I'm new on the list.
>>
>> System:
>> Distributor ID:	Ubuntu
>> Description:	Ubuntu 13.04
>> Release:	13.04
>> Codename:	raring
>>
>> Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>
>> The symptom is the same with Saucy 3.9 kernel.
> 
> Can you try btrfs-next
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git
> 
> if it's still not fixed please file a bug at bugzilla.kernel.org and make sure
> the component is set to btrfs.  Thanks,

Papp is using an Intel X18-M/X25-M/X25-V G2 SSD. At least with an Intel
X25 SSD that identifies itself with "INTEL SSDSA2M080" and on one with
the ID "INTEL SSDSA2M040", I've tested whether they honor the flush
request. And these two SSDs don't do so, they ignore it. If you cut the
power after a flush request completes, the data that was written before
the flush request is gone, the write cache was _not_ flushed.

You can only disable the write cache during/after every boot "hdparm -W
0 /dev/sd..." (which reduces the SSDs write speed to about 4 MB/s), or
avoid such SSDs, or prepare to restore from backup occasionally.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 12:55   ` Stefan Behrens
@ 2013-05-30 14:03     ` Chris Mason
  2013-05-30 14:59       ` Stefan Behrens
  2013-06-03 11:56     ` Papp Tamas
  1 sibling, 1 reply; 47+ messages in thread
From: Chris Mason @ 2013-05-30 14:03 UTC (permalink / raw)
  To: Stefan Behrens, Josef Bacik; +Cc: Papp Tamas, linux-btrfs

Quoting Stefan Behrens (2013-05-30 08:55:58)
> On Thu, 30 May 2013 08:32:35 -0400, Josef Bacik wrote:
> > On Thu, May 30, 2013 at 05:17:06AM -0600, Papp Tamas wrote:
> >> hi All,
> >>
> >> I'm new on the list.
> >>
> >> System:
> >> Distributor ID:      Ubuntu
> >> Description: Ubuntu 13.04
> >> Release:     13.04
> >> Codename:    raring
> >>
> >> Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> The symptom is the same with Saucy 3.9 kernel.
> > 
> > Can you try btrfs-next
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git
> > 
> > if it's still not fixed please file a bug at bugzilla.kernel.org and make sure
> > the component is set to btrfs.  Thanks,
> 
> Papp is using an Intel X18-M/X25-M/X25-V G2 SSD. At least with an Intel
> X25 SSD that identifies itself with "INTEL SSDSA2M080" and on one with
> the ID "INTEL SSDSA2M040", I've tested whether they honor the flush
> request. And these two SSDs don't do so, they ignore it. If you cut the
> power after a flush request completes, the data that was written before
> the flush request is gone, the write cache was _not_ flushed.
> 
> You can only disable the write cache during/after every boot "hdparm -W
> 0 /dev/sd..." (which reduces the SSDs write speed to about 4 MB/s), or
> avoid such SSDs, or prepare to restore from backup occasionally.

Hi Stefan,

How did you verify this?  I'm sure intel will want to hear about it if
we can reproduce on all filesystems.

-chris


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 14:03     ` Chris Mason
@ 2013-05-30 14:59       ` Stefan Behrens
  2013-05-30 16:37         ` Chris Mason
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Behrens @ 2013-05-30 14:59 UTC (permalink / raw)
  To: Chris Mason; +Cc: Josef Bacik, Papp Tamas, linux-btrfs

On Thu, 30 May 2013 10:03:29 -0400, Chris Mason wrote:
> Quoting Stefan Behrens (2013-05-30 08:55:58)
>> Papp is using an Intel X18-M/X25-M/X25-V G2 SSD. At least with an Intel
>> X25 SSD that identifies itself with "INTEL SSDSA2M080" and on one with
>> the ID "INTEL SSDSA2M040", I've tested whether they honor the flush
>> request. And these two SSDs don't do so, they ignore it. If you cut the
>> power after a flush request completes, the data that was written before
>> the flush request is gone, the write cache was _not_ flushed.
>>
>> You can only disable the write cache during/after every boot "hdparm -W
>> 0 /dev/sd..." (which reduces the SSDs write speed to about 4 MB/s), or
>> avoid such SSDs, or prepare to restore from backup occasionally.
> 
> Hi Stefan,
> 
> How did you verify this?  I'm sure intel will want to hear about it if
> we can reproduce on all filesystems.
> 
> -chris
> 

We have written a kernel module that (among others) is able to write 4KB
block of random data at random locations on an SSD, and in a second step
to read and verify that data.

The test procedure to check SSDs is:
1. Write 4KB blocks of random data to random locations on the disk. Send
a submit_bio(REQ_FLUSH) after each 4KB block. Log the completion of the
write request and of the flush request together with the result value.
2. Somewhere in the middle of operation, switch off all power, drive
presence and SAS data pins between the SSD and the SATA host controller.
3. Wait some time, afterwards enable the connection between the SSD and
the host controller again.
4. Read back the 4KB blocks of random data at random locations using the
same seed value that was used to generate the contents and location when
the blocks were written. Verify the data, log whether the verification
succeeded or failed.
5. Compare the log of the write and flush request completion with the
one of the read and verify process.

SSDs that honor the flush request don't cause verify errors for blocks
where the write bio and the flush bio completed successfully. Those two
Intel SSDs that I mentioned failed this test. Other Intel SSD types
succeeded the test.

Maybe a firmware update would fix this issue, I suppose it will, I have
never tried it. My intention was not to blame the SSD manufacturer, in
fact, I like their SSDs very much and buy and use them frequently. I
just wanted to prevent Josef from the headache to question the Btrfs
implementation. The issue that Papp described looks just like a power
failure in conjunction with a storage device that ignores flush requests.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 14:59       ` Stefan Behrens
@ 2013-05-30 16:37         ` Chris Mason
  0 siblings, 0 replies; 47+ messages in thread
From: Chris Mason @ 2013-05-30 16:37 UTC (permalink / raw)
  To: Stefan Behrens; +Cc: Josef Bacik, Papp Tamas, linux-btrfs

Quoting Stefan Behrens (2013-05-30 10:59:59)
> On Thu, 30 May 2013 10:03:29 -0400, Chris Mason wrote:
> > Quoting Stefan Behrens (2013-05-30 08:55:58)
> >> Papp is using an Intel X18-M/X25-M/X25-V G2 SSD. At least with an Intel
> >> X25 SSD that identifies itself with "INTEL SSDSA2M080" and on one with
> >> the ID "INTEL SSDSA2M040", I've tested whether they honor the flush
> >> request. And these two SSDs don't do so, they ignore it. If you cut the
> >> power after a flush request completes, the data that was written before
> >> the flush request is gone, the write cache was _not_ flushed.
> >>
> >> You can only disable the write cache during/after every boot "hdparm -W
> >> 0 /dev/sd..." (which reduces the SSDs write speed to about 4 MB/s), or
> >> avoid such SSDs, or prepare to restore from backup occasionally.
> > 
> > Hi Stefan,
> > 
> > How did you verify this?  I'm sure intel will want to hear about it if
> > we can reproduce on all filesystems.
> > 
> > -chris
> > 
> 
> We have written a kernel module that (among others) is able to write 4KB
> block of random data at random locations on an SSD, and in a second step
> to read and verify that data.
> 
> The test procedure to check SSDs is:
> 1. Write 4KB blocks of random data to random locations on the disk. Send
> a submit_bio(REQ_FLUSH) after each 4KB block. Log the completion of the
> write request and of the flush request together with the result value.
> 2. Somewhere in the middle of operation, switch off all power, drive
> presence and SAS data pins between the SSD and the SATA host controller.
> 3. Wait some time, afterwards enable the connection between the SSD and
> the host controller again.
> 4. Read back the 4KB blocks of random data at random locations using the
> same seed value that was used to generate the contents and location when
> the blocks were written. Verify the data, log whether the verification
> succeeded or failed.
> 5. Compare the log of the write and flush request completion with the
> one of the read and verify process.
> 
> SSDs that honor the flush request don't cause verify errors for blocks
> where the write bio and the flush bio completed successfully. Those two
> Intel SSDs that I mentioned failed this test. Other Intel SSD types
> succeeded the test.
> 
> Maybe a firmware update would fix this issue, I suppose it will, I have
> never tried it. My intention was not to blame the SSD manufacturer, in
> fact, I like their SSDs very much and buy and use them frequently. I
> just wanted to prevent Josef from the headache to question the Btrfs
> implementation. The issue that Papp described looks just like a power
> failure in conjunction with a storage device that ignores flush requests.

It's definitely useful information.  The gen2's did have some problems
(mine failed as well) but I didn't realize how bad the powercut handling
was.

-chris


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 11:17 oops at mount Papp Tamas
  2013-05-30 12:32 ` Josef Bacik
@ 2013-05-30 20:08 ` Stefan Behrens
  1 sibling, 0 replies; 47+ messages in thread
From: Stefan Behrens @ 2013-05-30 20:08 UTC (permalink / raw)
  To: Papp Tamas; +Cc: linux-btrfs

On 05/30/2013 13:17, Papp Tamas wrote:
> hi All,
>
> I'm new on the list.
>
> System:
> Distributor ID:    Ubuntu
> Description:    Ubuntu 13.04
> Release:    13.04
> Codename:    raring
>
> Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013
> x86_64 x86_64 x86_64 GNU/Linux
>
> The symptom is the same with Saucy 3.9 kernel.
>
> ii  btrfs-tools
> 0.20~git20130524~650e656-0daily13~raring1 amd64  Checksumming Copy on
> Write Filesystem utilities
>
>
> I also tried btrfs-tools v0.19 before with no luck.
>
>
> $ btrfsck --repair /dev/sda1
> enabling repair mode
> parent transid verify failed on 430612480 wanted 81016 found 81011
> parent transid verify failed on 430612480 wanted 81016 found 81011
> parent transid verify failed on 430612480 wanted 81016 found 81011
> parent transid verify failed on 430612480 wanted 81016 found 81011
> Ignoring transid failure
> Checking filesystem on /dev/sda1
> UUID: deed1ffb-27bb-4555-b5ce-8a3c8ee5612c
> checking extents
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> checking csums
> checking root refs
> found 67570520064 bytes used err is 0
> total csum bytes: 65168792
> total tree bytes: 789745664
> total fs tree bytes: 651145216
> total extent tree bytes: 50372608
> btree space waste bytes: 192929190
> file data blocks allocated: 80764424192
>   referenced 69347667968
> Btrfs v0.20-rc1
>
>
> If I mount, I get an oops message. The machine is not completely
> freezed, but I have to reboot it to be able to use it again.
>
>
>     69.257107] btrfsck[2703]: segfault at 7ff069802710 ip
> 00007ff063ceecbd sp 00007fff9bb5db70 error 4 in
> libc-2.17.so[7ff063c6f000+1be000]
> [  480.799981] device fsid deed1ffb-27bb-4555-b5ce-8a3c8ee5612c devid 1
> transid 81010 /dev/sda1
> [  480.802507] btrfs: disk space caching is enabled
> [  480.851534] Btrfs detected SSD devices, enabling SSD mode
> [  480.863245] btrfs bad tree block start 0 413601792
> [  480.863320] btrfs bad tree block start 0 413601792
> [  480.863389] ------------[ cut here ]------------
> [  480.863426] Kernel BUG at ffffffffa03d3b6a [verbose debug info
> unavailable]
> [  480.863459] invalid opcode: 0000 [#1] SMP
> [  480.863490] Modules linked in: ip6table_filter(F) ip6_tables(F)
> xt_state(F) ipt_REJECT(F) xt_CHECKSUM(F) iptable_mangle(F) xt_tcpudp(F)
> iptable_filter(F) ipt_MASQUERADE(F) iptable_nat(F) nf_conntrack_ipv4(F)
> nf_defrag_ipv4(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack(F) ip_tables(F)
> x_tables(F) bridge(F) stp(F) llc(F) pci_stub vboxpci(OF) vboxnetadp(OF)
> vboxnetflt(OF) vboxdrv(OF) rfcomm bnep snd_hda_codec_hdmi
> snd_hda_codec_idt binfmt_misc(F) qcserial usb_wwan usbserial pata_pcmcia
> arc4(F) hid_generic coretemp kvm_intel iwldvm kvm mac80211
> ghash_clmulni_intel(F) aesni_intel(F) aes_x86_64(F) xts(F) lrw(F)
> gf128mul(F) ablk_helper(F) cryptd(F) usbhid hid joydev(F) tpm_infineon
> hp_wmi sparse_keymap uvcvideo videobuf2_vmalloc videobuf2_memops
> videobuf2_core videodev pcmcia microcode(F) btusb bluetooth psmouse(F)
> serio_raw(F) intel_ips btrfs(F) tpm_tis libcrc32c(F) zlib_deflate(F)
> sdhci_pci snd_hda_intel sdhci snd_hda_codec snd_hwdep(F) snd_pcm(F)
> firewire_ohci snd_page_alloc(F) firewire_core snd_seq_midi(F)
> snd_seq_midi_event(F) crc_itu_t(F) yenta_socket pcmcia_rsrc i915
> pcmcia_core snd_rawmidi(F) drm_kms_helper snd_seq(F) hp_accel drm
> lis3lv02d snd_seq_device(F) input_polldev snd_timer(F) wmi iwlwifi
> snd(F) video(F) mac_hid cfg80211 lpc_ich i2c_algo_bit mei e1000e(F)
> soundcore(F) lp(F) parport(F) ahci(F) libahci(F)
> [  480.864322] CPU 3
> [  480.864338] Pid: 5550, comm: mount Tainted: GF          O
> 3.8.0-19-generic #30-Ubuntu Hewlett-Packard HP EliteBook 2540p/7008
> [  480.864386] RIP: 0010:[<ffffffffa03d3b6a>]  [<ffffffffa03d3b6a>]
> btrfs_recover_log_trees+0x23a/0x390 [btrfs]
> [  480.864474] RSP: 0018:ffff88012ad41b40  EFLAGS: 00010282
> [  480.864499] RAX: 00000000fffffffb RBX: ffff88018b91c000 RCX:
> 00000001801c001b
> [  480.864531] RDX: 00000001801c001c RSI: 00000000801c001b RDI:
> ffff8801b20b3900
> [  480.864563] RBP: ffff88012ad41bf0 R08: 0000000000000000 R09:
> 0000000000000001
> [  480.864594] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffff88014fc0a5a0
> [  480.864625] R13: ffff88011d2f0e40 R14: ffff88018b91a800 R15:
> ffff8801ab3ea000
> [  480.864656] FS:  00007fb531818840(0000) GS:ffff8801bbcc0000(0000)
> knlGS:0000000000000000
> [  480.864693] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  480.864718] CR2: 00000000006a5000 CR3: 000000016800b000 CR4:
> 00000000000007e0
> [  480.864750] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  480.864781] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [  480.864813] Process mount (pid: 5550, threadinfo ffff88012ad40000,
> task ffff880128522e80)
> [  480.864847] Stack:
> [  480.864860]  ffff8801b0e5ce40 ffff88012ad41b98 fffffffa00000000
> ffffff84ffffffff
> [  480.864905]  fffffaffffffffff 010684ffffffffff 0106000000000000
> ff84000000000000
> [  480.864947]  faffffffffffffff 84ffffffffffffff 0000000000000106
> 0000000000000000
> [  480.864990] Call Trace:
> [  480.865019]  [<ffffffffa03d1a10>] ?
> fixup_inode_link_counts+0x150/0x150 [btrfs]
> [  480.865061]  [<ffffffffa0398a2c>] open_ctree+0x171c/0x1da0 [btrfs]
> [  480.865095]  [<ffffffff81331461>] ? disk_name+0x61/0xc0
> [  480.865126]  [<ffffffffa0371a83>] btrfs_mount+0x613/0x750 [btrfs]
> [  480.865160]  [<ffffffff81197c43>] mount_fs+0x43/0x1b0
> [  480.865187]  [<ffffffff811b2457>] ? alloc_vfsmnt+0xd7/0x1b0
> [  480.865214]  [<ffffffff811b25e4>] vfs_kern_mount+0x74/0x110
> [  480.865240]  [<ffffffff811b495f>] do_mount+0x21f/0xac0
> [  480.865270]  [<ffffffff8114a46b>] ? strndup_user+0x5b/0x80
> [  480.865296]  [<ffffffff811b528e>] sys_mount+0x8e/0xe0
> [  480.865323]  [<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
> [  480.865352] Code: ef e8 bb 9e ff ff 85 c0 75 21 83 7d b8 02 0f 85 ad
> fe ff ff 48 8b 75 c0 4c 89 e2 4c 89 ef e8 5e dd ff ff 85 c0 0f 84 96 fe
> ff ff <0f> 0b 0f 1f 40 00 4c 89 e7 e8 b8 13 fa ff 44 8b 5d b4 45 85 db
> [  480.865680] RIP  [<ffffffffa03d3b6a>]
> btrfs_recover_log_trees+0x23a/0x390 [btrfs]
> [  480.865743]  RSP <ffff88012ad41b40>
> [  481.887687] ---[ end trace 6d9b536c1234c5bc ]---
>
>
>
> The storage is an Intel X18-M/X25-M/X25-V G2 SSD and a similar error was
> there a couple of weeks ago.
> It's a root partition with 3 subvolumes. Now I use a secondary system on
> the drive.
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
> UPDATED  WHEN_FAILED RAW_VALUE
>    3 Spin_Up_Time            0x0020   100   100   000    Old_age
> Offline      -       0
>    4 Start_Stop_Count        0x0030   100   100   000    Old_age
> Offline      -       0
>    5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age
> Always       -       4
>    9 Power_On_Hours          0x0032   100   100   000    Old_age
> Always       -       9718
>   12 Power_Cycle_Count       0x0032   100   100   000    Old_age
> Always       -       3532
> 192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age
> Always       -       486
> 225 Host_Writes_32MiB       0x0030   200   200   000    Old_age
> Offline      -       172848
> 226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age
> Always       -       1823
> 227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age
> Always       -       4
> 228 Workload_Minutes        0x0032   100   100   000    Old_age
> Always       -       1156268736
> 232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail
> Always       -       0
> 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age
> Always       -       0
> 184 End-to-End_Error        0x0033   100   100   099    Pre-fail
> Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Reserved (0x80)     Completed without error       00%
> 8925         -
> # 2  Reserved (0x18)     Completed without error       00%
> 8921         -
> # 3  Vendor (0xb8)       Completed without error       00%
> 8920         -
> # 4  Reserved (0x30)     Completed without error       00%
> 8065         -
> # 5  Vendor (0xd0)       Completed without error       00%
> 3530         -
> # 6  Offline             Completed without error       00%
> 38         -
>
> Note: selective self-test log revision number (0) not 1 implies that no
> selective self-test has ever been run
> SMART Selective self-test log data structure revision number 0
> Note: revision number not 1 implies that no selective self-test has ever
> been run
>   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>      1        0        0  Not_testing
>      2        0        0  Not_testing
>      3        0        0  Not_testing
>      4        0        0  Not_testing
>      5        0        0  Not_testing
>
>
> Is it the SSD or rather a bug?

Try the procedures that are described in the Wiki:

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_can.27t_mount_my_filesystem.2C_and_I_get_a_kernel_oops.21

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#My_filesystem_won.27t_mount_and_none_of_the_above_helped._Is_there_any_hope_for_my_data.3F

And if the Wiki recommends to update the kernel and the btrfs progs, 
make sure to follow the advice :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 12:32 ` Josef Bacik
  2013-05-30 12:55   ` Stefan Behrens
@ 2013-05-31 14:55   ` Papp Tamas
  1 sibling, 0 replies; 47+ messages in thread
From: Papp Tamas @ 2013-05-31 14:55 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 05/30/2013 02:32 PM, Josef Bacik wrote:
>
> On Thu, May 30, 2013 at 05:17:06AM -0600, Papp Tamas wrote:
>> hi All,
>>
>> I'm new on the list.
>>
>> System:
>> Distributor ID:	Ubuntu
>> Description:	Ubuntu 13.04
>> Release:	13.04
>> Codename:	raring
>>
>> Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>
>> The symptom is the same with Saucy 3.9 kernel.
>
> Can you try btrfs-next
>
> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git
>
> if it's still not fixed please file a bug at bugzilla.kernel.org and make sure
> the component is set to btrfs.  Thanks,

I did:

https://bugzilla.kernel.org/show_bug.cgi?id=59141

I'm doubt, it will be helpful, as unfortunately I'm not familiar with bug reports regarding to the 
kernel.


Thank you,
tamas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-05-30 12:55   ` Stefan Behrens
  2013-05-30 14:03     ` Chris Mason
@ 2013-06-03 11:56     ` Papp Tamas
  2013-06-03 12:13       ` Hugo Mills
  1 sibling, 1 reply; 47+ messages in thread
From: Papp Tamas @ 2013-06-03 11:56 UTC (permalink / raw)
  To: Stefan Behrens; +Cc: Josef Bacik, linux-btrfs

On 05/30/2013 02:55 PM, Stefan Behrens wrote:
>
> On Thu, 30 May 2013 08:32:35 -0400, Josef Bacik wrote:
>> On Thu, May 30, 2013 at 05:17:06AM -0600, Papp Tamas wrote:
>>> hi All,
>>>
>>> I'm new on the list.
>>>
>>> System:
>>> Distributor ID:	Ubuntu
>>> Description:	Ubuntu 13.04
>>> Release:	13.04
>>> Codename:	raring
>>>
>>> Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> The symptom is the same with Saucy 3.9 kernel.
>>
>> Can you try btrfs-next
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git
>>
>> if it's still not fixed please file a bug at bugzilla.kernel.org and make sure
>> the component is set to btrfs.  Thanks,
>
> Papp is using an Intel X18-M/X25-M/X25-V G2 SSD. At least with an Intel
> X25 SSD that identifies itself with "INTEL SSDSA2M080" and on one with
> the ID "INTEL SSDSA2M040", I've tested whether they honor the flush
> request. And these two SSDs don't do so, they ignore it. If you cut the
> power after a flush request completes, the data that was written before
> the flush request is gone, the write cache was _not_ flushed.
>
> You can only disable the write cache during/after every boot "hdparm -W
> 0 /dev/sd..." (which reduces the SSDs write speed to about 4 MB/s), or
> avoid such SSDs, or prepare to restore from backup occasionally.

Basically it means it's not safe to use this SSD?
I used it for 2 years with ext4 without any issue, before I switched to btrfs (on the root 
partition). In the meantime btrfs also was quite stable on my /data partition.

After I reinstalled thr system with btrfs, this issue happened two times.
But anyway, I thought cow should be able to handle these kind of issues by design. Am I wrong?


Thanks,
tamas


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: oops at mount
  2013-06-03 11:56     ` Papp Tamas
@ 2013-06-03 12:13       ` Hugo Mills
  0 siblings, 0 replies; 47+ messages in thread
From: Hugo Mills @ 2013-06-03 12:13 UTC (permalink / raw)
  To: Papp Tamas; +Cc: Stefan Behrens, Josef Bacik, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3231 bytes --]

On Mon, Jun 03, 2013 at 01:56:10PM +0200, Papp Tamas wrote:
> On 05/30/2013 02:55 PM, Stefan Behrens wrote:
> >
> >On Thu, 30 May 2013 08:32:35 -0400, Josef Bacik wrote:
> >>On Thu, May 30, 2013 at 05:17:06AM -0600, Papp Tamas wrote:
> >>>hi All,
> >>>
> >>>I'm new on the list.
> >>>
> >>>System:
> >>>Distributor ID:	Ubuntu
> >>>Description:	Ubuntu 13.04
> >>>Release:	13.04
> >>>Codename:	raring
> >>>
> >>>Linux ctu 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> >>>
> >>>The symptom is the same with Saucy 3.9 kernel.
> >>
> >>Can you try btrfs-next
> >>
> >>git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git
> >>
> >>if it's still not fixed please file a bug at bugzilla.kernel.org and make sure
> >>the component is set to btrfs.  Thanks,
> >
> >Papp is using an Intel X18-M/X25-M/X25-V G2 SSD. At least with an Intel
> >X25 SSD that identifies itself with "INTEL SSDSA2M080" and on one with
> >the ID "INTEL SSDSA2M040", I've tested whether they honor the flush
> >request. And these two SSDs don't do so, they ignore it. If you cut the
> >power after a flush request completes, the data that was written before
> >the flush request is gone, the write cache was _not_ flushed.
> >
> >You can only disable the write cache during/after every boot "hdparm -W
> >0 /dev/sd..." (which reduces the SSDs write speed to about 4 MB/s), or
> >avoid such SSDs, or prepare to restore from backup occasionally.
> 
> Basically it means it's not safe to use this SSD?

   Correct.

> I used it for 2 years with ext4 without any issue, before I switched
> to btrfs (on the root partition). In the meantime btrfs also was
> quite stable on my /data partition.
> 
> After I reinstalled thr system with btrfs, this issue happened two times.
> But anyway, I thought cow should be able to handle these kind of issues by design. Am I wrong?

   CoW writes out everything that's going to be changed first, and
finally writes one piece of data which points to the new version of
the data. *Provided* you can guarantee that the final piece of data
(the superblock) gets written only after everything else has made it
to permanent storage, then everything is good.

   However, most hardware (and most operating systems) reorder the
data which is being sent to the disk, for performance reasons. This is
fine, as long as you can enforce the dependency in some way -- this is
what barriers/flushes do: they say "ensure that all of this is fully
written out to real permanent storage before you try to write the
superblock".

   If the hardware ignores flushes or barriers, there's no mechanism
for ensuring that the data is fully consistent, because you may find
that the superblock gets reordered to be written before some of the
other writes to the device. If that happens and then the power gets
cut before the rest of the data can be written, you have a corrupt
filesystem.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
         --- In theory, theory and practice are the same. In ---         
                      practice,  they're different.                      

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: OOPS at mount
  2007-04-26 20:21                                         ` Joakim Tjernlund
@ 2007-04-27  9:48                                           ` David Woodhouse
  0 siblings, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2007-04-27  9:48 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 22:21 +0200, Joakim Tjernlund wrote:
> So, you still think this was all due to removing summary in kernel while
> still mounting an image that had summary nodes in it?

Yes, I'm fairly sure that's the case.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: OOPS at mount
  2007-04-26 19:39                                       ` David Woodhouse
@ 2007-04-26 20:21                                         ` Joakim Tjernlund
  2007-04-27  9:48                                           ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 20:21 UTC (permalink / raw)
  To: 'David Woodhouse'; +Cc: linux-mtd

> 
> On Thu, 2007-04-26 at 16:37 +0200, Joakim Tjernlund wrote:
> > Most pkg files seems OK, 
> > pkg --list /tftpboot/download/cuappl02a-r8a-070425_2.powerpc.pkg
> > is OK 
> > but pkg --list cuappl02a-r8a-070221_10.powerpc.pkg 
> > isn't, but thats an old file that could have been damaged earlier.
> 
> I strongly suspect it was deleted while you had summary support.

Since that one was built in februari I think it must have been deleted
at some point.

So, you still think this was all due to removing summary in kernel while
still mounting an image that had summary nodes in it?

> 
> > I don't get any CRC errors on the restored image
> 
> That's expected. You'll only see those if you mount with summary
> support. The summary leads the kernel to expect that those nodes are
> valid, but in fact they've been marked obsolete.

OK.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 14:37                                     ` Joakim Tjernlund
@ 2007-04-26 19:39                                       ` David Woodhouse
  2007-04-26 20:21                                         ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 19:39 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 16:37 +0200, Joakim Tjernlund wrote:
> Most pkg files seems OK, 
> pkg --list /tftpboot/download/cuappl02a-r8a-070425_2.powerpc.pkg
> is OK 
> but pkg --list cuappl02a-r8a-070221_10.powerpc.pkg 
> isn't, but thats an old file that could have been damaged earlier.

I strongly suspect it was deleted while you had summary support.

> I don't get any CRC errors on the restored image

That's expected. You'll only see those if you mount with summary
support. The summary leads the kernel to expect that those nodes are
valid, but in fact they've been marked obsolete.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:56                                   ` David Woodhouse
@ 2007-04-26 14:37                                     ` Joakim Tjernlund
  2007-04-26 19:39                                       ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 14:37 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 14:56 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 15:52 +0200, Joakim Tjernlund wrote:
> > Asked the lab again and it is unlikly that any files were deleted in the
> > /tftpboot/download dir. A few RPMs were created as they were downloaded
> > to that dir before extraction/installation.
> 
> Hm. So those RPM files should have still been present and complete on
> the flash?
> 

Most pkg files seems OK, 
pkg --list /tftpboot/download/cuappl02a-r8a-070425_2.powerpc.pkg
is OK 
but pkg --list cuappl02a-r8a-070221_10.powerpc.pkg 
isn't, but thats an old file that could have been damaged earlier.

I don't get any CRC errors on the restored image

 Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:52                                 ` Joakim Tjernlund
@ 2007-04-26 13:56                                   ` David Woodhouse
  2007-04-26 14:37                                     ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 13:56 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 15:52 +0200, Joakim Tjernlund wrote:
> Asked the lab again and it is unlikly that any files were deleted in the
> /tftpboot/download dir. A few RPMs were created as they were downloaded
> to that dir before extraction/installation.

Hm. So those RPM files should have still been present and complete on
the flash?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:44                               ` David Woodhouse
@ 2007-04-26 13:52                                 ` Joakim Tjernlund
  2007-04-26 13:56                                   ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 13:52 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 14:44 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 15:39 +0200, Joakim Tjernlund wrote:
> > Talked to the lab and they THINK that they upgraded both
> > app and kernel at the same time after which they rebooted to
> > activate both kernel and app.
> > The old kernel had SUMMARY on, but the new one didn't. 
> 
> And the RPM should have been deleted before the reboot? Sounds like that
> was the problem.
> 

Asked the lab again and it is unlikly that any files were deleted in the
/tftpboot/download dir. A few RPMs were created as they were downloaded
to that dir before extraction/installation.

  Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:39                             ` Joakim Tjernlund
@ 2007-04-26 13:44                               ` David Woodhouse
  2007-04-26 13:52                                 ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 13:44 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 15:39 +0200, Joakim Tjernlund wrote:
> Talked to the lab and they THINK that they upgraded both
> app and kernel at the same time after which they rebooted to
> activate both kernel and app.
> The old kernel had SUMMARY on, but the new one didn't. 

And the RPM should have been deleted before the reboot? Sounds like that
was the problem.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:15                           ` David Woodhouse
  2007-04-26 13:31                             ` David Woodhouse
@ 2007-04-26 13:39                             ` Joakim Tjernlund
  2007-04-26 13:44                               ` David Woodhouse
  1 sibling, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 13:39 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 14:15 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 15:13 +0200, Joakim Tjernlund wrote:
> > On Thu, 2007-04-26 at 14:08 +0100, David Woodhouse wrote:
> > > On Thu, 2007-04-26 at 15:00 +0200, Joakim Tjernlund wrote:
> > > > No, its gone. Is yours still there?
> > > 
> > > Nope. I think the offending node got marked obsolete after we fixed the
> > > kernel not to crash when it saw it.
> > 
> > Yes, thats what I am thinking too. I can restore to the image I sent you
> > and see if the errors return. Will that help in any way?
> 
> Only in that it'll confuse the hell out of me if they do. But go on, if
> it's not much trouble :)

restoring now, but its slow.

> 
> I haven't _quite_ placed my finger on a sequence of events which leads
> to you having a few older nodes of this inode and a later metadata node.
> 
> Had it been deleted while you were still using summary?

Talked to the lab and they THINK that they upgraded both
app and kernel at the same time after which they rebooted to
activate both kernel and app.
The old kernel had SUMMARY on, but the new one didn't.
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:15                           ` David Woodhouse
@ 2007-04-26 13:31                             ` David Woodhouse
  2007-04-26 13:39                             ` Joakim Tjernlund
  1 sibling, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 13:31 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 14:15 +0100, David Woodhouse wrote:
> I haven't _quite_ placed my finger on a sequence of events which leads
> to you having a few older nodes of this inode and a later metadata
> node.
> 
> Had it been deleted while you were still using summary?

If it had, that could explain it.

You start off with this file starting in some eraseblock, with its
dirent, then filling a few more eraseblocks...

 ( D for dirent, I for data node ), [...] is an eraseblock.

1 [......DIIIIIIII]
2 [IIIIIIIIIIIIIII]
3 [IIIIIIIIIIIIIII]
4 [IIII...........]

You seem to have something which will 'touch' these files
in /tftpboot/download when you boot -- which is why I think we aren't
seeing the problem in the image you took. Note the change time on ino
#9943 is yesterday. So somewhere entirely different on the flash you'll
have one mode metadata node for the inode in question.

5 [.......I.......]

Now, you delete the file. You can immediately erase eraseblock #2 and
#3, but because you're using SUMMARY, you don't mark the nodes in the
other eraseblocks as obsolete. You delete the file you downloaded
immediately after the one we oopsed on too, so you erase eraseblock #4.

You do write out a deletion dirent, but I suspect that got erased
somehow later (perhaps when you weren't using SUMMARY?) 

So you're left with this on the medium.
1 [......DIIIIIIII]
 ...
5 [.......I.......]

And that's what caused the crash.

I've handwaved a bit about the deletion dirent going missing, so I'm
still not 100% convinced but I'm fairly sure it's disabling summary
which was the problem.

I wonder if we should seek out obsolete nodes and mark them as such, if
we're mounted on a filesystem with summary nodes when summary support is
deleted.

Then again, why in hell is !jffs2_can_mark_obsolete() when we have
summary support anyway?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:13                         ` Joakim Tjernlund
@ 2007-04-26 13:15                           ` David Woodhouse
  2007-04-26 13:31                             ` David Woodhouse
  2007-04-26 13:39                             ` Joakim Tjernlund
  0 siblings, 2 replies; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 13:15 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 15:13 +0200, Joakim Tjernlund wrote:
> On Thu, 2007-04-26 at 14:08 +0100, David Woodhouse wrote:
> > On Thu, 2007-04-26 at 15:00 +0200, Joakim Tjernlund wrote:
> > > No, its gone. Is yours still there?
> > 
> > Nope. I think the offending node got marked obsolete after we fixed the
> > kernel not to crash when it saw it.
> 
> Yes, thats what I am thinking too. I can restore to the image I sent you
> and see if the errors return. Will that help in any way?

Only in that it'll confuse the hell out of me if they do. But go on, if
it's not much trouble :)

I haven't _quite_ placed my finger on a sequence of events which leads
to you having a few older nodes of this inode and a later metadata node.

Had it been deleted while you were still using summary?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:08                       ` David Woodhouse
@ 2007-04-26 13:13                         ` Joakim Tjernlund
  2007-04-26 13:15                           ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 13:13 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 14:08 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 15:00 +0200, Joakim Tjernlund wrote:
> > No, its gone. Is yours still there?
> 
> Nope. I think the offending node got marked obsolete after we fixed the
> kernel not to crash when it saw it.

Yes, thats what I am thinking too. I can restore to the image I sent you
and see if the errors return. Will that help in any way?

> 
> I strongly suspect that the root cause here is the fact that you turned
> SUMMARY off -- in particular the way that affected the setting of
> jffs2_can_mark_obsolete().
> 
> That will have meant that _some_ nodes were actually marked obsolete on
> the medium, while other nodes weren't. I'm trying to construct a
> detailed scenario in which that leads to the error we saw, so I can
> convince myself 100% that this was the problem....

Thanks,

    Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 13:00                     ` Joakim Tjernlund
@ 2007-04-26 13:08                       ` David Woodhouse
  2007-04-26 13:13                         ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 13:08 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 15:00 +0200, Joakim Tjernlund wrote:
> No, its gone. Is yours still there?

Nope. I think the offending node got marked obsolete after we fixed the
kernel not to crash when it saw it.

I strongly suspect that the root cause here is the fact that you turned
SUMMARY off -- in particular the way that affected the setting of
jffs2_can_mark_obsolete().

That will have meant that _some_ nodes were actually marked obsolete on
the medium, while other nodes weren't. I'm trying to construct a
detailed scenario in which that leads to the error we saw, so I can
convince myself 100% that this was the problem....

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 12:49                   ` David Woodhouse
@ 2007-04-26 13:00                     ` Joakim Tjernlund
  2007-04-26 13:08                       ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 13:00 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 13:49 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 14:43 +0200, Joakim Tjernlund wrote:
> > I get this:
> > Node at 20830232 is ino #26d7, version 2. Range 0x0-0x1000, isize 4096
> > Node at 20832264 is ino #26d7, version 3. Range 0x1000-0x2000, isize 8192
> > Node at 20833744 is ino #26d7, version 4. Range 0x2000-0x3000, isize 12288
> > Node at 20834972 is ino #26d7, version 5. Range 0x3000-0x4000, isize 16384
> > Node at 46094176 is ino #26d7, version 1172. Range 0x0-0x0, isize 16384 
> 
> Right.... and do you still get that complaint about truncation?
> 

No, its gone. Is yours still there?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 12:49                             ` Joakim Tjernlund
@ 2007-04-26 12:49                               ` David Woodhouse
  0 siblings, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 12:49 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 14:49 +0200, Joakim Tjernlund wrote:
> hmm, just checked and I don NOT see any CRC errors, I just get the
> inode debug in scan, if (ino == 9943), printout. 

Yeah, those you'll only see if you turn SUMMARY back on.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 12:43                 ` Joakim Tjernlund
@ 2007-04-26 12:49                   ` David Woodhouse
  2007-04-26 13:00                     ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 12:49 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 14:43 +0200, Joakim Tjernlund wrote:
> I get this:
> Node at 20830232 is ino #26d7, version 2. Range 0x0-0x1000, isize 4096
> Node at 20832264 is ino #26d7, version 3. Range 0x1000-0x2000, isize 8192
> Node at 20833744 is ino #26d7, version 4. Range 0x2000-0x3000, isize 12288
> Node at 20834972 is ino #26d7, version 5. Range 0x3000-0x4000, isize 16384
> Node at 46094176 is ino #26d7, version 1172. Range 0x0-0x0, isize 16384 

Right.... and do you still get that complaint about truncation?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 11:41                           ` David Woodhouse
@ 2007-04-26 12:49                             ` Joakim Tjernlund
  2007-04-26 12:49                               ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 12:49 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 12:41 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 13:35 +0200, Joakim Tjernlund wrote:
> > 
> > ehh, me think I forgot to look in the syslog, I only checked the console
> > and these does not end up there, blushing :(
> > 
> > I think the CRC are from earlier poweroffs(just pulling the board when
> > app is running)
> 
> No, they're something weirder. Node that they're _node_ header CRC
> failures. Although actually my tests seem to show that _isn't_ a failed
> CRC. It should be good. 
> 
> Did you see those errors too, when you did look in the syslog?
> 

hmm, just checked and I don NOT see any CRC errors, I just get the
inode debug in scan, if (ino == 9943), printout.

 Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 12:31               ` David Woodhouse
@ 2007-04-26 12:43                 ` Joakim Tjernlund
  2007-04-26 12:49                   ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 12:43 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 13:31 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 13:45 +0200, Joakim Tjernlund wrote:
> > > JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino
> > #9943 to 4627829 bytes failed because it only had 16384 bytes to start
> > with!
> 
> I don't understand this. Looking at the image you gave me, and
> instrumenting jffs2_scan_inode_node() to print every node it finds
> belonging to that inode, I see this:
> 
> Node at 20830232 is ino #9943, version 2. Range 0x0-0x1000, isize 4096
> Node at 20832264 is ino #9943, version 3. Range 0x1000-0x2000, isize 8192
> Node at 20833744 is ino #9943, version 4. Range 0x2000-0x3000, isize 12288
> Node at 20834972 is ino #9943, version 5. Range 0x3000-0x4000, isize 16384
> Node at 46094176 is ino #9943, version 1172. Range 0x0-0x0, isize 16384
> 
> So I have no clue why your system is trying to truncate that inode to
> 4627829 bytes. Can you run with the same thing in scan.c? It looks like
> this...
> 
> 	if (ino == 9943)
> 	printk(KERN_DEBUG "Node at %u is ino #%x, version %d. Range 0x%x-0x%x, isize %u\n",
> 	       ofs,
> 		  je32_to_cpu(ri->ino), je32_to_cpu(ri->version),
> 		  je32_to_cpu(ri->offset),
> 	       je32_to_cpu(ri->offset)+je32_to_cpu(ri->dsize),
> 	       je32_to_cpu(ri->isize));
> 

I get this:
Node at 20830232 is ino #26d7, version 2. Range 0x0-0x1000, isize 4096
Node at 20832264 is ino #26d7, version 3. Range 0x1000-0x2000, isize 8192
Node at 20833744 is ino #26d7, version 4. Range 0x2000-0x3000, isize 12288
Node at 20834972 is ino #26d7, version 5. Range 0x3000-0x4000, isize 16384
Node at 46094176 is ino #26d7, version 1172. Range 0x0-0x0, isize 16384

And yes I did use SUMMARY and tured it off some tme ago.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 11:45             ` Joakim Tjernlund
  2007-04-26 11:57               ` David Woodhouse
@ 2007-04-26 12:31               ` David Woodhouse
  2007-04-26 12:43                 ` Joakim Tjernlund
  1 sibling, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 12:31 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 13:45 +0200, Joakim Tjernlund wrote:
> > JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino
> #9943 to 4627829 bytes failed because it only had 16384 bytes to start
> with!

I don't understand this. Looking at the image you gave me, and
instrumenting jffs2_scan_inode_node() to print every node it finds
belonging to that inode, I see this:

Node at 20830232 is ino #9943, version 2. Range 0x0-0x1000, isize 4096
Node at 20832264 is ino #9943, version 3. Range 0x1000-0x2000, isize 8192
Node at 20833744 is ino #9943, version 4. Range 0x2000-0x3000, isize 12288
Node at 20834972 is ino #9943, version 5. Range 0x3000-0x4000, isize 16384
Node at 46094176 is ino #9943, version 1172. Range 0x0-0x0, isize 16384

So I have no clue why your system is trying to truncate that inode to
4627829 bytes. Can you run with the same thing in scan.c? It looks like
this...

	if (ino == 9943)
	printk(KERN_DEBUG "Node at %u is ino #%x, version %d. Range 0x%x-0x%x, isize %u\n",
	       ofs,
		  je32_to_cpu(ri->ino), je32_to_cpu(ri->version),
		  je32_to_cpu(ri->offset),
	       je32_to_cpu(ri->offset)+je32_to_cpu(ri->dsize),
	       je32_to_cpu(ri->isize));

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 11:45             ` Joakim Tjernlund
@ 2007-04-26 11:57               ` David Woodhouse
  2007-04-26 12:31               ` David Woodhouse
  1 sibling, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 11:57 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 13:45 +0200, Joakim Tjernlund wrote:
> searched for the inode's reported above and found:
> ~/jffs2 # find . -inum 9947 -print
> ./tftpboot/download/osappl02a-r8a-070207_1.powerpc.pkg
> ~/jffs2 # find . -inum 9049 -print
> ./tftpboot/download/cuboot01a-r8a-070129_1.ppc_8xx.rpm
> ~/jffs2 # find . -inum 9943 -print
> ./tftpboot/download/tuappl01a-r8a-070207_1.ppc_8xx.rpm
> ~/jffs2 # find . -inum 10580 -print
> ./tftpboot/download/cuappl01a-r8a-070213_10.ppc_8xx.rpm
> ~/jffs2 # find . -inum 11246 -print
> ./tftpboot/download/cuappl02a-r8a-070221_10.powerpc.pkg
> ~/jffs2 # find . -inum 6537 -print 
> ./tftpboot/download/osappl01a-r3b-1.ppc_8xx.rpm
> 
> These all old rpms of ours apps that has been downloaded and installed
> earlier.

Yeah, I'm seeing those.

I'm confused by the node CRC failure. Those nodes shouldn't even end up
in the lists because they should be discarded at scan time -- the reason
the CRC fails is because the 'JFFS2_NODE_ACCURATE' bit has been cleared.
They're obsolete nodes.

Oh. You're not using SUMMARY any more, are you? So you've been writing
to this with a mixture of jffs2_can_mark_obsolete() and 
!jffs2_can_mark_obsolete().  Which means that when I mount with summary
enabled, I expect those nodes to be valid. And you've since obsoleted
them with summaries disabled, and they're not.

I wonder if that's the root cause of the problem you were seeing too?
I'll build without summary and see...

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 17:39           ` Joakim Tjernlund
  2007-04-26  6:32             ` David Woodhouse
@ 2007-04-26 11:45             ` Joakim Tjernlund
  2007-04-26 11:57               ` David Woodhouse
  2007-04-26 12:31               ` David Woodhouse
  1 sibling, 2 replies; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 11:45 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Wed, 2007-04-25 at 19:39 +0200, Joakim Tjernlund wrote:
> On Wed, 2007-04-25 at 17:07 +0100, David Woodhouse wrote:
> > I've just committed something to the git tree which should fix the
> > symptoms --and I think it should apply OK to 2.6.20 which predates other
> > improvements in read_inode(). 
> > 
> > We should still work out the _cause_ though. Once you've confirmed that
> > this patch avoids the crash, you'll know the inode number (because it'll
> > bitch about it). If you don't want to wait for a full
> > CONFIG_JFFS2_FS_DEBUG=1 run, can you enable all the output in
> > jffs2_get_inode_nodes() _just_ for that inode, and show me what nodes
> > are found?
> 
> Seems to work, at least I can boot now but I get a few compliants:
> Starting services: te_server JFFS2 warning: (144) jffs2_do_read_inode_internal: 
> Truncating ino #6537 to 160109 bytes failed because it only had 138184 bytes to 
> start with!
> JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #9049 to 70747
>  bytes failed because it only had 65536 bytes to start with!
> JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #9943 to 46278
> 29 bytes failed because it only had 16384 bytes to start with!
> JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #9947 to 10240
>  bytes failed because it only had 0 bytes to start with!
> JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #10580 to 4800
> 712 bytes failed because it only had 4788600 bytes to start with!
> JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #11246 to 4925
> 440 bytes failed because it only had 1717152 bytes to start with!
> te_log te ntpd alib_psupd icn_server cp_cm ne_rc -initshm cp_cop cp_pbh eq_bh_su
> pv eq_bh voa_main alib_swusd mgmt_named mgmt_invd eq_rh mgmt_pmd mgmt_backupd mg
> mt_backup_tftpd ne_db_mgr mgmt_alarmd tosv_server tosv_supv Remove /var/net-snmp
> /snmpd.conf snmpd -f -c /etc/lumentis/cuappl02a-r8a-070425_2/snmpd.conf -l /tmp/
> snmpd.log  zebra ospfd sys_sysconfd sys_ipconfd aux_mgrd eq_tm_mgr eq_tgtm_mgr e
> q_tgltm_mgr eq_mxm_mgr eq_gbem_mgr eq_fcm_mgr eq_qm_mgr eq_oam_mgr eq_xm_mgr eq_
> fxm_mgr eq_ram_mgr eq_cpm_mgr eq_gxpm_mgr eq_qxpm_mgr eq_dtgm_mgr eq_xgbm_mgr eq
> _voam_mgr eq_pbm_mgr eq_bm_mgr eq_topm_mgr eq_spfm_mgr eq_pm_mgr eq_ppp_mgr ne_s
> wumd alib_memsupd ne_rc ne_rc_load ne_rc_supv pppd httpd JFFS2 warning: (316) jf
> fs2_do_read_inode_internal: Truncating ino #10580 to 4800712 bytes failed becaus
> e it only had 4788600 bytes to start with!
> rsvpd gmpls dropbear JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncati
> ng ino #11246 to 4925440 bytes failed because it only had 1717152 bytes to start
>  with!
> JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #9049 to 70747
>  bytes failed because it only had 65536 bytes to start with!
> Key is a DSS key
> Wrote key to '/opt/appl/cuappl02a-r8a-070425_2/etc/dropbear_dss_host_key'
> JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #6537 to 16010
> 9 bytes failed because it only had 138184 bytes to start with!
> JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #9947 to 10240
>  bytes failed because it only had 0 bytes to start with!
> JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #9943 to 46278
> 29 bytes failed because it only had 16384 bytes to start with!
> Key is a RSA key
> Wrote key to '/opt/appl/cuappl02a-r8a-070425_2/etc/dropbear_rsa_host_key'
> GC .
> Done.

searched for the inode's reported above and found:
~/jffs2 # find . -inum 9947 -print
./tftpboot/download/osappl02a-r8a-070207_1.powerpc.pkg
~/jffs2 # find . -inum 9049 -print
./tftpboot/download/cuboot01a-r8a-070129_1.ppc_8xx.rpm
~/jffs2 # find . -inum 9943 -print
./tftpboot/download/tuappl01a-r8a-070207_1.ppc_8xx.rpm
~/jffs2 # find . -inum 10580 -print
./tftpboot/download/cuappl01a-r8a-070213_10.ppc_8xx.rpm
~/jffs2 # find . -inum 11246 -print
./tftpboot/download/cuappl02a-r8a-070221_10.powerpc.pkg
~/jffs2 # find . -inum 6537 -print 
./tftpboot/download/osappl01a-r3b-1.ppc_8xx.rpm

These all old rpms of ours apps that has been downloaded and installed
earlier. 

 Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 11:35                         ` Joakim Tjernlund
@ 2007-04-26 11:41                           ` David Woodhouse
  2007-04-26 12:49                             ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 11:41 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 13:35 +0200, Joakim Tjernlund wrote:
> 
> ehh, me think I forgot to look in the syslog, I only checked the console
> and these does not end up there, blushing :(
> 
> I think the CRC are from earlier poweroffs(just pulling the board when
> app is running)

No, they're something weirder. Node that they're _node_ header CRC
failures. Although actually my tests seem to show that _isn't_ a failed
CRC. It should be good. 

Did you see those errors too, when you did look in the syslog?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 11:27                       ` David Woodhouse
@ 2007-04-26 11:35                         ` Joakim Tjernlund
  2007-04-26 11:41                           ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 11:35 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 12:27 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 13:26 +0200, Joakim Tjernlund wrote:
> > Actually I forgot these:
> > [JFFS2] Use yield() between GC passes in background thread.
> > [JFFS2] Check for all-zero node headers
> > 
> > Should be harmless though.
> 
> Should be.
> 
> I'm seeing a whole bunch of CRC failures...
> 
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3fd28d4. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f99d40. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f4e7b4. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f25bcc. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x38c9fd0. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f41ad4. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f43598. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f43554. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f432a4. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f45984. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x64368ec. {1985,c002,00000044,a4ef223e}
> JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x1710fd0. {1985,c002,00000044,a4ef223e}

ehh, me think I forgot to look in the syslog, I only checked the console
and these does not end up there, blushing :(

I think the CRC are from earlier poweroffs(just pulling the board when
app is running)

 Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26 11:26                     ` Joakim Tjernlund
@ 2007-04-26 11:27                       ` David Woodhouse
  2007-04-26 11:35                         ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26 11:27 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 13:26 +0200, Joakim Tjernlund wrote:
> Actually I forgot these:
> [JFFS2] Use yield() between GC passes in background thread.
> [JFFS2] Check for all-zero node headers
> 
> Should be harmless though.

Should be.

I'm seeing a whole bunch of CRC failures...

JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3fd28d4. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f99d40. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f4e7b4. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f25bcc. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x38c9fd0. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f41ad4. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f43598. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f43554. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f432a4. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x3f45984. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x64368ec. {1985,c002,00000044,a4ef223e}
JFFS2 notice: (27730) jffs2_get_inode_nodes: Node header CRC failed at 0x1710fd0. {1985,c002,00000044,a4ef223e}

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26  9:04                   ` Joakim Tjernlund
  2007-04-26  9:21                     ` David Woodhouse
@ 2007-04-26 11:26                     ` Joakim Tjernlund
  2007-04-26 11:27                       ` David Woodhouse
  1 sibling, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26 11:26 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 11:04 +0200, Joakim Tjernlund wrote:
> On Thu, 2007-04-26 at 09:29 +0100, David Woodhouse wrote:
> > On Thu, 2007-04-26 at 10:20 +0200, Joakim Tjernlund wrote:
> > > No this is a matter of installing a number of new execuables/libs in
> > > the FS and moving a few symlinks. Then a reboot.
> > > I suspect that the it is only one of these printout that actually
> > > makes the system crash, the other ones has problaby been there for a
> > > while. Does that make sense?
> > 
> > Yes. Only one of those was a complaint with "but the old size was 0",
> > which will have led to a NULL frag_last() since there were no frags.
> > 
> > > No, none acutally.
> > 
> > Hmmm.
> > 
> > > > 
> > > > This is with JFFS2 from 2.6.20, right? Not a bug in the read_inode code
> > > > I just committed a couple of days ago?
> > > 
> > > Plain 2.6.20 with my optimized scan you just commited.
> > 
> > By 'optimised scan' you just mean the fix to make it not crash, right?
> 
> No, sorry to be a bit unclear. I mean 
>   "[JFFS2] Speed up mount for directly-mapped NOR flash"
> and then the patch you did yesterday:
>   "[JFFS2] Handle inodes with only a single metadata node with non-zero isize"
> Had to apply than one by hand, but it was easy.

Actually I forgot these:
[JFFS2] Use yield() between GC passes in background thread.
[JFFS2] Check for all-zero node headers

Should be harmless though.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26  9:04                   ` Joakim Tjernlund
@ 2007-04-26  9:21                     ` David Woodhouse
  2007-04-26 11:26                     ` Joakim Tjernlund
  1 sibling, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2007-04-26  9:21 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 11:04 +0200, Joakim Tjernlund wrote:
> Since I can't reproduce how it happened we are stuck with the image we
> have, would a full CONFIG_JFFS2_FS_DEBUG=1 help now?

Might be enlightening. And/or a copy of the image if you can?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26  8:29                 ` David Woodhouse
@ 2007-04-26  9:04                   ` Joakim Tjernlund
  2007-04-26  9:21                     ` David Woodhouse
  2007-04-26 11:26                     ` Joakim Tjernlund
  0 siblings, 2 replies; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26  9:04 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 09:29 +0100, David Woodhouse wrote:
> On Thu, 2007-04-26 at 10:20 +0200, Joakim Tjernlund wrote:
> > No this is a matter of installing a number of new execuables/libs in
> > the FS and moving a few symlinks. Then a reboot.
> > I suspect that the it is only one of these printout that actually
> > makes the system crash, the other ones has problaby been there for a
> > while. Does that make sense?
> 
> Yes. Only one of those was a complaint with "but the old size was 0",
> which will have led to a NULL frag_last() since there were no frags.
> 
> > No, none acutally.
> 
> Hmmm.
> 
> > > 
> > > This is with JFFS2 from 2.6.20, right? Not a bug in the read_inode code
> > > I just committed a couple of days ago?
> > 
> > Plain 2.6.20 with my optimized scan you just commited.
> 
> By 'optimised scan' you just mean the fix to make it not crash, right?

No, sorry to be a bit unclear. I mean 
  "[JFFS2] Speed up mount for directly-mapped NOR flash"
and then the patch you did yesterday:
  "[JFFS2] Handle inodes with only a single metadata node with non-zero isize"
Had to apply than one by hand, but it was easy.

> 
> Not the _real_ optimisation I committed before that, which rewrites the
> entire read_inode() code path?
> 
> > > 
> > > > Wonder how the lab manged to get that many corrupted nodes?
> > > > One thing that is rather new in our system is that we trigger GC by
> > > > sending HUP to the GC thread from a script in user space at startup
> > > > and then every 24 hours.
> > > 
> > > That shouldn't make any difference. 
> > 
> > I know, but thats the only thing that I can think of that is somewhat
> > unique to our system.
> 
> Hm. I'm trying to think how it could trigger the problem -- even if we
> didn't block SIGHUP when we were in the GC routines, I still don't see
> how we could lose old nodes of a file without writing out new ones.
> 
> You can reproduce this at will by mounting a _clean_ filesystem, then
> "installing a number of new executables and moving a few symlinks" and
> then rebooting?

No, this SW upgrade procedure has been performed many times and this
is the first time we had this problem. 

> 
> Can you do that all with CONFIG_JFFS2_FS_DEBUG=1 and log it? I'll then
> see where _all_ the nodes for these problematics files are on the first
> boot, I'll see what happens when you make changes, and I'll see what we
> find on the second boot.
> 
> If you kill the GC thread (or hack the kernel not to start it) that'll
> make the dump a bit less noisy.

Since I can't reproduce how it happened we are stuck with the image we
have, would a full CONFIG_JFFS2_FS_DEBUG=1 help now?

 Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26  8:20               ` Joakim Tjernlund
@ 2007-04-26  8:29                 ` David Woodhouse
  2007-04-26  9:04                   ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26  8:29 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Thu, 2007-04-26 at 10:20 +0200, Joakim Tjernlund wrote:
> No this is a matter of installing a number of new execuables/libs in
> the FS and moving a few symlinks. Then a reboot.
> I suspect that the it is only one of these printout that actually
> makes the system crash, the other ones has problaby been there for a
> while. Does that make sense?

Yes. Only one of those was a complaint with "but the old size was 0",
which will have led to a NULL frag_last() since there were no frags.

> No, none acutally.

Hmmm.

> > 
> > This is with JFFS2 from 2.6.20, right? Not a bug in the read_inode code
> > I just committed a couple of days ago?
> 
> Plain 2.6.20 with my optimized scan you just commited.

By 'optimised scan' you just mean the fix to make it not crash, right?

Not the _real_ optimisation I committed before that, which rewrites the
entire read_inode() code path?

> > 
> > > Wonder how the lab manged to get that many corrupted nodes?
> > > One thing that is rather new in our system is that we trigger GC by
> > > sending HUP to the GC thread from a script in user space at startup
> > > and then every 24 hours.
> > 
> > That shouldn't make any difference. 
> 
> I know, but thats the only thing that I can think of that is somewhat
> unique to our system.

Hm. I'm trying to think how it could trigger the problem -- even if we
didn't block SIGHUP when we were in the GC routines, I still don't see
how we could lose old nodes of a file without writing out new ones.

You can reproduce this at will by mounting a _clean_ filesystem, then
"installing a number of new executables and moving a few symlinks" and
then rebooting?

Can you do that all with CONFIG_JFFS2_FS_DEBUG=1 and log it? I'll then
see where _all_ the nodes for these problematics files are on the first
boot, I'll see what happens when you make changes, and I'll see what we
find on the second boot.

If you kill the GC thread (or hack the kernel not to start it) that'll
make the dump a bit less noisy.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-26  6:32             ` David Woodhouse
@ 2007-04-26  8:20               ` Joakim Tjernlund
  2007-04-26  8:29                 ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-26  8:20 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Thu, 2007-04-26 at 07:32 +0100, David Woodhouse wrote:
> On Wed, 2007-04-25 at 19:39 +0200, Joakim Tjernlund wrote:
> > On Wed, 2007-04-25 at 17:07 +0100, David Woodhouse wrote:
> > > I've just committed something to the git tree which should fix the
> > > symptoms --and I think it should apply OK to 2.6.20 which predates other
> > > improvements in read_inode(). 
> > > 
> > > We should still work out the _cause_ though. Once you've confirmed that
> > > this patch avoids the crash, you'll know the inode number (because it'll
> > > bitch about it). If you don't want to wait for a full
> > > CONFIG_JFFS2_FS_DEBUG=1 run, can you enable all the output in
> > > jffs2_get_inode_nodes() _just_ for that inode, and show me what nodes
> > > are found?
> > 
> > Seems to work, at least I can boot now but I get a few compliants:
> 
> Hm, so you've lost a _lot_ of data. I don't suppose there's any chance
> they've been putting new images at the start of the flash without
> actually _erasing_ the remainder of the space?

No this is a matter of installing a number of new execuables/libs in
the FS and moving a few symlinks. Then a reboot.
I suspect that the it is only one of these printout that actually
makes the system crash, the other ones has problaby been there for a
while. Does that make sense?

> 
> Are you getting lots of messages about bad CRCs?

No, none acutally.

> 
> This is with JFFS2 from 2.6.20, right? Not a bug in the read_inode code
> I just committed a couple of days ago?

Plain 2.6.20 with my optimized scan you just commited.

> 
> > Wonder how the lab manged to get that many corrupted nodes?
> > One thing that is rather new in our system is that we trigger GC by
> > sending HUP to the GC thread from a script in user space at startup
> > and then every 24 hours.
> 
> That shouldn't make any difference. 

I know, but thats the only thing that I can think of that is somewhat
unique to our system.

 Jocke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 17:39           ` Joakim Tjernlund
@ 2007-04-26  6:32             ` David Woodhouse
  2007-04-26  8:20               ` Joakim Tjernlund
  2007-04-26 11:45             ` Joakim Tjernlund
  1 sibling, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-26  6:32 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Wed, 2007-04-25 at 19:39 +0200, Joakim Tjernlund wrote:
> On Wed, 2007-04-25 at 17:07 +0100, David Woodhouse wrote:
> > I've just committed something to the git tree which should fix the
> > symptoms --and I think it should apply OK to 2.6.20 which predates other
> > improvements in read_inode(). 
> > 
> > We should still work out the _cause_ though. Once you've confirmed that
> > this patch avoids the crash, you'll know the inode number (because it'll
> > bitch about it). If you don't want to wait for a full
> > CONFIG_JFFS2_FS_DEBUG=1 run, can you enable all the output in
> > jffs2_get_inode_nodes() _just_ for that inode, and show me what nodes
> > are found?
> 
> Seems to work, at least I can boot now but I get a few compliants:

Hm, so you've lost a _lot_ of data. I don't suppose there's any chance
they've been putting new images at the start of the flash without
actually _erasing_ the remainder of the space?

Are you getting lots of messages about bad CRCs?

This is with JFFS2 from 2.6.20, right? Not a bug in the read_inode code
I just committed a couple of days ago?

> Wonder how the lab manged to get that many corrupted nodes?
> One thing that is rather new in our system is that we trigger GC by
> sending HUP to the GC thread from a script in user space at startup
> and then every 24 hours.

That shouldn't make any difference. 

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 16:07         ` David Woodhouse
@ 2007-04-25 17:39           ` Joakim Tjernlund
  2007-04-26  6:32             ` David Woodhouse
  2007-04-26 11:45             ` Joakim Tjernlund
  0 siblings, 2 replies; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-25 17:39 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Wed, 2007-04-25 at 17:07 +0100, David Woodhouse wrote:
> I've just committed something to the git tree which should fix the
> symptoms --and I think it should apply OK to 2.6.20 which predates other
> improvements in read_inode(). 
> 
> We should still work out the _cause_ though. Once you've confirmed that
> this patch avoids the crash, you'll know the inode number (because it'll
> bitch about it). If you don't want to wait for a full
> CONFIG_JFFS2_FS_DEBUG=1 run, can you enable all the output in
> jffs2_get_inode_nodes() _just_ for that inode, and show me what nodes
> are found?

Seems to work, at least I can boot now but I get a few compliants:
Starting services: te_server JFFS2 warning: (144) jffs2_do_read_inode_internal: 
Truncating ino #6537 to 160109 bytes failed because it only had 138184 bytes to 
start with!
JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #9049 to 70747
 bytes failed because it only had 65536 bytes to start with!
JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #9943 to 46278
29 bytes failed because it only had 16384 bytes to start with!
JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #9947 to 10240
 bytes failed because it only had 0 bytes to start with!
JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #10580 to 4800
712 bytes failed because it only had 4788600 bytes to start with!
JFFS2 warning: (144) jffs2_do_read_inode_internal: Truncating ino #11246 to 4925
440 bytes failed because it only had 1717152 bytes to start with!
te_log te ntpd alib_psupd icn_server cp_cm ne_rc -initshm cp_cop cp_pbh eq_bh_su
pv eq_bh voa_main alib_swusd mgmt_named mgmt_invd eq_rh mgmt_pmd mgmt_backupd mg
mt_backup_tftpd ne_db_mgr mgmt_alarmd tosv_server tosv_supv Remove /var/net-snmp
/snmpd.conf snmpd -f -c /etc/lumentis/cuappl02a-r8a-070425_2/snmpd.conf -l /tmp/
snmpd.log  zebra ospfd sys_sysconfd sys_ipconfd aux_mgrd eq_tm_mgr eq_tgtm_mgr e
q_tgltm_mgr eq_mxm_mgr eq_gbem_mgr eq_fcm_mgr eq_qm_mgr eq_oam_mgr eq_xm_mgr eq_
fxm_mgr eq_ram_mgr eq_cpm_mgr eq_gxpm_mgr eq_qxpm_mgr eq_dtgm_mgr eq_xgbm_mgr eq
_voam_mgr eq_pbm_mgr eq_bm_mgr eq_topm_mgr eq_spfm_mgr eq_pm_mgr eq_ppp_mgr ne_s
wumd alib_memsupd ne_rc ne_rc_load ne_rc_supv pppd httpd JFFS2 warning: (316) jf
fs2_do_read_inode_internal: Truncating ino #10580 to 4800712 bytes failed becaus
e it only had 4788600 bytes to start with!
rsvpd gmpls dropbear JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncati
ng ino #11246 to 4925440 bytes failed because it only had 1717152 bytes to start
 with!
JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #9049 to 70747
 bytes failed because it only had 65536 bytes to start with!
Key is a DSS key
Wrote key to '/opt/appl/cuappl02a-r8a-070425_2/etc/dropbear_dss_host_key'
JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #6537 to 16010
9 bytes failed because it only had 138184 bytes to start with!
JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #9947 to 10240
 bytes failed because it only had 0 bytes to start with!
JFFS2 warning: (316) jffs2_do_read_inode_internal: Truncating ino #9943 to 46278
29 bytes failed because it only had 16384 bytes to start with!
Key is a RSA key
Wrote key to '/opt/appl/cuappl02a-r8a-070425_2/etc/dropbear_rsa_host_key'
GC .
Done.

Wonder how the lab manged to get that many corrupted nodes?
One thing that is rather new in our system is that we trigger GC by
sending HUP to the GC thread from a script in user space at startup and
then every 24 hours.

> 
> It sounds like some nodes have gone missing, leaving you with _only_ a
> metadata node -- with no data of its own, just other inode information
> including a non-zero i_size.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 15:56       ` Joakim Tjernlund
@ 2007-04-25 16:07         ` David Woodhouse
  2007-04-25 17:39           ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-25 16:07 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

I've just committed something to the git tree which should fix the
symptoms --and I think it should apply OK to 2.6.20 which predates other
improvements in read_inode(). 

We should still work out the _cause_ though. Once you've confirmed that
this patch avoids the crash, you'll know the inode number (because it'll
bitch about it). If you don't want to wait for a full
CONFIG_JFFS2_FS_DEBUG=1 run, can you enable all the output in
jffs2_get_inode_nodes() _just_ for that inode, and show me what nodes
are found?

It sounds like some nodes have gone missing, leaving you with _only_ a
metadata node -- with no data of its own, just other inode information
including a non-zero i_size.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 15:44     ` David Woodhouse
@ 2007-04-25 15:56       ` Joakim Tjernlund
  2007-04-25 16:07         ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-25 15:56 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Wed, 2007-04-25 at 16:44 +0100, David Woodhouse wrote:
> On Wed, 2007-04-25 at 17:40 +0200, Joakim Tjernlund wrote:
> > That will OOPS in __jffs2_dbg_dump_fragtree: 
> 
> Heh. Yeah, the fragtree is, strangely, empty. What length is it being
> truncated to? What nodes exist for this inode?
> 
size into: jffs2_truncate_fragtree (c=0xcff2b800, list=0xc0237a10,
size=0x2800) 

dunno how to list to inode you are asking for.

On Wed, 2007-04-25 at 16:46 +0100, David Woodhouse wrote: 
> On Wed, 2007-04-25 at 17:40 +0200, Joakim Tjernlund wrote:
> > +       if (!list->rb_node)
> > +          __jffs2_dbg_dump_fragtree(frag); 
> 
> That one wouldn't have oopsed if you had passed it 'f' as you were
> supposed to, instead of passing it NULL.

ehh, what 'f'? there is no 'f' in jffs2_truncate_fragtree()

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 15:40   ` Joakim Tjernlund
  2007-04-25 15:44     ` David Woodhouse
@ 2007-04-25 15:46     ` David Woodhouse
  1 sibling, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2007-04-25 15:46 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Wed, 2007-04-25 at 17:40 +0200, Joakim Tjernlund wrote:
> +       if (!list->rb_node)
> +          __jffs2_dbg_dump_fragtree(frag); 

That one wouldn't have oopsed if you had passed it 'f' as you were
supposed to, instead of passing it NULL.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 15:40   ` Joakim Tjernlund
@ 2007-04-25 15:44     ` David Woodhouse
  2007-04-25 15:56       ` Joakim Tjernlund
  2007-04-25 15:46     ` David Woodhouse
  1 sibling, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-25 15:44 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Wed, 2007-04-25 at 17:40 +0200, Joakim Tjernlund wrote:
> That will OOPS in __jffs2_dbg_dump_fragtree: 

Heh. Yeah, the fragtree is, strangely, empty. What length is it being
truncated to? What nodes exist for this inode?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 15:23 ` David Woodhouse
@ 2007-04-25 15:40   ` Joakim Tjernlund
  2007-04-25 15:44     ` David Woodhouse
  2007-04-25 15:46     ` David Woodhouse
  0 siblings, 2 replies; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-25 15:40 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Wed, 2007-04-25 at 16:23 +0100, David Woodhouse wrote:
> On Wed, 2007-04-25 at 17:09 +0200, Joakim Tjernlund wrote:
> > --- a/fs/jffs2/nodelist.c
> > +++ b/fs/jffs2/nodelist.c
> > @@ -83,6 +83,10 @@ void jffs2_truncate_fragtree(struct jffs2_sb_info *c, struct 
> >          * REF_PRISTINE irrespective of its size.
> >          */
> >         frag = frag_last(list);
> > +       if (!frag) {
> > +          printk(KERN_ERR "frag==NULL\n");
> > +          BUG();
> > +       }
> >         if (frag->node && (frag->ofs & (PAGE_CACHE_SIZE - 1)) == 0) {
> >                 dbg_fragtree2("marking the last fragment 0x%08x-0x%08x REF_PRIST
> >                         frag->ofs, frag->ofs + frag->size);
> > 
> > and set a BP on the BUG().
> > Here is some data when stopping the BP:
> > (gdb) bt
> > #0  jffs2_truncate_fragtree (c=0xcff56800, list=0xc023d610, size=0x2800)
> >     at fs/jffs2/nodelist.c:88
> > #1  0xc00d23dc in jffs2_do_read_inode_internal (c=0xcff56800, f=0xc023d600, 
> >     latest_node=0xc0635ea0) at fs/jffs2/readinode.c:813
> > #2  0xc00d25cc in jffs2_do_crccheck_inode (c=0xcff56800, ic=0xcfe6d398)
> >     at fs/jffs2/readinode.c:971
> 
> So... you're truncating the inode in question to a non-zero length, but
> it didn't have any data nodes beforehand. How on earth did that happen?

Well, they said they updated the app SW and rebooted ...

> 
> Can you dump the fragtree before the truncation, and show the size it's
> being truncated to?
You mean something like this?
--- a/fs/jffs2/nodelist.c
+++ b/fs/jffs2/nodelist.c
@@ -59,7 +59,8 @@ void jffs2_truncate_fragtree(struct jffs2_sb_info *c,
struct r
        struct jffs2_node_frag *frag = jffs2_lookup_node_frag(list,
size);
 
        dbg_fragtree("truncating fragtree to 0x%08x bytes\n", size);
-
+       if (!list->rb_node)
+          __jffs2_dbg_dump_fragtree(frag);
        /* We know frag->ofs <= size. That's what lookup does for us */
        if (frag && frag->ofs != size) {
                if (frag->ofs+frag->size > size) {
@@ -83,6 +84,10 @@ void jffs2_truncate_fragtree(struct jffs2_sb_info *c,
struct 
         * REF_PRISTINE irrespective of its size.
         */
        frag = frag_last(list);
+       if (!frag) {
+          printk(KERN_ERR "frag==NULL\n");
+          BUG();
+       }

That will OOPS in __jffs2_dbg_dump_fragtree:
 Unable to handle kernel paging request for data at 
address 0x00000000
Faulting instruction address: 0xc00d960c
Oops: Kernel access of bad area, sig: 11 [#1]

NIP: C00D960C LR: C00CDC64 CTR: 00000000
REGS: c0635d30 TRAP: 0300   Not tainted  (2.6.20)
MSR: 00009032 <EE,ME,IR,DR>  CR: 22022042  XER: 20000000
DAR: 00000000, DSISR: 20000000
TASK = c04bd810[144] 'jffs2_gcd_mtd6' THREAD: c0634000
GPR00: C00CDC64 C0635DE0 C04BD810 00000000 00000000 00000000 00000000
00000000 
GPR08: C01DC618 00008000 D443E188 00000000 22022084 FCE9FEF7 C0635E38
C0635E2C 
GPR16: C0635E38 C01E0000 00000000 00000000 C0635E24 C0635E20 C0237A0C
CF8023D8 
GPR24: 82022022 C0635EA0 C0237A00 CFF2B800 C0237A10 00000000 00000000
00000000 
NIP [C00D960C] __jffs2_dbg_dump_fragtree+0x14/0x74
LR [C00CDC64] jffs2_truncate_fragtree+0xd0/0x128
Call Trace:
[C0635DE0] [C0635EA0] 0xc0635ea0 (unreliable)
[C0635DF0] [C00CDC64] jffs2_truncate_fragtree+0xd0/0x128
[C0635E10] [C00D1870] jffs2_do_read_inode_internal+0xee4/0x10a4
[C0635E90] [C00D1A88] jffs2_do_crccheck_inode+0x58/0xb4
[C0635F00] [C00D5668] jffs2_garbage_collect_pass+0x174/0x6cc
[C0635F50] [C00D6E90] jffs2_garbage_collect_thread+0xa0/0x11c
[C0635FF0] [C000FF70] kernel_thread+0x44/0x60
Instruction dump:
3c60c021 3ca0c01e 38a59ef8 3863e47c 4bf42b85 0fe00000 48000000 7c0802a6 
9421fff0 93e1000c 7c7f1b78 90010014 <7c001828> 3000ffff 7c00192d
40a2fff4
> 
> Sounds like the simple fix is 'if (!frag) return;', but I'd like to know
> what's actually happening.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: OOPS at mount
  2007-04-25 15:09 OOPS " Joakim Tjernlund
@ 2007-04-25 15:23 ` David Woodhouse
  2007-04-25 15:40   ` Joakim Tjernlund
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2007-04-25 15:23 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: linux-mtd

On Wed, 2007-04-25 at 17:09 +0200, Joakim Tjernlund wrote:
> --- a/fs/jffs2/nodelist.c
> +++ b/fs/jffs2/nodelist.c
> @@ -83,6 +83,10 @@ void jffs2_truncate_fragtree(struct jffs2_sb_info *c, struct 
>          * REF_PRISTINE irrespective of its size.
>          */
>         frag = frag_last(list);
> +       if (!frag) {
> +          printk(KERN_ERR "frag==NULL\n");
> +          BUG();
> +       }
>         if (frag->node && (frag->ofs & (PAGE_CACHE_SIZE - 1)) == 0) {
>                 dbg_fragtree2("marking the last fragment 0x%08x-0x%08x REF_PRIST
>                         frag->ofs, frag->ofs + frag->size);
> 
> and set a BP on the BUG().
> Here is some data when stopping the BP:
> (gdb) bt
> #0  jffs2_truncate_fragtree (c=0xcff56800, list=0xc023d610, size=0x2800)
>     at fs/jffs2/nodelist.c:88
> #1  0xc00d23dc in jffs2_do_read_inode_internal (c=0xcff56800, f=0xc023d600, 
>     latest_node=0xc0635ea0) at fs/jffs2/readinode.c:813
> #2  0xc00d25cc in jffs2_do_crccheck_inode (c=0xcff56800, ic=0xcfe6d398)
>     at fs/jffs2/readinode.c:971

So... you're truncating the inode in question to a non-zero length, but
it didn't have any data nodes beforehand. How on earth did that happen?

Can you dump the fragtree before the truncation, and show the size it's
being truncated to?

Sounds like the simple fix is 'if (!frag) return;', but I'd like to know
what's actually happening.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* OOPS at mount
@ 2007-04-25 15:09 Joakim Tjernlund
  2007-04-25 15:23 ` David Woodhouse
  0 siblings, 1 reply; 47+ messages in thread
From: Joakim Tjernlund @ 2007-04-25 15:09 UTC (permalink / raw)
  To: linux-mtd

Got a board form the lab here that OOPS during boot(Linux 2.6.20, powerpc):
VFS: Mounted root (jffs2 filesystem).
Freeing unused kernel memory: 136k init
Starting Lumentis Main Script: /opt/appl/next is a link
Starting from /opt/appl/cuappl02a-r8a-070425_2/bin
Wed Apr 25 15:25:57 CEST 2007
Starting services: te_server
Unable to handle kernel paging request for data at address 0x0000000c
Faulting instruction address: 0xc00cdc54
Oops: Kernel access of bad area, sig: 11 [#1]

NIP: C00CDC54 LR: C00CDC54 CTR: 00000000
REGS: c0635d40 TRAP: 0300   Not tainted  (2.6.20)
MSR: 00009032 <EE,ME,IR,DR>  CR: 22022042  XER: 20000000
DAR: 0000000C, DSISR: 20000000
TASK = c04bd810[144] 'jffs2_gcd_mtd6' THREAD: c0634000
GPR00: 00000000 C0635DF0 C04BD810 00000000 00002800 00002800 00000000 00000000 
GPR08: C01DB5DC 00008000 2A7D97E9 00000000 22022084 FCE9FEF7 C0635E38 C0635E2C 
GPR16: C0635E38 C01E0000 00000000 00000000 C0635E24 C0635E20 C0235A0C CF456A28 
GPR24: 82022022 C0635EA0 C0235A00 CFF6A800 C0235A10 00000000 00002800 00000000 
NIP [C00CDC54] jffs2_truncate_fragtree+0xc0/0xf0
LR [C00CDC54] jffs2_truncate_fragtree+0xc0/0xf0
Call Trace:
[C0635E10] [C00D1808] jffs2_do_read_inode_internal+0xec4/0x1084
[C0635E90] [C00D1A20] jffs2_do_crccheck_inode+0x58/0xb4
[C0635F00] [C00D55E8] jffs2_garbage_collect_pass+0x174/0x6c4
[C0635F50] [C00D6E00] jffs2_garbage_collect_thread+0xa0/0x11c
[C0635FF0] [C000FF70] kernel_thread+0x44/0x60
Instruction dump:
2f9d0000 7fbfeb78 409effc8 2f9e0000 409e0018 80010024 bb61000c 38210020 
7c0803a6 4e800020 7f83e378 48027009 <8123000c> 2f890000 41beffdc 80030014 

Did some debugging: CONFIG_JFFS2_FS_DEBUG=1 didn't print
anything useful.

Added a litte intrumentation like so:
diff --git a/fs/jffs2/nodelist.c b/fs/jffs2/nodelist.c
index 5a6b4d6..33f526a 100644
--- a/fs/jffs2/nodelist.c
+++ b/fs/jffs2/nodelist.c
@@ -83,6 +83,10 @@ void jffs2_truncate_fragtree(struct jffs2_sb_info *c, struct 
         * REF_PRISTINE irrespective of its size.
         */
        frag = frag_last(list);
+       if (!frag) {
+          printk(KERN_ERR "frag==NULL\n");
+          BUG();
+       }
        if (frag->node && (frag->ofs & (PAGE_CACHE_SIZE - 1)) == 0) {
                dbg_fragtree2("marking the last fragment 0x%08x-0x%08x REF_PRIST
                        frag->ofs, frag->ofs + frag->size);

and set a BP on the BUG().
Here is some data when stopping the BP:
(gdb) bt
#0  jffs2_truncate_fragtree (c=0xcff56800, list=0xc023d610, size=0x2800)
    at fs/jffs2/nodelist.c:88
#1  0xc00d23dc in jffs2_do_read_inode_internal (c=0xcff56800, f=0xc023d600, 
    latest_node=0xc0635ea0) at fs/jffs2/readinode.c:813
#2  0xc00d25cc in jffs2_do_crccheck_inode (c=0xcff56800, ic=0xcfe6d398)
    at fs/jffs2/readinode.c:971
#3  0xc00d68a8 in jffs2_garbage_collect_pass (c=0xcff56800)
    at fs/jffs2/gc.c:208
#4  0xc00d8650 in jffs2_garbage_collect_thread (_c=0xe)
    at fs/jffs2/background.c:140
#5  0xc000ff70 in kernel_thread ()
Previous frame inner to this frame (corrupt stack?)
(gdb) print *list
$1 = {rb_node = 0x0}
(gdb) 

^ permalink raw reply related	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2013-06-03 12:13 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-30 11:17 oops at mount Papp Tamas
2013-05-30 12:32 ` Josef Bacik
2013-05-30 12:55   ` Stefan Behrens
2013-05-30 14:03     ` Chris Mason
2013-05-30 14:59       ` Stefan Behrens
2013-05-30 16:37         ` Chris Mason
2013-06-03 11:56     ` Papp Tamas
2013-06-03 12:13       ` Hugo Mills
2013-05-31 14:55   ` Papp Tamas
2013-05-30 20:08 ` Stefan Behrens
  -- strict thread matches above, loose matches on Subject: below --
2007-04-25 15:09 OOPS " Joakim Tjernlund
2007-04-25 15:23 ` David Woodhouse
2007-04-25 15:40   ` Joakim Tjernlund
2007-04-25 15:44     ` David Woodhouse
2007-04-25 15:56       ` Joakim Tjernlund
2007-04-25 16:07         ` David Woodhouse
2007-04-25 17:39           ` Joakim Tjernlund
2007-04-26  6:32             ` David Woodhouse
2007-04-26  8:20               ` Joakim Tjernlund
2007-04-26  8:29                 ` David Woodhouse
2007-04-26  9:04                   ` Joakim Tjernlund
2007-04-26  9:21                     ` David Woodhouse
2007-04-26 11:26                     ` Joakim Tjernlund
2007-04-26 11:27                       ` David Woodhouse
2007-04-26 11:35                         ` Joakim Tjernlund
2007-04-26 11:41                           ` David Woodhouse
2007-04-26 12:49                             ` Joakim Tjernlund
2007-04-26 12:49                               ` David Woodhouse
2007-04-26 11:45             ` Joakim Tjernlund
2007-04-26 11:57               ` David Woodhouse
2007-04-26 12:31               ` David Woodhouse
2007-04-26 12:43                 ` Joakim Tjernlund
2007-04-26 12:49                   ` David Woodhouse
2007-04-26 13:00                     ` Joakim Tjernlund
2007-04-26 13:08                       ` David Woodhouse
2007-04-26 13:13                         ` Joakim Tjernlund
2007-04-26 13:15                           ` David Woodhouse
2007-04-26 13:31                             ` David Woodhouse
2007-04-26 13:39                             ` Joakim Tjernlund
2007-04-26 13:44                               ` David Woodhouse
2007-04-26 13:52                                 ` Joakim Tjernlund
2007-04-26 13:56                                   ` David Woodhouse
2007-04-26 14:37                                     ` Joakim Tjernlund
2007-04-26 19:39                                       ` David Woodhouse
2007-04-26 20:21                                         ` Joakim Tjernlund
2007-04-27  9:48                                           ` David Woodhouse
2007-04-25 15:46     ` David Woodhouse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.