* Response to a Write Error
@ 2010-04-13 23:38 Ed F.
From: Ed F. @ 2010-04-13 23:38 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

For several years I have been using rsync with the --link-dest option
in scripts to make incremental backups using hard links.  I have been
testing NILFS2 snapshots as an alternative way to do equivalent
backups.  Some of my experiments have involved writing to a DVD-RAM. I
am reporting what happened when a DVD-RAM had non-recoverable write
errors in case that information is of interest to the developers.
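
(For reference, the core of such a script is an invocation along these
lines, with illustrative paths:

  $ rsync -a --delete --link-dest=/backup/2010-04-10 /home/ /backup/2010-04-11/

Files unchanged since the previous run become hard links to the older
copies instead of being stored again.)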

My DVD-RAM was written on a 64-bit Ubuntu Karmic system with a 2.6.31
kernel.  The NILFS2 module came with that kernel; the NILFS2 utilities
were built from the head of the NILFS2 git repository on April 8, 2010.
Here are the first syslog messages associated with the write errors:

Apr 11 20:46:28 x2 kernel: [496491.160053] usb 1-8: reset high speed
USB device using ehci_hcd and address 5
Apr 11 20:46:35 x2 kernel: [496497.963699] sr 7:0:0:0: [sr1] Unhandled
sense code
Apr 11 20:46:35 x2 kernel: [496497.963708] sr 7:0:0:0: [sr1] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 11 20:46:35 x2 kernel: [496497.963716] sr 7:0:0:0: [sr1] Sense Key
: Medium Error [current]
Apr 11 20:46:35 x2 kernel: [496497.963725] sr 7:0:0:0: [sr1] Add.
Sense: Incomplete block read
Apr 11 20:46:35 x2 kernel: [496497.963735] end_request: I/O error, dev
sr1, sector 8778160
Apr 11 20:46:54 x2 kernel: [496516.494583] sr 7:0:0:0: [sr1] Unhandled
sense code
Apr 11 20:46:54 x2 kernel: [496516.494592] sr 7:0:0:0: [sr1] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 11 20:46:54 x2 kernel: [496516.494600] sr 7:0:0:0: [sr1] Sense Key
: Medium Error [current]
Apr 11 20:46:54 x2 kernel: [496516.494608] sr 7:0:0:0: [sr1] Add.
Sense: No defect spare location available
Apr 11 20:46:54 x2 kernel: [496516.494621] end_request: I/O error, dev
sr1, sector 8781280
Apr 11 20:46:54 x2 kernel: [496516.502194] NILFS: IO error writing segment
Apr 11 20:46:54 x2 kernel: [496516.511162] NILFS warning (device sr1):
nilfs_clean_segments: segment construction failed. (err=-5)

When I detected the errors, I shut down the system and waited until
the following morning to investigate.  When I tried to examine the
files on the DVD-RAM to see which ones were corrupted, the following
error was generated:

Apr 12 10:50:42 x2 kernel: [507977.110489] ------------[ cut here ]------------
Apr 12 10:50:42 x2 kernel: [507977.110502] kernel BUG at
/build/buildd/linux-2.6.31/fs/nilfs2/btree.c:514!
Apr 12 10:50:42 x2 kernel: [507977.110509] invalid opcode: 0000 [#1] SMP
Apr 12 10:50:42 x2 kernel: [507977.110516] last sysfs file:
/sys/devices/pci0000:00/0000:00:02.1/usb1/1-1/idVendor
Apr 12 10:50:42 x2 kernel: [507977.110521] CPU 1
Apr 12 10:50:42 x2 kernel: [507977.110526] Modules linked in:
usb_storage udf crc_itu_t nilfs2 binfmt_misc snd_hda_codec_idt ppdev
snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss
snd_pcm iptable_filter snd_seq_dummy ip_tables x_tables snd_seq_oss
snd_seq_midi parport_pc nvidia(P) psmouse serio_raw snd_rawmidi
snd_seq_midi_event snd_seq snd_timer snd_seq_device snd amd64_edac_mod
i2c_nforce2 edac_core soundcore snd_page_alloc k8temp lp parport
floppy 8139too mii
Apr 12 10:50:42 x2 kernel: [507977.110589] Pid: 24330, comm: rsync
Tainted: P        W  2.6.31-20-generic #57-Ubuntu NFORCE6M-A
Apr 12 10:50:42 x2 kernel: [507977.110595] RIP:
0010:[<ffffffffa0b8d6a1>]  [<ffffffffa0b8d6a1>]
nilfs_btree_do_lookup+0x291/0x2a0 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110631] RSP: 0018:ffff88008900f748
EFLAGS: 00010293
Apr 12 10:50:42 x2 kernel: [507977.110635] RAX: 000000000000003d RBX:
ffff880086949c80 RCX: 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110641] RDX: 0000000000000000 RSI:
ffff88008900f6a8 RDI: ffff88002800cb20
Apr 12 10:50:42 x2 kernel: [507977.110646] RBP: ffff88008900f7b8 R08:
0000000000000001 R09: 0001ce00a2bfa0a0
Apr 12 10:50:42 x2 kernel: [507977.110651] R10: 00000000ffffffff R11:
0000000000000000 R12: ffff8800731ff000
Apr 12 10:50:42 x2 kernel: [507977.110656] R13: 0000000000000001 R14:
ffff8800ba8ace98 R15: 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110663] FS:  00007f82523ed6f0(0000)
GS:ffff88002803d000(0000) knlGS:0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110668] CS:  0010 DS: 0000 ES: 0000
CR0: 0000000080050033
Apr 12 10:50:42 x2 kernel: [507977.110673] CR2: 0000000001ab6ff0 CR3:
00000000196c1000 CR4: 00000000000006e0
Apr 12 10:50:42 x2 kernel: [507977.110679] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110684] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Apr 12 10:50:42 x2 kernel: [507977.110690] Process rsync (pid: 24330,
threadinfo ffff88008900e000, task ffff88010d82c410)
Apr 12 10:50:42 x2 kernel: [507977.110695] Stack:
Apr 12 10:50:42 x2 kernel: [507977.110697]  ffff880000000001
0000000100000001 ffff880086949c00 000000018900f830
Apr 12 10:50:42 x2 kernel: [507977.110705] <0> ffff88008900f810
0000000000000000 ffff88008900f7a8 00000000a0b8dd90
Apr 12 10:50:42 x2 kernel: [507977.110714] <0> 0000000000000001
ffff8800ba8ace98 ffff880086949c00 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110724] Call Trace:
Apr 12 10:50:42 x2 kernel: [507977.110746]  [<ffffffffa0b8ddc5>]
nilfs_btree_lookup_contig+0x65/0x330 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110760]  [<ffffffff8152cea0>] ?
_spin_lock_irq+0x10/0x20
Apr 12 10:50:42 x2 kernel: [507977.110778]  [<ffffffffa0b8b6a8>]
nilfs_bmap_lookup_contig+0x48/0x70 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110795]  [<ffffffffa0b83ff0>]
nilfs_get_block+0xb0/0x250 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110803]  [<ffffffff8152cf59>] ?
_spin_lock+0x9/0x10
Apr 12 10:50:42 x2 kernel: [507977.110814]  [<ffffffff8114f088>]
do_mpage_readpage+0x148/0x600
Apr 12 10:50:42 x2 kernel: [507977.110830]  [<ffffffffa0b83200>] ?
nilfs_readpage+0x0/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110837]  [<ffffffff8114f58a>]
mpage_readpage+0x4a/0x60
Apr 12 10:50:42 x2 kernel: [507977.110853]  [<ffffffffa0b83f40>] ?
nilfs_get_block+0x0/0x250 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110864]  [<ffffffff810f1b65>] ?
__inc_zone_page_state+0x25/0x30
Apr 12 10:50:42 x2 kernel: [507977.110872]  [<ffffffff810daa51>] ?
add_to_page_cache_locked+0xd1/0x130
Apr 12 10:50:42 x2 kernel: [507977.110888]  [<ffffffffa0b83200>] ?
nilfs_readpage+0x0/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110894]  [<ffffffff810daaf6>] ?
add_to_page_cache_lru+0x46/0x70
Apr 12 10:50:42 x2 kernel: [507977.110910]  [<ffffffffa0b83213>]
nilfs_readpage+0x13/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110917]  [<ffffffff810dab99>]
__read_cache_page+0x79/0xe0
Apr 12 10:50:42 x2 kernel: [507977.110932]  [<ffffffffa0b83200>] ?
nilfs_readpage+0x0/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110940]  [<ffffffff810dc071>]
read_cache_page_async+0x31/0xe0
Apr 12 10:50:42 x2 kernel: [507977.110946]  [<ffffffff810dc12d>]
read_cache_page+0xd/0x60
Apr 12 10:50:42 x2 kernel: [507977.110967]  [<ffffffffa0b9c786>] ?
nilfs_palloc_block_get_entry+0x26/0x60 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110983]  [<ffffffffa0b84a3f>]
nilfs_get_page+0x1f/0x80 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111000]  [<ffffffffa0b85388>]
nilfs_find_entry+0x78/0x220 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111017]  [<ffffffffa0b85544>]
nilfs_inode_by_name+0x14/0x30 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111034]  [<ffffffffa0b8761d>]
nilfs_lookup+0x3d/0x70 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111043]  [<ffffffff81128b0a>]
real_lookup+0xda/0x150
Apr 12 10:50:42 x2 kernel: [507977.111052]  [<ffffffff8112a700>]
do_lookup+0xb0/0xe0
Apr 12 10:50:42 x2 kernel: [507977.111059]  [<ffffffff8112b2ad>]
__link_path_walk+0x6ad/0xdc0
Apr 12 10:50:42 x2 kernel: [507977.111067]  [<ffffffff8112bba7>]
path_walk+0x57/0xb0
Apr 12 10:50:42 x2 kernel: [507977.111075]  [<ffffffff8112bc53>]
do_path_lookup+0x53/0xa0
Apr 12 10:50:42 x2 kernel: [507977.111083]  [<ffffffff8112c8f2>]
user_path_at+0x52/0xa0
Apr 12 10:50:42 x2 kernel: [507977.111091]  [<ffffffff8152cf59>] ?
_spin_lock+0x9/0x10
Apr 12 10:50:42 x2 kernel: [507977.111100]  [<ffffffff81276e35>] ?
_atomic_dec_and_lock+0x55/0x80
Apr 12 10:50:42 x2 kernel: [507977.111108]  [<ffffffff811238bf>] ?
cp_new_stat+0xdf/0xf0
Apr 12 10:50:42 x2 kernel: [507977.111114]  [<ffffffff81123ac7>]
vfs_fstatat+0x37/0x70
Apr 12 10:50:42 x2 kernel: [507977.111121]  [<ffffffff81123b69>]
vfs_lstat+0x19/0x20
Apr 12 10:50:42 x2 kernel: [507977.111127]  [<ffffffff81123b8f>]
sys_newlstat+0x1f/0x50
Apr 12 10:50:42 x2 kernel: [507977.111138]  [<ffffffff81012042>]
system_call_fastpath+0x16/0x1b
Apr 12 10:50:42 x2 kernel: [507977.111142] Code: 40 00 48 8b 4b 50 be
01 00 00 00 48 8d 53 10 8b 89 a8 00 00 00 d3 e6 48 63 f6 48 83 ee 10
48 c1 fe 04 48 c1 e6 03 e9 02 fe ff ff <0f> 0b eb fe 66 66 2e 0f 1f 84
00 00 00 00 00 55 48 89 e5 48 83
Apr 12 10:50:42 x2 kernel: [507977.111204] RIP  [<ffffffffa0b8d6a1>]
nilfs_btree_do_lookup+0x291/0x2a0 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111223]  RSP <ffff88008900f748>
Apr 12 10:50:42 x2 kernel: [507977.111230] ---[ end trace c49a62030f4c2c2c ]---

Since I could not determine how to build the latest version of the
kernel module for the 2.6.31 kernel, I rebooted on a 64-bit Ubuntu
Hardy system with a 2.6.24 kernel and installed the latest NILFS2
kernel module and utilities from the git repository.  When I accessed the
DVD-RAM with rsync to see which files were in error, I got the
following messages in syslog:

Apr 12 17:15:37 x2 kernel: [  829.444875] Pid: 18374, comm: rsync Not
tainted 2.6.24-27-generic #1
Apr 12 17:15:37 x2 kernel: [  829.444879]
Apr 12 17:15:37 x2 kernel: [  829.444880] Call Trace:
Apr 12 17:15:37 x2 kernel: [  829.444913]  [<ffffffff8844218f>]
:nilfs2:nilfs_btree_do_lookup+0x28f/0x2b0
Apr 12 17:15:37 x2 kernel: [  829.444935]  [<ffffffff88443a57>]
:nilfs2:nilfs_btree_lookup_contig+0x67/0x330
Apr 12 17:15:37 x2 kernel: [  829.444946]  [__down_read+0x12/0xb1]
__down_read+0x12/0xb1
Apr 12 17:15:37 x2 kernel: [  829.444961]  [<ffffffff884417f7>]
:nilfs2:nilfs_bmap_lookup_contig+0x47/0x70
Apr 12 17:15:37 x2 kernel: [  829.444974]  [<ffffffff8843b354>]
:nilfs2:nilfs_get_block+0xb4/0x250
Apr 12 17:15:37 x2 kernel: [  829.444990]
[do_mpage_readpage+0x40b/0x540] do_mpage_readpage+0x40b/0x540
Apr 12 17:15:37 x2 kernel: [  829.445000]  [<ffffffff8843b2a0>]
:nilfs2:nilfs_get_block+0x0/0x250
Apr 12 17:15:37 x2 kernel: [  829.445026]  [<ffffffff8843a1c0>]
:nilfs2:nilfs_readpage+0x0/0x10
Apr 12 17:15:37 x2 kernel: [  829.445029]
[ext3:mpage_readpage+0x4e/0x150] mpage_readpage+0x4e/0x70
Apr 12 17:15:37 x2 kernel: [  829.445037]  [<ffffffff8843b2a0>]
:nilfs2:nilfs_get_block+0x0/0x250
Apr 12 17:15:37 x2 kernel: [  829.445041]  [__alloc_pages+0xaa/0x3f0]
__alloc_pages+0xaa/0x3f0
Apr 12 17:15:37 x2 kernel: [  829.445054]  [<ffffffff8843a1c0>]
:nilfs2:nilfs_readpage+0x0/0x10
Apr 12 17:15:37 x2 kernel: [  829.445057]
[add_to_page_cache+0xb9/0xd0] add_to_page_cache+0xb9/0xd0
Apr 12 17:15:37 x2 kernel: [  829.445063]
[add_to_page_cache_lru+0x24/0x40] add_to_page_cache_lru+0x24/0x40
Apr 12 17:15:37 x2 kernel: [  829.445068]
[read_cache_page_async+0x96/0x150] read_cache_page_async+0x96/0x150
Apr 12 17:15:37 x2 kernel: [  829.445077]  [read_cache_page+0x6/0x40]
read_cache_page+0x6/0x40
Apr 12 17:15:37 x2 kernel: [  829.445086]  [<ffffffff8843b7b8>]
:nilfs2:nilfs_get_page+0x38/0x270
Apr 12 17:15:37 x2 kernel: [  829.445091]  [__slab_alloc+0x1bd/0x410]
__slab_alloc+0x1bd/0x410
Apr 12 17:15:37 x2 kernel: [  829.445100]  [<ffffffff8843cd72>]
:nilfs2:nilfs_alloc_inode_common+0x22/0x70
Apr 12 17:15:37 x2 kernel: [  829.445115]  [<ffffffff8843bd64>]
:nilfs2:nilfs_find_entry+0x84/0x1c0
Apr 12 17:15:37 x2 kernel: [  829.445134]  [<ffffffff8843beb1>]
:nilfs2:nilfs_inode_by_name+0x11/0x30
Apr 12 17:15:37 x2 kernel: [  829.445146]  [<ffffffff8843ec55>]
:nilfs2:nilfs_lookup+0x45/0xc0
Apr 12 17:15:37 x2 kernel: [  829.445153]  [do_lookup+0x1b6/0x250]
do_lookup+0x1b6/0x250
Apr 12 17:15:37 x2 kernel: [  829.445165]
[__link_path_walk+0x74c/0xe90] __link_path_walk+0x74c/0xe90
Apr 12 17:15:37 x2 kernel: [  829.445180]  [link_path_walk+0x5b/0x100]
link_path_walk+0x5b/0x100
Apr 12 17:15:37 x2 kernel: [  829.445202]  [do_path_lookup+0x8a/0x250]
do_path_lookup+0x8a/0x250
Apr 12 17:15:37 x2 kernel: [  829.445206]  [getname+0x1a9/0x220]
getname+0x1a9/0x220
Apr 12 17:15:37 x2 kernel: [  829.445213]  [__user_walk_fd+0x4b/0x80]
__user_walk_fd+0x4b/0x80
Apr 12 17:15:37 x2 kernel: [  829.445220]  [vfs_lstat_fd+0x2c/0x70]
vfs_lstat_fd+0x2c/0x70
Apr 12 17:15:37 x2 kernel: [  829.445239]  [sys_newlstat+0x27/0x50]
sys_newlstat+0x27/0x50
Apr 12 17:15:37 x2 kernel: [  829.445256]  [system_call+0x7e/0x83]
system_call+0x7e/0x83
Apr 12 17:15:37 x2 kernel: [  829.445270]
Apr 12 17:15:37 x2 kernel: [  829.445271] NILFS: btree level mismatch: 61 != 1

The utility lssu reported the following segment structure:

              SEGNUM        DATE     TIME STAT     NBLOCKS
                   0  ---------- --:--:--  ---           0
              ...
                  20  ---------- --:--:--  ---           0
                  21  2010-04-11 23:23:21  ad-         268
                  22  ---------- --:--:--  ad-           0
                  23  ---------- --:--:--  ---           0
              ...
                 191  ---------- --:--:--  ---           0
                 192  2010-04-11 17:18:46  -d-        2048
                 193  2010-04-11 17:18:47  -d-        2048
                 194  2010-04-11 17:18:58  -d-        2048
              ...
                 533  2010-04-11 20:42:02  -d-        2048
                 534  2010-04-11 20:42:54  -d-        2048
                 535  2010-04-11 22:13:18  -d-        2047
                 536  ---------- --:--:--  ---           0
                 537  ---------- --:--:--  ---           0
                 538  ---------- --:--:--  ---           0
                 539  ---------- --:--:--  ---           0
                 540  ---------- --:--:--  ---           0
                 541  ---------- --:--:--  ---           0
                 542  ---------- --:--:--  ---           0
                 543  ---------- --:--:--  ---           0
                 544  ---------- --:--:--  ---           0
                 545  ---------- --:--:--  ---           0

The checkpoints on the DVD-RAM included 4 snapshots, each of which was
associated with a distinct backup that I had copied to build the
DVD-RAM.  I compared those snapshots with the sample backups from
which they were constructed and found that a single directory was
missing from each of the snapshots.  That directory was an old one
that was unmodified during the period from which I selected the sample
backups.  I assume that the cleaner was copying that directory when
the write errors began and that a shared copy of that directory was
lost from all of the snapshots.

In summary, it appears that the write errors were isolated to a single
directory.  The errors resulted in a file system that was unusable
with the NILFS2 kernel module in 2.6.31 but was accessible with the
latest version of the kernel module when installed on an older kernel
(2.6.24).  (Is there some way to install a version of the NILFS2
kernel module from the NILFS2 git repository on a kernel newer than
2.6.29?)

I have not yet reformatted the DVD-RAM.  I will perform additional
experiments with it if they would be helpful.

Ed
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Response to a Write Error
@ 2010-04-14 16:51   ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-14 16:51 UTC (permalink / raw)
  To: third07-Re5JQEeQqe8AvxtiuMwx3w; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Tue, 13 Apr 2010 19:38:30 -0400, "Ed F." <third07-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> For several years I have been using rsync with the --link-dest option
> in scripts to make incremental backups using hard links.  I have been
> testing NILFS2 snapshots as an alternative way to do equivalent
> backups.  Some of my experiments have involved writing to a DVD-RAM. I
> am reporting what happened when a DVD-RAM had non-recoverable write
> errors in case that information is of interest to the developers.
> 
> My DVD-RAM was written on a 64-bit Ubuntu Karmic system with a 2.6.31
> kernel.  The NILFS2 module came with that kernel; the NILFS2 utilities
> were built from the head of the NILFS2 git repository on April 8, 2010.

Thank you for the detailed report.

According to the syslog messages, the btree of a file or a directory
broke for some reason.  The difference in behavior between the
standalone nilfs2 module and the 2.6.31-20 Ubuntu kernel is not a bug.

It's because the latest nilfs2 standalone module includes a backported
patch to handle btree corruption as an error instead of triggering a
kernel BUG.

I have no idea why the corruption occurred, since the 2.6.31-20 Ubuntu
kernel is almost the same as the 2.6.31.y stable kernel.

I'd appreciate it if you could help test whether the same problem
arises with nilfs in 2.6.32.y or 2.6.33.y.

The following are the steps to build and install a newer nilfs2 module
on the 2.6.31 kernel:

  $ sudo apt-get install linux-headers-`uname -r`
  $ cd /usr/src
  $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.11.tar.bz2
  $ tar jxf linux-2.6.32.11.tar.bz2
  $ cd linux-2.6.32.11/fs/nilfs2
  $ make -C /lib/modules/`uname -r`/build SUBDIRS=$PWD BUILD_DIR=$PWD modules
  $ su
  # rmmod nilfs2
  # mv /lib/modules/`uname -r`/kernel/fs/nilfs2/nilfs2.ko{,.bak}
  # install -m 644 -c nilfs2.ko /lib/modules/`uname -r`/kernel/fs/nilfs2/
  # /sbin/depmod -A
  # mount -t nilfs2 <device> <mount-dir>
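
To double-check that the replacement module is the one that will be
picked up, something like this may help (the vermagic string should
match the running kernel):

  # modinfo nilfs2 | grep vermagic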


As for the write errors, the following two patches may work:

"nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT" (merged to v2.6.32)
"nilfs2: do not return io error for bio allocation failure" (merged to v2.6.33)

It's hard to track the cause from the raw block data, so you don't
have to keep the DVD-RAM data.  (A dumpseg log for the in-use segments
is enough for me.)

Thanks,
Ryusuke Konishi 



* Re: Response to a Write Error
@ 2010-04-15  0:46       ` Ed F.
From: Ed F. @ 2010-04-15  0:46 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 14, 2010 at 12:51 PM, Ryusuke Konishi <ryusuke-sG5X7nlA6pw@public.gmane.org> wrote:

> I have no idea why the corruption occurred, since the 2.6.31-20 Ubuntu
> kernel is almost the same as the 2.6.31.y stable kernel.

I believe that the error was caused by accumulated defects on the
DVD-RAM disk, not by NILFS2.  I interpret the error messages to mean
that the DVD drive detected a hardware error when it tried to read
sector 8778160 and then had a write error at sector 8781280 that could
not be corrected by the drive because there were no more spare
defect-correction sectors on that particular DVD-RAM disk.

> I'd appreciate it if you could help test whether the same problem
> arises with nilfs in 2.6.32.y or 2.6.33.y.

The problem does not seem to be repeatable, even on 2.6.31.  I
detected no errors when I previously ran the same general test many
times using both that DVD-RAM and a loop-back file system on a hard
drive.

I will try again when I upgrade to 2.6.32 with Ubuntu 10.04.

> The following are the steps to build and install a newer nilfs2 module
> on the 2.6.31 kernel:
>
>  $ sudo apt-get install linux-headers-`uname -r`
>  $ cd /usr/src
>  $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.11.tar.bz2
>  $ tar jxf linux-2.6.32.11.tar.bz2
>  $ cd linux-2.6.32.11/fs/nilfs2
>  $ make -C /lib/modules/`uname -r`/build SUBDIRS=$PWD BUILD_DIR=$PWD modules
>  $ su
>  # rmmod nilfs2
>  # mv /lib/modules/`uname -r`/kernel/fs/nilfs2/nilfs2.ko{,.bak}
>  # install -m 644 -c nilfs2.ko /lib/modules/`uname -r`/kernel/fs/nilfs2/
>  # /sbin/depmod -A
>  # mount -t nilfs2 <device> <mount-dir>

Thank you for clarifying that process.  It will be very useful in
keeping the kernel module up-to-date.

> As for the write errors, the following two patches may work:
>
> "nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT" (merged to v2.6.32)
> "nilfs2: do not return io error for bio allocation failure" (merged to v2.6.33)
>
> It's hard to track the cause from the raw block data, so you don't
> have to keep the DVD-RAM data.  (dumpseg log for in-use segments is
> enough to me)

I will send a segment dump off-list.

If my analysis is correct, I have a DVD-RAM disk that has no more
spare sectors with which my DVD drive can correct defects.  Would you
like me to save that disk for future error-handling experiments?

Ed


* Re: Response to a Write Error
@ 2010-04-16 12:31           ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-16 12:31 UTC (permalink / raw)
  To: third07-Re5JQEeQqe8AvxtiuMwx3w; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi
On Wed, 14 Apr 2010 20:46:46 -0400, "Ed F." <third07-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > The following are the steps to build and install a newer nilfs2 module
> > on the 2.6.31 kernel:
> >
> >  $ sudo apt-get install linux-headers-`uname -r`
> >  $ cd /usr/src
> >  $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.11.tar.bz2
> >  $ tar jxf linux-2.6.32.11.tar.bz2
> >  $ cd linux-2.6.32.11/fs/nilfs2
> >  $ make -C /lib/modules/`uname -r`/build SUBDIRS=$PWD BUILD_DIR=$PWD modules
> >  $ su
> >  # rmmod nilfs2
> >  # mv /lib/modules/`uname -r`/kernel/fs/nilfs2/nilfs2.ko{,.bak}
> >  # install -m 644 -c nilfs2.ko /lib/modules/`uname -r`/kernel/fs/nilfs2/
> >  # /sbin/depmod -A
> >  # mount -t nilfs2 <device> <mount-dir>
> 
> Thank you for clarifying that process.  It will be very useful in
> keeping the kernel module up-to-date.

Please note that this procedure is not guaranteed to work.  It may
fail for some versions, since the internal implementation or semantics
of the kernel often change.

> If my analysis is correct, I have a DVD-RAM disk that has no more
> sectors with which my DVD drive can correct defects.  Would you like
> for me to save that disk for future error handling experiments?

Thanks for the offer.
I am interested in what happens on the latest kernel.
If you have an opportunity to try it, please report the result.

Ideally, nilfs is expected to skip erroneous segments.
In reality, something unexpected may happen ;)

Cheers,
Ryusuke Konishi


* Writes doubled by NILFS2
@ 2010-04-20  8:39               ` Yongkun Wang
From: Yongkun Wang @ 2010-04-20  8:39 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hey, guys,

We have a database system whose data is stored on a disk formatted with
NILFS2 (nilfs-2.0.15, kmod-nilfs-2.0.5-1.2.6.18_92.1.22.el5.x86_64).

I have run a trace at the system call level and the block IO level,
that is, tracing the requests before and after they are processed by
NILFS2.
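
(For reference, traces of this kind can be collected with, e.g., strace
at the system call level and blktrace at the block IO level; the PID
and device below are illustrative:

  $ strace -f -e trace=write,pwrite64 -p <database-pid>
  # blktrace -d /dev/sdb1 -o dbtrace
  # blkparse -i dbtrace
)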

We use synchronous IO, so the amount of writes at the two trace points
should be equal.  That is the case when we use the EXT2 file system.

However, for NILFS2, we found that the writes are doubled, that is, the
amount of writes is doubled after being processed by NILFS2.  The amount
of writes at the system call level is equal between EXT2 and NILFS2.

Since all the addresses are log-structured, it is hard to tell what
the additional writes are.

Can you provide some hints on the additional writes?  Are they caused
by some special function such as snapshots?

Thank you very much.

Best Regards,
Yongkun



* Re: Writes doubled by NILFS2
@ 2010-04-20 11:45                 ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-20 11:45 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Tue, 20 Apr 2010 17:39:13 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> Hey, guys,
> 
> We have a database system whose data is stored on a disk formatted with
> NILFS2 (nilfs-2.0.15, kmod-nilfs-2.0.5-1.2.6.18_92.1.22.el5.x86_64).
> 
> I have run a trace at the system call level and the block IO level,
> that is, tracing the requests before and after they are processed by
> NILFS2.
> 
> We use synchronous IO, so the amount of writes at the two trace points
> should be equal.  That is the case when we use the EXT2 file system.
> 
> However, for NILFS2, we found that the writes are doubled, that is, the
> amount of writes is doubled after being processed by NILFS2.  The amount
> of writes at the system call level is equal between EXT2 and NILFS2.

Interesting results.  What kind of synchronous write did you use in
the measurement?  fsync, or O_SYNC writes?

> Since all the addresses are log-structured, it is hard to tell what
> the additional writes are.
>
> Can you provide some hints on the additional writes?  Are they caused
> by some special function such as snapshots?

You can look into the logs with the dumpseg(8) command:

 # dumpseg <segment number>

This shows a summary of the blocks written in the specified segment.
The lssu(1) command can help with finding a log head.


In the dump log, files with inode numbers 3, 4, 5, and 6 are metadata.
The log format is depicted on page 10 of the following slides:

  http://www.nilfs.org/papers/jls2009-nilfs.pdf
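
For example, to pull out just the per-file block summaries from a
segment dump (the segment number here is only an illustration):

  # dumpseg 698 | grep ino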


In general, copy-on-write filesystems, including log-structured ones,
are said to incur overhead from metadata writes, especially for
synchronous writes.

I guess small-sized fsyncs or O_SYNC writes are causing the overhead.

Thanks,
Ryusuke


* RE: Writes doubled by NILFS2
@ 2010-04-21 14:41                     ` Yongkun Wang
From: Yongkun Wang @ 2010-04-21 14:41 UTC (permalink / raw)
  To: 'Ryusuke Konishi'; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi, Ryusuke,

Thank you for the reply.

It's O_SYNC.
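
(For anyone wanting to reproduce a stream of small O_SYNC writes, a
minimal sketch with an illustrative path and sizes is:

  $ dd if=/dev/zero of=/nilfs/testfile bs=4k count=1000 oflag=sync

each 4 KiB write then completes synchronously before the next one is
issued.)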

The dumpseg and lssu utilities are useful tools, and the description
of the structure in the slides is very clear.
Thank you for the hint about checking the metadata.

The information reported by dumpseg:

# dumpseg 698 | grep ino
      ino = 12, cno = 9, nblocks = 1356, ndatblk = 1343
      ino = 6, cno = 9, nblocks = 1, ndatblk = 1
      ino = 4, cno = 0, nblocks = 1, ndatblk = 1
      ino = 5, cno = 0, nblocks = 2, ndatblk = 2
      ino = 3, cno = 0, nblocks = 681, ndatblk = 681

The file with inode 12 (ino=12) is the data file.  "nblocks" is the
number of occupied blocks, right?

In this segment, the number of data-file blocks is 1356, and the number
of remaining blocks is 685 (1+1+2+681).
So there are 685 "overhead" blocks in this segment?

This may explain the additional writes in our trace.

Best Regards,
Yongkun



* Re: Writes doubled by NILFS2
@ 2010-04-21 16:59                       ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-21 16:59 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Wed, 21 Apr 2010 23:41:13 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> Hi, Ryusuke,
> 
> Thank you for the reply.
> 
> It's O_SYNC.

So the metadata is written out on every write call.

> The dumpseg and lssu utilities are useful tools, and the description
> of the structure in the slides is very clear.
> Thank you for the hint about checking the metadata.
> 
> The information by dumpseg:
> 
> # dumpseg 698 | grep ino
>       ino = 12, cno = 9, nblocks = 1356, ndatblk = 1343
>       ino = 6, cno = 9, nblocks = 1, ndatblk = 1
>       ino = 4, cno = 0, nblocks = 1, ndatblk = 1
>       ino = 5, cno = 0, nblocks = 2, ndatblk = 2
>       ino = 3, cno = 0, nblocks = 681, ndatblk = 681
>
> The file with inode 12 (ino=12) is the data file.  "nblocks" is the
> number of occupied blocks, right?

Right.

> In this segment, the number of data-file blocks is 1356, and the number
> of remaining blocks is 685 (1+1+2+681).
> So there are 685 "overhead" blocks in this segment?

Roughly, yes.

One concern is that the regular file (ino=12) may be continued from
the preceding segments, and the DAT file (ino=3) may continue into
segment 699.

The number of DAT file blocks (ino=3) looks very large to me.

Ideally, the ratio of N-data-blocks to N-dat-blocks comes close to

 128 : 1

since a 4 KiB DAT block stores 128 entries of indirect block
addresses.
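
(Assuming each DAT entry occupies 32 bytes, that is 4096 / 32 = 128
entries per 4 KiB block.)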

If the above count is normal, your workload seems extremely random and
scattered.

Can you probe the previous and next segments?

> This may explain the additional writes in our trace.

Very interesting.

Thanks,
Ryusuke Konishi



* RE: Writes doubled by NILFS2
@ 2010-04-22 13:54                           ` Yongkun Wang
From: Yongkun Wang @ 2010-04-22 13:54 UTC (permalink / raw)
  To: 'Ryusuke Konishi'; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi, Ryusuke,

The data file is very large, several GB, and it is scattered across
hundreds of segments, so "ndatblk" may be very large.

I think all the "ndatblk" should also be counted as overhead.  Then
the amount of superfluous blocks is almost equal to that in our trace.

Yes, the workload is composed of small random writes over a large
address space.  The workload is also very intensive; even with
asynchronous IO, the results are almost the same.

The data was reloaded last night; 3 consecutive segments are selected
as follows, just FYI.

# dumpseg 860 | grep ino
	ino = 12, cno = 175, nblocks = 1090, ndatblk = 1090
(957 lines with ino=12 merged into the line above; nblocks and ndatblk
summed.)

# dumpseg 861 | grep ino
	ino = 12, cno = 175, nblocks = 1161, ndatblk = 585
(404 lines with ino=12 merged into the line above; nblocks and ndatblk
summed.)
	ino = 6, cno = 175, nblocks = 1, ndatblk = 1
	ino = 4, cno = 0, nblocks = 3, ndatblk = 2
	ino = 5, cno = 0, nblocks = 2, ndatblk = 2
	ino = 3, cno = 0, nblocks = 475, ndatblk = 475

# dumpseg 862 | grep ino
	ino = 3, cno = 0, nblocks = 720, ndatblk = 693
	ino = 12, cno = 176, nblocks = 693, ndatblk = 693
(631 lines with ino=12 merged into the line above; nblocks and ndatblk
summed.)
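
The per-inode sums above can be produced along these lines; a sketch
that assumes the "ino = ..., nblocks = ..., ndatblk = ..." line format
shown here:

  # dumpseg 861 | grep ino | awk -F'[=,]' '
      {nb[$2] += $6; nd[$2] += $8}
      END {for (i in nb)
             printf "ino =%s, nblocks = %d, ndatblk = %d\n", i, nb[i], nd[i]}'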

Please correct me if there is something wrong.

Thank you.

Best Regards,
Yongkun

-----Original Message-----
From: Ryusuke Konishi [mailto:ryusuke-sG5X7nlA6pw@public.gmane.org] 
Sent: Thursday, April 22, 2010 1:59 AM
To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org
Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: Writes doubled by NILFS2

Hi,
On Wed, 21 Apr 2010 23:41:13 +0900, "Yongkun Wang"
<yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> hi, Ryusuke,
> 
> Thank you for the reply.
> 
> It's O_SYNC.

So, the metadata is written out every write call.
 
> The dumpseg and lssu are useful tools. The description of the structure in
> the slides is very clear. 
> Thank you for the hint of checking the meta data.
> 
> The information by dumpseg:
> 
> # dumpseg 698 | grep ino
>       ino = 12, cno = 9, nblocks = 1356, ndatblk = 1343
>       ino = 6, cno = 9, nblocks = 1, ndatblk = 1
>       ino = 4, cno = 0, nblocks = 1, ndatblk = 1
>       ino = 5, cno = 0, nblocks = 2, ndatblk = 2
>       ino = 3, cno = 0, nblocks = 681, ndatblk = 681
>
> File with inode 12 (ino=12) is the data file. "nblocks" is the number of
> occupied blocks, right?

Right.

> In this segment, the number of data file blocks is 1356, and the number of
> the rest of the blocks is 685 (1+1+2+681).
> Total 685 "overhead" blocks in this segment?

Roughly, yes.

One matter for concern is that the regular file (ino=12) may be
continued from the precedent segments.  And, the DAT file (ino=3) may
possibly continue into the segment 699.

The number of the DAT file blocks (ino=3) looks so large to me.

Ideally the ratio of N-data-blocks and N-dat-blocks comes close to

 128 : 1

since a 4 KiB DAT block stores 128 entries of indirect block
addresses.

If the above count is normal, your workload seems extremely random and
scattered.

Can you probe the previous and next segment?

> This may explain the additional writes in our trace.

Very interesting.

Thanks,
Ryusuke Konishi

> -----Original Message-----
> From: Ryusuke Konishi [mailto:ryusuke-sG5X7nlA6pw@public.gmane.org] 
> Sent: Tuesday, April 20, 2010 8:45 PM
> To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org
> Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: Writes doubled by NILFS2
> 
> Hi,
> On Tue, 20 Apr 2010 17:39:13 +0900, "Yongkun Wang"
> <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> > Hey, guys,
> > 
> > We have a database system, the data is stored on the disk formatted with
> > NILFS2 (nilfs-2.0.15, kmod-nilfs-2.0.5-1.2.6.18_92.1.22.el5.x86_64).
> > 
> > I have run a trace at the system call level and the block IO level, that
> is,
> > tracing the requests before processed by NILFS2 and after processed by
> > NILFS2.
> > 
> > We use synchronous IO. So the amount of writes at the two trace points
> > should be equal. 
> > It is true when we use EXT2 file system.
> > 
> > However, for NILFS2, we found that the writes have been doubled; that is,
> > the amount of writes is doubled after being processed by NILFS2. The amount
> > of writes at the system call level is equal between EXT2 and NILFS2.
> 
> Interesting results.  What kind of synchronous write did you use in
> the measurement?  fsync, or O_SYNC writes?
>  
> > Since all the addresses are log-structured, it is hard to know what the
> > additional writes are.
> >
> > Can you provide some hints on the additional writes? Are they caused by
> > some special function such as snapshots?
> 
> You can look into the logs with the dumpseg(8) command:
> 
>  # dumpseg <segment number>
> 
> This shows a summary of the blocks written in the specified segment. The
> lssu(1) command would be of help for finding a log head.
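> 
> For example (a minimal sketch; pick a recently written segment from the
> lssu listing):
> 
>  # lssu                                   # show per-segment usage state
>  # dumpseg <segment number> | grep ino    # per-file block counts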
> 
> 
> In the dump log, files with inode numbers 3, 4, 5, and 6 are metadata.  The
> log format is depicted on page 10 of the following slides:
> 
>   http://www.nilfs.org/papers/jls2009-nilfs.pdf
> 
> 
> In general, copy-on-write filesystems, including LFS, are said to incur
> overhead from metadata writes, especially for synchronous writes.
> 
> I guess small-sized fsyncs or O_SYNC writes are causing the overhead.
> 
> Thanks,
> Ryusuke
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Writes doubled by NILFS2
  2010-04-22 13:54                           ` Yongkun Wang
@ 2010-04-23 15:59                             ` Ryusuke Konishi
       [not found]                               ` <20100424.005941.176067504.ryusuke-sG5X7nlA6pw@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Ryusuke Konishi @ 2010-04-23 15:59 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Thu, 22 Apr 2010 22:54:59 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> hi, Ryusuke,
> 
> The data file is very large (several GB) and scatters across hundreds of
> segments, so the "ndatblk" may be very large.
>
> I think all the "ndatblk" should also be considered overhead. So the
> amount of superfluous blocks is almost equal to that in our trace.

No, "ndatblk" means the number of file data blocks.

Only the "ndatblk" of metadata files whose inode number is 3, 4,
5, or 6 should be regarded as metadata.
 
> Yes, the workload is composed of small random writes on a large address
> space. And the workload is very intensive; even using asynchronous IO, the
> results are almost the same.
> 
> The data was reloaded last night; 3 consecutive segments are selected
> as follows, just FYI.
> 
> # dumpseg 860 | grep ino
> 	ino = 12, cno = 175, nblocks = 1090, ndatblk = 1090
> 957 further lines with ino=12 were merged; nblocks and ndatblk are summed
> respectively.

This means that all the 1090 blocks (nblocks) are "ndatblk" (i.e. data
blocks) and no btree node blocks are included in the diff.

In other words, "ndatblk" shows a breakdown of "nblocks".

Certainly, this abbreviation looks confusing.

> # dumpseg 861 | grep ino
> 	ino = 12, cno = 175, nblocks = 1161, ndatblk = 585
> 404 further lines with ino=12 were merged; nblocks and ndatblk are summed respectively.

This shows that the remaining 576 of the 1161 blocks (1161 - 585) were
btree node blocks.  (A very large btree update!)

> 	ino = 6, cno = 175, nblocks = 1, ndatblk = 1
> 	ino = 4, cno = 0, nblocks = 3, ndatblk = 2
> 	ino = 5, cno = 0, nblocks = 2, ndatblk = 2
> 	ino = 3, cno = 0, nblocks = 475, ndatblk = 475
> 
> # dumpseg 862 | grep ino
> 	ino = 3, cno = 0, nblocks = 720, ndatblk = 693
> 	ino = 12, cno = 176, nblocks = 693, ndatblk = 693
> 631 further lines with ino=12 were merged; nblocks and ndatblk are summed respectively.
> 
> Please correct me if anything is wrong.
> 
> Thank you.
> 
> Best Regards,
> Yongkun

With regards,
Ryusuke Konishi
 
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Writes doubled by NILFS2
       [not found]                               ` <20100424.005941.176067504.ryusuke-sG5X7nlA6pw@public.gmane.org>
@ 2010-04-25 14:10                                 ` Yongkun Wang
  2010-04-25 16:12                                   ` Ryusuke Konishi
  0 siblings, 1 reply; 12+ messages in thread
From: Yongkun Wang @ 2010-04-25 14:10 UTC (permalink / raw)
  To: 'Ryusuke Konishi'; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi, Ryusuke,

You have made it clear to me.

nblocks: blocks occupied.
ndatblk: valid data blocks, the "leaf" blocks of the B-tree.
nblocks - ndatblk: the intermediate B-tree node blocks.

The ratio of overhead blocks is:
1 - sum(ndatblk[ino=12])/sum(nblocks)
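
Per segment, this ratio can be computed straight from the dumpseg output; a
minimal awk sketch (assuming only the "ino = ..., nblocks = ..., ndatblk =
..." line format shown above, with 861 as an example segment):

 # dumpseg 861 | grep ino | awk -F'[=,]' \
     '{ nb += $6 } $2 == 12 { nd += $8 } END { print 1 - nd / nb }'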

It is around 50% in our system, which seems high. Is it possible to
optimize it further?

>"This shows rear 576 blocks of the 1161 blocks were btree node blocks.
>(very large btree update!)"

Is it abnormal? Small random updates are common in transaction
processing systems. Maybe the B-tree node updating algorithm should be
revised for this kind of workload.

The dumpseg output for each segment is large.
I can send it to you as an attachment if necessary.

Thanks again.

Best Regards,
Yongkun

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Writes doubled by NILFS2
  2010-04-25 14:10                                 ` Yongkun Wang
@ 2010-04-25 16:12                                   ` Ryusuke Konishi
  0 siblings, 0 replies; 12+ messages in thread
From: Ryusuke Konishi @ 2010-04-25 16:12 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Sun, 25 Apr 2010 23:10:01 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> Hi, Ryusuke,
> 
> You have made it clear to me.
> 
> nblocks: blocks occupied.
> ndatblk: valid data blocks, the "leaf" blocks of the B-tree.
> nblocks - ndatblk: the intermediate B-tree node blocks.
>
> The ratio of overhead blocks is:
> 1 - sum(ndatblk[ino=12])/sum(nblocks)

Exactly.

> It is around 50% in our system, which seems high. Is it possible to
> optimize it further?

Well, this may be alleviated by extending the log format so that
O_SYNC writes do not write out metadata so often.

Actually, we partially applied this kind of optimization, but it seems
the conditions are not met in your case.

O_SYNC writes semantically require inode updates, and I guess this
prevents application of the optimization.

If you can change the sync method to O_DSYNC, things may change.
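
For a quick comparison from the command line (a rough sketch; GNU dd's
oflag=sync and oflag=dsync select the corresponding O_SYNC and O_DSYNC open
flags, and /mnt/nilfs/testfile is only a hypothetical path on the volume):

 # dd if=/dev/zero of=/mnt/nilfs/testfile bs=4k count=1000 oflag=sync
 # dd if=/dev/zero of=/mnt/nilfs/testfile bs=4k count=1000 oflag=dsync

Comparing dumpseg output for the segments written by each run should show
whether the metadata block counts shrink under O_DSYNC.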

> >"This shows rear 576 blocks of the 1161 blocks were btree node blocks.
> >(very large btree update!)"
> 
> Is it abnormal? Small random updates are common in transaction
> processing systems. Maybe the B-tree node updating algorithm should be
> revised for this kind of workload.

Agreed.  The frequent writing of btree nodes and DAT blocks is
undesirable.

I'll think about whether it's possible to apply the above optimization
even when the inode is updated.

> The dumpseg output for each segment is large.
> I can send it to you as an attachment if necessary.

Thanks.  But things seem clear from your reports thus far.

Regards,
Ryusuke Konishi

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-04-25 16:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-13 23:38 Response to a Write Error Ed F.
     [not found] ` <t2kf2d3ba1f1004131638nb61e68cek1ecb43a86c316d35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-04-14 16:51   ` Ryusuke Konishi
     [not found]     ` <20100415.015106.58888665.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-15  0:46       ` Ed F.
     [not found]         ` <j2kf2d3ba1f1004141746n9f914465q3e24d48bec7f990e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-04-16 12:31           ` Ryusuke Konishi
     [not found]             ` <20100416.213151.77349735.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-20  8:39               ` Writes doubled by NILFS2 Yongkun Wang
2010-04-20 11:45                 ` Ryusuke Konishi
     [not found]                   ` <20100420.204523.76757028.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-21 14:41                     ` Yongkun Wang
2010-04-21 16:59                       ` Ryusuke Konishi
     [not found]                         ` <20100422.015907.119858125.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-22 13:54                           ` Yongkun Wang
2010-04-23 15:59                             ` Ryusuke Konishi
     [not found]                               ` <20100424.005941.176067504.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-25 14:10                                 ` Yongkun Wang
2010-04-25 16:12                                   ` Ryusuke Konishi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.