* Response to a Write Error
@ 2010-04-13 23:38 Ed F.
From: Ed F. @ 2010-04-13 23:38 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

For several years I have been using rsync with the --link-dest option
in scripts to make incremental backups using hard links.  I have been
testing NILFS2 snapshots as an alternative way to do equivalent
backups.  Some of my experiments have involved writing to a DVD-RAM. I
am reporting what happened when a DVD-RAM had non-recoverable write
errors in case that information is of interest to the developers.
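
(For reference, the core of such a script is an invocation along these
lines, with illustrative paths:

  $ rsync -a --delete --link-dest=/backup/2010-04-10 /home/ /backup/2010-04-11/

Files unchanged since the previous run become hard links to the older
copies instead of being stored again.)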

My DVD-RAM was written on a 64-bit Ubuntu Karmic system with a 2.6.31
kernel.  The NILFS2 module came with that kernel; the NILFS2 utilities
were built from the head of the NILFS2 git repository on April 8, 2010.
Here are the first syslog messages associated with the write errors:

Apr 11 20:46:28 x2 kernel: [496491.160053] usb 1-8: reset high speed
USB device using ehci_hcd and address 5
Apr 11 20:46:35 x2 kernel: [496497.963699] sr 7:0:0:0: [sr1] Unhandled
sense code
Apr 11 20:46:35 x2 kernel: [496497.963708] sr 7:0:0:0: [sr1] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 11 20:46:35 x2 kernel: [496497.963716] sr 7:0:0:0: [sr1] Sense Key
: Medium Error [current]
Apr 11 20:46:35 x2 kernel: [496497.963725] sr 7:0:0:0: [sr1] Add.
Sense: Incomplete block read
Apr 11 20:46:35 x2 kernel: [496497.963735] end_request: I/O error, dev
sr1, sector 8778160
Apr 11 20:46:54 x2 kernel: [496516.494583] sr 7:0:0:0: [sr1] Unhandled
sense code
Apr 11 20:46:54 x2 kernel: [496516.494592] sr 7:0:0:0: [sr1] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 11 20:46:54 x2 kernel: [496516.494600] sr 7:0:0:0: [sr1] Sense Key
: Medium Error [current]
Apr 11 20:46:54 x2 kernel: [496516.494608] sr 7:0:0:0: [sr1] Add.
Sense: No defect spare location available
Apr 11 20:46:54 x2 kernel: [496516.494621] end_request: I/O error, dev
sr1, sector 8781280
Apr 11 20:46:54 x2 kernel: [496516.502194] NILFS: IO error writing segment
Apr 11 20:46:54 x2 kernel: [496516.511162] NILFS warning (device sr1):
nilfs_clean_segments: segment construction failed. (err=-5)

When I detected the errors, I shut down the system and waited until
the following morning to investigate.  When I tried to examine the
files on the DVD-RAM to see which ones were corrupted, the following
error was generated:

Apr 12 10:50:42 x2 kernel: [507977.110489] ------------[ cut here ]------------
Apr 12 10:50:42 x2 kernel: [507977.110502] kernel BUG at
/build/buildd/linux-2.6.31/fs/nilfs2/btree.c:514!
Apr 12 10:50:42 x2 kernel: [507977.110509] invalid opcode: 0000 [#1] SMP
Apr 12 10:50:42 x2 kernel: [507977.110516] last sysfs file:
/sys/devices/pci0000:00/0000:00:02.1/usb1/1-1/idVendor
Apr 12 10:50:42 x2 kernel: [507977.110521] CPU 1
Apr 12 10:50:42 x2 kernel: [507977.110526] Modules linked in:
usb_storage udf crc_itu_t nilfs2 binfmt_misc snd_hda_codec_idt ppdev
snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss
snd_pcm iptable_filter snd_seq_dummy ip_tables x_tables snd_seq_oss
snd_seq_midi parport_pc nvidia(P) psmouse serio_raw snd_rawmidi
snd_seq_midi_event snd_seq snd_timer snd_seq_device snd amd64_edac_mod
i2c_nforce2 edac_core soundcore snd_page_alloc k8temp lp parport
floppy 8139too mii
Apr 12 10:50:42 x2 kernel: [507977.110589] Pid: 24330, comm: rsync
Tainted: P        W  2.6.31-20-generic #57-Ubuntu NFORCE6M-A
Apr 12 10:50:42 x2 kernel: [507977.110595] RIP:
0010:[<ffffffffa0b8d6a1>]  [<ffffffffa0b8d6a1>]
nilfs_btree_do_lookup+0x291/0x2a0 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110631] RSP: 0018:ffff88008900f748
EFLAGS: 00010293
Apr 12 10:50:42 x2 kernel: [507977.110635] RAX: 000000000000003d RBX:
ffff880086949c80 RCX: 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110641] RDX: 0000000000000000 RSI:
ffff88008900f6a8 RDI: ffff88002800cb20
Apr 12 10:50:42 x2 kernel: [507977.110646] RBP: ffff88008900f7b8 R08:
0000000000000001 R09: 0001ce00a2bfa0a0
Apr 12 10:50:42 x2 kernel: [507977.110651] R10: 00000000ffffffff R11:
0000000000000000 R12: ffff8800731ff000
Apr 12 10:50:42 x2 kernel: [507977.110656] R13: 0000000000000001 R14:
ffff8800ba8ace98 R15: 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110663] FS:  00007f82523ed6f0(0000)
GS:ffff88002803d000(0000) knlGS:0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110668] CS:  0010 DS: 0000 ES: 0000
CR0: 0000000080050033
Apr 12 10:50:42 x2 kernel: [507977.110673] CR2: 0000000001ab6ff0 CR3:
00000000196c1000 CR4: 00000000000006e0
Apr 12 10:50:42 x2 kernel: [507977.110679] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110684] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Apr 12 10:50:42 x2 kernel: [507977.110690] Process rsync (pid: 24330,
threadinfo ffff88008900e000, task ffff88010d82c410)
Apr 12 10:50:42 x2 kernel: [507977.110695] Stack:
Apr 12 10:50:42 x2 kernel: [507977.110697]  ffff880000000001
0000000100000001 ffff880086949c00 000000018900f830
Apr 12 10:50:42 x2 kernel: [507977.110705] <0> ffff88008900f810
0000000000000000 ffff88008900f7a8 00000000a0b8dd90
Apr 12 10:50:42 x2 kernel: [507977.110714] <0> 0000000000000001
ffff8800ba8ace98 ffff880086949c00 0000000000000000
Apr 12 10:50:42 x2 kernel: [507977.110724] Call Trace:
Apr 12 10:50:42 x2 kernel: [507977.110746]  [<ffffffffa0b8ddc5>]
nilfs_btree_lookup_contig+0x65/0x330 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110760]  [<ffffffff8152cea0>] ?
_spin_lock_irq+0x10/0x20
Apr 12 10:50:42 x2 kernel: [507977.110778]  [<ffffffffa0b8b6a8>]
nilfs_bmap_lookup_contig+0x48/0x70 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110795]  [<ffffffffa0b83ff0>]
nilfs_get_block+0xb0/0x250 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110803]  [<ffffffff8152cf59>] ?
_spin_lock+0x9/0x10
Apr 12 10:50:42 x2 kernel: [507977.110814]  [<ffffffff8114f088>]
do_mpage_readpage+0x148/0x600
Apr 12 10:50:42 x2 kernel: [507977.110830]  [<ffffffffa0b83200>] ?
nilfs_readpage+0x0/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110837]  [<ffffffff8114f58a>]
mpage_readpage+0x4a/0x60
Apr 12 10:50:42 x2 kernel: [507977.110853]  [<ffffffffa0b83f40>] ?
nilfs_get_block+0x0/0x250 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110864]  [<ffffffff810f1b65>] ?
__inc_zone_page_state+0x25/0x30
Apr 12 10:50:42 x2 kernel: [507977.110872]  [<ffffffff810daa51>] ?
add_to_page_cache_locked+0xd1/0x130
Apr 12 10:50:42 x2 kernel: [507977.110888]  [<ffffffffa0b83200>] ?
nilfs_readpage+0x0/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110894]  [<ffffffff810daaf6>] ?
add_to_page_cache_lru+0x46/0x70
Apr 12 10:50:42 x2 kernel: [507977.110910]  [<ffffffffa0b83213>]
nilfs_readpage+0x13/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110917]  [<ffffffff810dab99>]
__read_cache_page+0x79/0xe0
Apr 12 10:50:42 x2 kernel: [507977.110932]  [<ffffffffa0b83200>] ?
nilfs_readpage+0x0/0x20 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110940]  [<ffffffff810dc071>]
read_cache_page_async+0x31/0xe0
Apr 12 10:50:42 x2 kernel: [507977.110946]  [<ffffffff810dc12d>]
read_cache_page+0xd/0x60
Apr 12 10:50:42 x2 kernel: [507977.110967]  [<ffffffffa0b9c786>] ?
nilfs_palloc_block_get_entry+0x26/0x60 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.110983]  [<ffffffffa0b84a3f>]
nilfs_get_page+0x1f/0x80 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111000]  [<ffffffffa0b85388>]
nilfs_find_entry+0x78/0x220 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111017]  [<ffffffffa0b85544>]
nilfs_inode_by_name+0x14/0x30 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111034]  [<ffffffffa0b8761d>]
nilfs_lookup+0x3d/0x70 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111043]  [<ffffffff81128b0a>]
real_lookup+0xda/0x150
Apr 12 10:50:42 x2 kernel: [507977.111052]  [<ffffffff8112a700>]
do_lookup+0xb0/0xe0
Apr 12 10:50:42 x2 kernel: [507977.111059]  [<ffffffff8112b2ad>]
__link_path_walk+0x6ad/0xdc0
Apr 12 10:50:42 x2 kernel: [507977.111067]  [<ffffffff8112bba7>]
path_walk+0x57/0xb0
Apr 12 10:50:42 x2 kernel: [507977.111075]  [<ffffffff8112bc53>]
do_path_lookup+0x53/0xa0
Apr 12 10:50:42 x2 kernel: [507977.111083]  [<ffffffff8112c8f2>]
user_path_at+0x52/0xa0
Apr 12 10:50:42 x2 kernel: [507977.111091]  [<ffffffff8152cf59>] ?
_spin_lock+0x9/0x10
Apr 12 10:50:42 x2 kernel: [507977.111100]  [<ffffffff81276e35>] ?
_atomic_dec_and_lock+0x55/0x80
Apr 12 10:50:42 x2 kernel: [507977.111108]  [<ffffffff811238bf>] ?
cp_new_stat+0xdf/0xf0
Apr 12 10:50:42 x2 kernel: [507977.111114]  [<ffffffff81123ac7>]
vfs_fstatat+0x37/0x70
Apr 12 10:50:42 x2 kernel: [507977.111121]  [<ffffffff81123b69>]
vfs_lstat+0x19/0x20
Apr 12 10:50:42 x2 kernel: [507977.111127]  [<ffffffff81123b8f>]
sys_newlstat+0x1f/0x50
Apr 12 10:50:42 x2 kernel: [507977.111138]  [<ffffffff81012042>]
system_call_fastpath+0x16/0x1b
Apr 12 10:50:42 x2 kernel: [507977.111142] Code: 40 00 48 8b 4b 50 be
01 00 00 00 48 8d 53 10 8b 89 a8 00 00 00 d3 e6 48 63 f6 48 83 ee 10
48 c1 fe 04 48 c1 e6 03 e9 02 fe ff ff <0f> 0b eb fe 66 66 2e 0f 1f 84
00 00 00 00 00 55 48 89 e5 48 83
Apr 12 10:50:42 x2 kernel: [507977.111204] RIP  [<ffffffffa0b8d6a1>]
nilfs_btree_do_lookup+0x291/0x2a0 [nilfs2]
Apr 12 10:50:42 x2 kernel: [507977.111223]  RSP <ffff88008900f748>
Apr 12 10:50:42 x2 kernel: [507977.111230] ---[ end trace c49a62030f4c2c2c ]---

Since I could not determine how to build the latest version of the
kernel module for the 2.6.31 kernel, I rebooted on a 64-bit Ubuntu
Hardy system with a 2.6.24 kernel and installed the latest NILFS2
kernel module and utilities from the git repository.  When I accessed the
DVD-RAM with rsync to see which files were in error, I got the
following messages in syslog:

Apr 12 17:15:37 x2 kernel: [  829.444875] Pid: 18374, comm: rsync Not
tainted 2.6.24-27-generic #1
Apr 12 17:15:37 x2 kernel: [  829.444879]
Apr 12 17:15:37 x2 kernel: [  829.444880] Call Trace:
Apr 12 17:15:37 x2 kernel: [  829.444913]  [<ffffffff8844218f>]
:nilfs2:nilfs_btree_do_lookup+0x28f/0x2b0
Apr 12 17:15:37 x2 kernel: [  829.444935]  [<ffffffff88443a57>]
:nilfs2:nilfs_btree_lookup_contig+0x67/0x330
Apr 12 17:15:37 x2 kernel: [  829.444946]  [__down_read+0x12/0xb1]
__down_read+0x12/0xb1
Apr 12 17:15:37 x2 kernel: [  829.444961]  [<ffffffff884417f7>]
:nilfs2:nilfs_bmap_lookup_contig+0x47/0x70
Apr 12 17:15:37 x2 kernel: [  829.444974]  [<ffffffff8843b354>]
:nilfs2:nilfs_get_block+0xb4/0x250
Apr 12 17:15:37 x2 kernel: [  829.444990]
[do_mpage_readpage+0x40b/0x540] do_mpage_readpage+0x40b/0x540
Apr 12 17:15:37 x2 kernel: [  829.445000]  [<ffffffff8843b2a0>]
:nilfs2:nilfs_get_block+0x0/0x250
Apr 12 17:15:37 x2 kernel: [  829.445026]  [<ffffffff8843a1c0>]
:nilfs2:nilfs_readpage+0x0/0x10
Apr 12 17:15:37 x2 kernel: [  829.445029]
[ext3:mpage_readpage+0x4e/0x150] mpage_readpage+0x4e/0x70
Apr 12 17:15:37 x2 kernel: [  829.445037]  [<ffffffff8843b2a0>]
:nilfs2:nilfs_get_block+0x0/0x250
Apr 12 17:15:37 x2 kernel: [  829.445041]  [__alloc_pages+0xaa/0x3f0]
__alloc_pages+0xaa/0x3f0
Apr 12 17:15:37 x2 kernel: [  829.445054]  [<ffffffff8843a1c0>]
:nilfs2:nilfs_readpage+0x0/0x10
Apr 12 17:15:37 x2 kernel: [  829.445057]
[add_to_page_cache+0xb9/0xd0] add_to_page_cache+0xb9/0xd0
Apr 12 17:15:37 x2 kernel: [  829.445063]
[add_to_page_cache_lru+0x24/0x40] add_to_page_cache_lru+0x24/0x40
Apr 12 17:15:37 x2 kernel: [  829.445068]
[read_cache_page_async+0x96/0x150] read_cache_page_async+0x96/0x150
Apr 12 17:15:37 x2 kernel: [  829.445077]  [read_cache_page+0x6/0x40]
read_cache_page+0x6/0x40
Apr 12 17:15:37 x2 kernel: [  829.445086]  [<ffffffff8843b7b8>]
:nilfs2:nilfs_get_page+0x38/0x270
Apr 12 17:15:37 x2 kernel: [  829.445091]  [__slab_alloc+0x1bd/0x410]
__slab_alloc+0x1bd/0x410
Apr 12 17:15:37 x2 kernel: [  829.445100]  [<ffffffff8843cd72>]
:nilfs2:nilfs_alloc_inode_common+0x22/0x70
Apr 12 17:15:37 x2 kernel: [  829.445115]  [<ffffffff8843bd64>]
:nilfs2:nilfs_find_entry+0x84/0x1c0
Apr 12 17:15:37 x2 kernel: [  829.445134]  [<ffffffff8843beb1>]
:nilfs2:nilfs_inode_by_name+0x11/0x30
Apr 12 17:15:37 x2 kernel: [  829.445146]  [<ffffffff8843ec55>]
:nilfs2:nilfs_lookup+0x45/0xc0
Apr 12 17:15:37 x2 kernel: [  829.445153]  [do_lookup+0x1b6/0x250]
do_lookup+0x1b6/0x250
Apr 12 17:15:37 x2 kernel: [  829.445165]
[__link_path_walk+0x74c/0xe90] __link_path_walk+0x74c/0xe90
Apr 12 17:15:37 x2 kernel: [  829.445180]  [link_path_walk+0x5b/0x100]
link_path_walk+0x5b/0x100
Apr 12 17:15:37 x2 kernel: [  829.445202]  [do_path_lookup+0x8a/0x250]
do_path_lookup+0x8a/0x250
Apr 12 17:15:37 x2 kernel: [  829.445206]  [getname+0x1a9/0x220]
getname+0x1a9/0x220
Apr 12 17:15:37 x2 kernel: [  829.445213]  [__user_walk_fd+0x4b/0x80]
__user_walk_fd+0x4b/0x80
Apr 12 17:15:37 x2 kernel: [  829.445220]  [vfs_lstat_fd+0x2c/0x70]
vfs_lstat_fd+0x2c/0x70
Apr 12 17:15:37 x2 kernel: [  829.445239]  [sys_newlstat+0x27/0x50]
sys_newlstat+0x27/0x50
Apr 12 17:15:37 x2 kernel: [  829.445256]  [system_call+0x7e/0x83]
system_call+0x7e/0x83
Apr 12 17:15:37 x2 kernel: [  829.445270]
Apr 12 17:15:37 x2 kernel: [  829.445271] NILFS: btree level mismatch: 61 != 1

The utility lssu reported the following segment structure:

              SEGNUM        DATE     TIME STAT     NBLOCKS
                   0  ---------- --:--:--  ---           0
              ...
                  20  ---------- --:--:--  ---           0
                  21  2010-04-11 23:23:21  ad-         268
                  22  ---------- --:--:--  ad-           0
                  23  ---------- --:--:--  ---           0
              ...
                 191  ---------- --:--:--  ---           0
                 192  2010-04-11 17:18:46  -d-        2048
                 193  2010-04-11 17:18:47  -d-        2048
                 194  2010-04-11 17:18:58  -d-        2048
              ...
                 533  2010-04-11 20:42:02  -d-        2048
                 534  2010-04-11 20:42:54  -d-        2048
                 535  2010-04-11 22:13:18  -d-        2047
                 536  ---------- --:--:--  ---           0
                 537  ---------- --:--:--  ---           0
                 538  ---------- --:--:--  ---           0
                 539  ---------- --:--:--  ---           0
                 540  ---------- --:--:--  ---           0
                 541  ---------- --:--:--  ---           0
                 542  ---------- --:--:--  ---           0
                 543  ---------- --:--:--  ---           0
                 544  ---------- --:--:--  ---           0
                 545  ---------- --:--:--  ---           0

The checkpoints on the DVD-RAM included 4 snapshots, each of which was
associated with a distinct backup that I had copied to build the
DVD-RAM.  I compared those snapshots with the sample backups from
which they were constructed and found that a single directory was
missing from each of the snapshots.  That directory was an old one
that was unmodified during the period from which I selected the sample
backups.  I assume that the cleaner was copying that directory when
the write errors began and that a shared copy of that directory was
lost from all of the snapshots.

In summary, it appears that the write errors were isolated to a single
directory.  The errors resulted in a file system that was unusable
with the NILFS2 kernel module in 2.6.31 but was accessible with the
latest version of the kernel module when installed on an older kernel
(2.6.24).  (Is there some way to install a version of the NILFS2
kernel module from the NILFS2 git repository on a kernel newer than
2.6.29?)

I have not yet reformatted the DVD-RAM.  I will perform additional
experiments with it if they would be helpful.

Ed
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Response to a Write Error
@ 2010-04-14 16:51   ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-14 16:51 UTC (permalink / raw)
  To: third07-Re5JQEeQqe8AvxtiuMwx3w; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Tue, 13 Apr 2010 19:38:30 -0400, "Ed F." <third07-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> For several years I have been using rsync with the --link-dest option
> in scripts to make incremental backups using hard links.  I have been
> testing NILFS2 snapshots as an alternative way to do equivalent
> backups.  Some of my experiments have involved writing to a DVD-RAM. I
> am reporting what happened when a DVD-RAM had non-recoverable write
> errors in case that information is of interest to the developers.
> 
> My DVD-RAM was written on a 64-bit Ubuntu Karmic system with a 2.6.31
> kernel.  The NILFS2 module came with that kernel; the NILFS2 utilities
> were built from the head of the NILFS2 git repository on April 8, 2010.

Thank you for the detailed report.

According to the syslog messages, the btree of a file or a directory
broke for some reason.  The difference in behavior between the
standalone nilfs2 module and the 2.6.31-20 Ubuntu kernel is not a bug.

It's because the latest nilfs2 standalone module includes a backported
patch to handle btree corruption as an error instead of triggering a
kernel BUG.

I have no idea why the corruption occurred, since the 2.6.31-20 Ubuntu
kernel is almost the same as the 2.6.31.y stable kernel.

I'd appreciate it if you could help test whether the same problem
arises with nilfs in 2.6.32.y or 2.6.33.y.

The following are the steps to build and install a newer nilfs2 module
on the 2.6.31 kernel:

  $ sudo apt-get install linux-headers-`uname -r`
  $ cd /usr/src
  $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.11.tar.bz2
  $ tar jxf linux-2.6.32.11.tar.bz2
  $ cd linux-2.6.32.11/fs/nilfs2
  $ make -C /lib/modules/`uname -r`/build SUBDIRS=$PWD BUILD_DIR=$PWD modules
  $ su
  # rmmod nilfs2
  # mv /lib/modules/`uname -r`/kernel/fs/nilfs2/nilfs2.ko{,.bak}
  # install -m 644 -c nilfs2.ko /lib/modules/`uname -r`/kernel/fs/nilfs2/
  # /sbin/depmod -A
  # mount -t nilfs2 <device> <mount-dir>
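
To double-check that the replacement module is the one that will be
picked up, something like this may help (the vermagic string should
match the running kernel):

  # modinfo nilfs2 | grep vermagic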


As for the write errors, the following two patches may work:

"nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT" (merged to v2.6.32)
"nilfs2: do not return io error for bio allocation failure" (merged to v2.6.33)

It's hard to track the cause from the raw block data, so you don't
have to keep the DVD-RAM data.  (A dumpseg log for the in-use segments
is enough for me.)

Thanks,
Ryusuke Konishi 



* Re: Response to a Write Error
@ 2010-04-15  0:46       ` Ed F.
From: Ed F. @ 2010-04-15  0:46 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 14, 2010 at 12:51 PM, Ryusuke Konishi <ryusuke-sG5X7nlA6pw@public.gmane.org> wrote:

> I have no idea why the corruption occurred, since the 2.6.31-20 Ubuntu
> kernel is almost the same as the 2.6.31.y stable kernel.

I believe that the error was caused by accumulated defects on the
DVD-RAM disk, not by NILFS2.  I interpret the error messages to mean
that the DVD drive detected a hardware error when it tried to read
sector 8778160 and then had a write error at sector 8781280 that could
not be corrected by the drive because there were no more spare
defect-correction sectors on that particular DVD-RAM disk.

> I'd appreciate it if you could help test whether the same problem
> arises with nilfs in 2.6.32.y or 2.6.33.y.

The problem does not seem to be repeatable, even on 2.6.31.  I
detected no errors when I previously ran the same general test many
times using both that DVD-RAM and a loop-back file system on a hard
drive.

I will try again when I upgrade to 2.6.32 with Ubuntu 10.04.

> The following are the steps to build and install a newer nilfs2 module
> on the 2.6.31 kernel:
>
>  $ sudo apt-get install linux-headers-`uname -r`
>  $ cd /usr/src
>  $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.11.tar.bz2
>  $ tar jxf linux-2.6.32.11.tar.bz2
>  $ cd linux-2.6.32.11/fs/nilfs2
>  $ make -C /lib/modules/`uname -r`/build SUBDIRS=$PWD BUILD_DIR=$PWD modules
>  $ su
>  # rmmod nilfs2
>  # mv /lib/modules/`uname -r`/kernel/fs/nilfs2/nilfs2.ko{,.bak}
>  # install -m 644 -c nilfs2.ko /lib/modules/`uname -r`/kernel/fs/nilfs2/
>  # /sbin/depmod -A
>  # mount -t nilfs2 <device> <mount-dir>

Thank you for clarifying that process.  It will be very useful in
keeping the kernel module up-to-date.

> As for the write errors, the following two patches may work:
>
> "nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT" (merged to v2.6.32)
> "nilfs2: do not return io error for bio allocation failure" (merged to v2.6.33)
>
> It's hard to track the cause from the raw block data, so you don't
> have to keep the DVD-RAM data.  (dumpseg log for in-use segments is
> enough to me)

I will send a segment dump off-list.

If my analysis is correct, I have a DVD-RAM disk that has no more
spare sectors with which my DVD drive can correct defects.  Would you
like me to save that disk for future error-handling experiments?

Ed


* Re: Response to a Write Error
@ 2010-04-16 12:31           ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-16 12:31 UTC (permalink / raw)
  To: third07-Re5JQEeQqe8AvxtiuMwx3w; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi
On Wed, 14 Apr 2010 20:46:46 -0400, "Ed F." <third07-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > The following are the steps to build and install a newer nilfs2 module
> > on the 2.6.31 kernel:
> >
> >  $ sudo apt-get install linux-headers-`uname -r`
> >  $ cd /usr/src
> >  $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.11.tar.bz2
> >  $ tar jxf linux-2.6.32.11.tar.bz2
> >  $ cd linux-2.6.32.11/fs/nilfs2
> >  $ make -C /lib/modules/`uname -r`/build SUBDIRS=$PWD BUILD_DIR=$PWD modules
> >  $ su
> >  # rmmod nilfs2
> >  # mv /lib/modules/`uname -r`/kernel/fs/nilfs2/nilfs2.ko{,.bak}
> >  # install -m 644 -c nilfs2.ko /lib/modules/`uname -r`/kernel/fs/nilfs2/
> >  # /sbin/depmod -A
> >  # mount -t nilfs2 <device> <mount-dir>
> 
> Thank you for clarifying that process.  It will be very useful in
> keeping the kernel module up-to-date.

Please note that this procedure is not guaranteed to work.  It may
fail for some versions, since the internal implementation or semantics
of the kernel often change.

> If my analysis is correct, I have a DVD-RAM disk that has no more
> sectors with which my DVD drive can correct defects.  Would you like
> for me to save that disk for future error handling experiments?

Thanks for the offer.
I am interested in what happens on the latest kernel.
If you have an opportunity to try it, please report the result.

Ideally, nilfs is expected to skip erroneous segments.
In reality, something unexpected may happen ;)

Cheers,
Ryusuke Konishi


* Writes doubled by NILFS2
@ 2010-04-20  8:39               ` Yongkun Wang
From: Yongkun Wang @ 2010-04-20  8:39 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hey, guys,

We have a database system whose data is stored on a disk formatted with
NILFS2 (nilfs-2.0.15, kmod-nilfs-2.0.5-1.2.6.18_92.1.22.el5.x86_64).

I have run a trace at the system call level and the block IO level,
that is, tracing the requests before and after they are processed by
NILFS2.
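
(For reference, traces of this kind can be collected with, e.g., strace
at the system call level and blktrace at the block IO level; the PID
and device below are illustrative:

  $ strace -f -e trace=write,pwrite64 -p <database-pid>
  # blktrace -d /dev/sdb1 -o dbtrace
  # blkparse -i dbtrace
)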

We use synchronous IO, so the amount of writes at the two trace points
should be equal.  That is the case when we use the EXT2 file system.

However, for NILFS2, we found that the writes are doubled, that is, the
amount of writes is doubled after being processed by NILFS2.  The amount
of writes at the system call level is equal between EXT2 and NILFS2.

Since all the addresses are log-structured, it is hard to tell what
the additional writes are.

Can you provide some hints on the additional writes?  Are they caused
by some special function such as snapshots?

Thank you very much.

Best Regards,
Yongkun



* Re: Writes doubled by NILFS2
@ 2010-04-20 11:45                 ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-20 11:45 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Tue, 20 Apr 2010 17:39:13 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> Hey, guys,
> 
> We have a database system whose data is stored on a disk formatted with
> NILFS2 (nilfs-2.0.15, kmod-nilfs-2.0.5-1.2.6.18_92.1.22.el5.x86_64).
> 
> I have run a trace at the system call level and the block IO level,
> that is, tracing the requests before and after they are processed by
> NILFS2.
> 
> We use synchronous IO, so the amount of writes at the two trace points
> should be equal.  That is the case when we use the EXT2 file system.
> 
> However, for NILFS2, we found that the writes are doubled, that is, the
> amount of writes is doubled after being processed by NILFS2.  The amount
> of writes at the system call level is equal between EXT2 and NILFS2.

Interesting results.  What kind of synchronous write did you use in
the measurement?  fsync, or O_SYNC writes?

> Since all the addresses are log-structured, it is hard to tell what
> the additional writes are.
>
> Can you provide some hints on the additional writes?  Are they caused
> by some special function such as snapshots?

You can look into the logs with the dumpseg(8) command:

 # dumpseg <segment number>

This shows a summary of the blocks written in the specified segment.
The lssu(1) command can help with finding a log head.


In the dump log, files with inode numbers 3, 4, 5, and 6 are metadata.
The log format is depicted on page 10 of the following slides:

  http://www.nilfs.org/papers/jls2009-nilfs.pdf
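
For example, to pull out just the per-file block summaries from a
segment dump (the segment number here is only an illustration):

  # dumpseg 698 | grep ino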


In general, copy-on-write filesystems, including log-structured ones,
are said to incur overhead from metadata writes, especially for
synchronous writes.

I guess small-sized fsyncs or O_SYNC writes are causing the overhead.

Thanks,
Ryusuke


* RE: Writes doubled by NILFS2
@ 2010-04-21 14:41                     ` Yongkun Wang
From: Yongkun Wang @ 2010-04-21 14:41 UTC (permalink / raw)
  To: 'Ryusuke Konishi'; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi, Ryusuke,

Thank you for the reply.

It's O_SYNC.
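
(For anyone wanting to reproduce a stream of small O_SYNC writes, a
minimal sketch with an illustrative path and sizes is:

  $ dd if=/dev/zero of=/nilfs/testfile bs=4k count=1000 oflag=sync

each 4 KiB write then completes synchronously before the next one is
issued.)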

The dumpseg and lssu utilities are useful tools, and the description
of the structure in the slides is very clear.
Thank you for the hint about checking the metadata.

The information reported by dumpseg:

# dumpseg 698 | grep ino
      ino = 12, cno = 9, nblocks = 1356, ndatblk = 1343
      ino = 6, cno = 9, nblocks = 1, ndatblk = 1
      ino = 4, cno = 0, nblocks = 1, ndatblk = 1
      ino = 5, cno = 0, nblocks = 2, ndatblk = 2
      ino = 3, cno = 0, nblocks = 681, ndatblk = 681

The file with inode 12 (ino=12) is the data file.  "nblocks" is the
number of occupied blocks, right?

In this segment, the number of data-file blocks is 1356, and the number
of remaining blocks is 685 (1+1+2+681).
So there are 685 "overhead" blocks in this segment?

This may explain the additional writes in our trace.

Best Regards,
Yongkun



* Re: Writes doubled by NILFS2
@ 2010-04-21 16:59                       ` Ryusuke Konishi
From: Ryusuke Konishi @ 2010-04-21 16:59 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Wed, 21 Apr 2010 23:41:13 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> Hi, Ryusuke,
> 
> Thank you for the reply.
> 
> It's O_SYNC.

So the metadata is written out on every write call.

> The dumpseg and lssu utilities are useful tools, and the description
> of the structure in the slides is very clear.
> Thank you for the hint about checking the metadata.
> 
> The information by dumpseg:
> 
> # dumpseg 698 | grep ino
>       ino = 12, cno = 9, nblocks = 1356, ndatblk = 1343
>       ino = 6, cno = 9, nblocks = 1, ndatblk = 1
>       ino = 4, cno = 0, nblocks = 1, ndatblk = 1
>       ino = 5, cno = 0, nblocks = 2, ndatblk = 2
>       ino = 3, cno = 0, nblocks = 681, ndatblk = 681
>
> The file with inode 12 (ino=12) is the data file.  "nblocks" is the
> number of occupied blocks, right?

Right.

> In this segment, the number of data-file blocks is 1356, and the number
> of remaining blocks is 685 (1+1+2+681).
> So there are 685 "overhead" blocks in this segment?

Roughly, yes.

One concern is that the regular file (ino=12) may be continued from
the preceding segments, and the DAT file (ino=3) may continue into
segment 699.

The number of DAT file blocks (ino=3) looks very large to me.

Ideally, the ratio of N-data-blocks to N-dat-blocks comes close to

 128 : 1

since a 4 KiB DAT block stores 128 entries of indirect block
addresses.
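
(Assuming each DAT entry occupies 32 bytes, that is 4096 / 32 = 128
entries per 4 KiB block.)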

If the above count is normal, your workload seems extremely random and
scattered.

Can you probe the previous and next segments?

> This may explain the additional writes in our trace.

Very interesting.

Thanks,
Ryusuke Konishi



* RE: Writes doubled by NILFS2
@ 2010-04-22 13:54                           ` Yongkun Wang
From: Yongkun Wang @ 2010-04-22 13:54 UTC (permalink / raw)
  To: 'Ryusuke Konishi'; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi, Ryusuke,

The data file is very large, several GB, and it is scattered across
hundreds of segments, so "ndatblk" may be very large.

I think all the "ndatblk" should also be counted as overhead.  Then
the amount of superfluous blocks is almost equal to that in our trace.

Yes, the workload is composed of small random writes over a large
address space.  The workload is also very intensive; even with
asynchronous IO, the results are almost the same.

The data was reloaded last night; 3 consecutive segments are selected
as follows, just FYI.

# dumpseg 860 | grep ino
	ino = 12, cno = 175, nblocks = 1090, ndatblk = 1090
(957 lines with ino=12 merged into the line above; nblocks and ndatblk
summed.)

# dumpseg 861 | grep ino
	ino = 12, cno = 175, nblocks = 1161, ndatblk = 585
(404 lines with ino=12 merged into the line above; nblocks and ndatblk
summed.)
	ino = 6, cno = 175, nblocks = 1, ndatblk = 1
	ino = 4, cno = 0, nblocks = 3, ndatblk = 2
	ino = 5, cno = 0, nblocks = 2, ndatblk = 2
	ino = 3, cno = 0, nblocks = 475, ndatblk = 475

# dumpseg 862 | grep ino
	ino = 3, cno = 0, nblocks = 720, ndatblk = 693
	ino = 12, cno = 176, nblocks = 693, ndatblk = 693
(631 lines with ino=12 merged into the line above; nblocks and ndatblk
summed.)
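
The per-inode sums above can be produced along these lines; a sketch
that assumes the "ino = ..., nblocks = ..., ndatblk = ..." line format
shown here:

  # dumpseg 861 | grep ino | awk -F'[=,]' '
      {nb[$2] += $6; nd[$2] += $8}
      END {for (i in nb)
             printf "ino =%s, nblocks = %d, ndatblk = %d\n", i, nb[i], nd[i]}'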

Please correct me if there is something wrong.

Thank you.

Best Regards,
Yongkun

-----Original Message-----
From: Ryusuke Konishi [mailto:ryusuke-sG5X7nlA6pw@public.gmane.org] 
Sent: Thursday, April 22, 2010 1:59 AM
To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org
Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: Writes doubled by NILFS2

Hi,
On Wed, 21 Apr 2010 23:41:13 +0900, "Yongkun Wang"
<yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> hi, Ryusuke,
> 
> Thank you for the reply.
> 
> It's O_SYNC.

So, the metadata is written out every write call.
 
> The dumpseg and lssu are useful tools. The description of the structure in
> the slides is very clear. 
> Thank you for the hint of checking the meta data.
> 
> The information by dumpseg:
> 
> # dumpseg 698 | grep ino
>       ino = 12, cno = 9, nblocks = 1356, ndatblk = 1343
>       ino = 6, cno = 9, nblocks = 1, ndatblk = 1
>       ino = 4, cno = 0, nblocks = 1, ndatblk = 1
>       ino = 5, cno = 0, nblocks = 2, ndatblk = 2
>       ino = 3, cno = 0, nblocks = 681, ndatblk = 681
>
> File with inode 12 (ino=12) is the data file. "nblocks" is the number of
> occupied blocks, right?

Right.

> In this segment, the number of data file blocks is 1356, and the number of
> the rest of the blocks is 685 (1+1+2+681).
> Total 685 "overhead" blocks in this segment?

Roughly, yes.

One matter for concern is that the regular file (ino=12) may be
continued from the precedent segments.  And, the DAT file (ino=3) may
possibly continue into the segment 699.

The number of the DAT file blocks (ino=3) looks so large to me.

Ideally the ratio of N-data-blocks and N-dat-blocks comes close to

 128 : 1

since a 4 KiB DAT block stores 128 entries of indirect block
addresses.

If the above count is normal, your workload seems extremely random and
scattered.

Can you probe the previous and next segment?

> This may explain the additional writes in our trace.

Very interesting.

Thanks,
Ryusuke Konishi

> -----Original Message-----
> From: Ryusuke Konishi [mailto:ryusuke-sG5X7nlA6pw@public.gmane.org] 
> Sent: Tuesday, April 20, 2010 8:45 PM
> To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org
> Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: Writes doubled by NILFS2
> 
> Hi,
> On Tue, 20 Apr 2010 17:39:13 +0900, "Yongkun Wang"
> <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> > Hey, guys,
> > 
> > We have a database system, the data is stored on the disk formatted with
> > NILFS2 (nilfs-2.0.15, kmod-nilfs-2.0.5-1.2.6.18_92.1.22.el5.x86_64).
> > 
> > I have run a trace at the system call level and the block IO level, that
> is,
> > tracing the requests before processed by NILFS2 and after processed by
> > NILFS2.
> > 
> > We use synchronous IO. So the amount of writes at the two trace points
> > should be equal. 
> > It is true when we use EXT2 file system.
> > 
> > However, for NILFS2, we found that the writes have been doubled; that is,
> > the amount of writes is doubled after being processed by NILFS2. The amount
> > of writes at the system call level is equal between EXT2 and NILFS2.
> 
> Interesting results.  What kind of synchronous write did you use in
> the measurement?  fsync, or O_SYNC writes?
>  
> > Since all the addresses are log-structured, it is hard to know what the
> > additional writes are.
> >
> > Can you provide some hints on the additional writes? Are they caused by
> > some special function such as snapshots?
> 
> You can look into the logs with the dumpseg(8) command:
> 
>  # dumpseg <segment number>
> 
> This shows a summary of the blocks written in the specified segment. The
> lssu(1) command would be of help for finding a log head.
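> 
> For example (a minimal sketch; pick a recently written segment from the
> lssu listing):
> 
>  # lssu                                   # show per-segment usage state
>  # dumpseg <segment number> | grep ino    # per-file block counts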
> 
> 
> In the dump log, files with inode numbers 3, 4, 5, and 6 are metadata.  The
> log format is depicted on page 10 of the following slides:
> 
>   http://www.nilfs.org/papers/jls2009-nilfs.pdf
> 
> 
> In general, copy-on-write filesystems, including LFS, are said to incur
> overhead from metadata writes, especially for synchronous writes.
> 
> I guess small-sized fsyncs or O_SYNC writes are causing the overhead.
> 
> Thanks,
> Ryusuke
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Writes doubled by NILFS2
  2010-04-22 13:54                           ` Yongkun Wang
@ 2010-04-23 15:59                             ` Ryusuke Konishi
       [not found]                               ` <20100424.005941.176067504.ryusuke-sG5X7nlA6pw@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Ryusuke Konishi @ 2010-04-23 15:59 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Thu, 22 Apr 2010 22:54:59 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> hi, Ryusuke,
> 
> The data file is very large (several GB) and scatters across hundreds of
> segments, so the "ndatblk" may be very large.
>
> I think all the "ndatblk" should also be considered overhead. So the
> amount of superfluous blocks is almost equal to that in our trace.

No, "ndatblk" means the number of file data blocks.

Only the "ndatblk" of metadata files whose inode number is 3, 4,
5, or 6 should be regarded as metadata.
 
> Yes, the workload is composed of small random writes on a large address
> space. And the workload is very intensive; even using asynchronous IO, the
> results are almost the same.
> 
> The data was reloaded last night; 3 consecutive segments are selected
> as follows, just FYI.
> 
> # dumpseg 860 | grep ino
> 	ino = 12, cno = 175, nblocks = 1090, ndatblk = 1090
> 957 further lines with ino=12 were merged; nblocks and ndatblk are summed
> respectively.

This means that all the 1090 blocks (nblocks) are "ndatblk" (i.e. data
blocks) and no btree node blocks are included in the diff.

In other words, "ndatblk" shows a breakdown of "nblocks".

Certainly, this abbreviation looks confusing.

> # dumpseg 861 | grep ino
> 	ino = 12, cno = 175, nblocks = 1161, ndatblk = 585
> 404 further lines with ino=12 were merged; nblocks and ndatblk are summed respectively.

This shows that the remaining 576 of the 1161 blocks (1161 - 585) were
btree node blocks.  (A very large btree update!)

> 	ino = 6, cno = 175, nblocks = 1, ndatblk = 1
> 	ino = 4, cno = 0, nblocks = 3, ndatblk = 2
> 	ino = 5, cno = 0, nblocks = 2, ndatblk = 2
> 	ino = 3, cno = 0, nblocks = 475, ndatblk = 475
> 
> # dumpseg 862 | grep ino
> 	ino = 3, cno = 0, nblocks = 720, ndatblk = 693
> 	ino = 12, cno = 176, nblocks = 693, ndatblk = 693
> 631 further lines with ino=12 were merged; nblocks and ndatblk are summed respectively.
> 
> Please correct me if anything is wrong.
> 
> Thank you.
> 
> Best Regards,
> Yongkun

With regards,
Ryusuke Konishi
 
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Writes doubled by NILFS2
       [not found]                               ` <20100424.005941.176067504.ryusuke-sG5X7nlA6pw@public.gmane.org>
@ 2010-04-25 14:10                                 ` Yongkun Wang
  2010-04-25 16:12                                   ` Ryusuke Konishi
  0 siblings, 1 reply; 12+ messages in thread
From: Yongkun Wang @ 2010-04-25 14:10 UTC (permalink / raw)
  To: 'Ryusuke Konishi'; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi, Ryusuke,

You have made it clear to me.

nblocks: blocks occupied.
ndatblk: valid data blocks, the "leaf" blocks of the B-tree.
nblocks - ndatblk: the intermediate B-tree node blocks.

The ratio of overhead blocks is:
1 - sum(ndatblk[ino=12])/sum(nblocks)
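
Per segment, this ratio can be computed straight from the dumpseg output; a
minimal awk sketch (assuming only the "ino = ..., nblocks = ..., ndatblk =
..." line format shown above, with 861 as an example segment):

 # dumpseg 861 | grep ino | awk -F'[=,]' \
     '{ nb += $6 } $2 == 12 { nd += $8 } END { print 1 - nd / nb }'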

It is around 50% in our system, which seems high. Is it possible to
optimize it further?

>"This shows rear 576 blocks of the 1161 blocks were btree node blocks.
>(very large btree update!)"

Is it abnormal? Small random updates are common in transaction
processing systems. Maybe the B-tree node updating algorithm should be
revised for this kind of workload.

The dumpseg output for each segment is large.
I can send it to you as an attachment if necessary.

Thanks again.

Best Regards,
Yongkun

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Writes doubled by NILFS2
  2010-04-25 14:10                                 ` Yongkun Wang
@ 2010-04-25 16:12                                   ` Ryusuke Konishi
  0 siblings, 0 replies; 12+ messages in thread
From: Ryusuke Konishi @ 2010-04-25 16:12 UTC (permalink / raw)
  To: yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf
  Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,
On Sun, 25 Apr 2010 23:10:01 +0900, "Yongkun Wang" <yongkun-b/Nwfc8pyCekRP3n4FtbsiGn5s73+xxf@public.gmane.org> wrote:
> Hi, Ryusuke,
> 
> You have made it clear to me.
> 
> nblocks: blocks occupied.
> ndatblk: valid data blocks, the "leaf" blocks of the B-tree.
> nblocks - ndatblk: the intermediate B-tree node blocks.
>
> The ratio of overhead blocks is:
> 1 - sum(ndatblk[ino=12])/sum(nblocks)

Exactly.

> It is around 50% in our system, which seems high. Is it possible to
> optimize it further?

Well, this may be alleviated by extending the log format so that
O_SYNC writes do not write out metadata so often.

Actually, we partially applied this kind of optimization, but it seems
the conditions are not met in your case.

O_SYNC writes semantically require inode updates, and I guess this
prevents application of the optimization.

If you can change the sync method to O_DSYNC, things may change.
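
For a quick comparison from the command line (a rough sketch; GNU dd's
oflag=sync and oflag=dsync select the corresponding O_SYNC and O_DSYNC open
flags, and /mnt/nilfs/testfile is only a hypothetical path on the volume):

 # dd if=/dev/zero of=/mnt/nilfs/testfile bs=4k count=1000 oflag=sync
 # dd if=/dev/zero of=/mnt/nilfs/testfile bs=4k count=1000 oflag=dsync

Comparing dumpseg output for the segments written by each run should show
whether the metadata block counts shrink under O_DSYNC.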

> >"This shows rear 576 blocks of the 1161 blocks were btree node blocks.
> >(very large btree update!)"
> 
> Is it abnormal? Small random updates are common in transaction
> processing systems. Maybe the B-tree node updating algorithm should be
> revised for this kind of workload.

Agreed.  The frequent writing of btree nodes and DAT blocks is
undesirable.

I'll think about whether it's possible to apply the above optimization
even when the inode is updated.

> The dumpseg output for each segment is large.
> I can send it to you as an attachment if necessary.

Thanks.  But things seem clear from your reports thus far.

Regards,
Ryusuke Konishi

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-04-25 16:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-13 23:38 Response to a Write Error Ed F.
     [not found] ` <t2kf2d3ba1f1004131638nb61e68cek1ecb43a86c316d35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-04-14 16:51   ` Ryusuke Konishi
     [not found]     ` <20100415.015106.58888665.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-15  0:46       ` Ed F.
     [not found]         ` <j2kf2d3ba1f1004141746n9f914465q3e24d48bec7f990e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-04-16 12:31           ` Ryusuke Konishi
     [not found]             ` <20100416.213151.77349735.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-20  8:39               ` Writes doubled by NILFS2 Yongkun Wang
2010-04-20 11:45                 ` Ryusuke Konishi
     [not found]                   ` <20100420.204523.76757028.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-21 14:41                     ` Yongkun Wang
2010-04-21 16:59                       ` Ryusuke Konishi
     [not found]                         ` <20100422.015907.119858125.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-22 13:54                           ` Yongkun Wang
2010-04-23 15:59                             ` Ryusuke Konishi
     [not found]                               ` <20100424.005941.176067504.ryusuke-sG5X7nlA6pw@public.gmane.org>
2010-04-25 14:10                                 ` Yongkun Wang
2010-04-25 16:12                                   ` Ryusuke Konishi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.