* Incompatibility of internal bitmap with ext4 barriers?
From: Jason Tinker @ 2011-05-17 12:10 UTC
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3639 bytes --]

Hello,

I have encountered a weird bug when trying to use an ext4 partition on
top of an mdadm RAID5 array. The array has an internal bitmap, and ext4
is mounted with default options - which means that barriers are enabled.
When trying to write a large enough number of files, the system just
locks up indefinitely; only a hard reset helps. This seems to happen at
random times, yet consistently after several minutes of usage.
I tried different configuration options, and it seems to happen only
when both ext4 barriers and mdadm's internal bitmap are enabled.
After disabling ext4 barriers for good (it seemed like the lesser evil),
no lockups have happened for 3 months.
mdadm is version 3.1.3, the kernel is 2.6.32 (RHEL 6).
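
For reference, a rough sketch of how the two factors can be toggled
(/dev/md90 and /mnt/test-raid are the names from my test setup below;
adjust as needed):

  # ext4 barriers off (the workaround) / back on:
  mount -o remount,barrier=0 /mnt/test-raid
  mount -o remount,barrier=1 /mnt/test-raid

  # remove / re-add the internal write-intent bitmap on a live array:
  mdadm --grow /dev/md90 --bitmap=none
  mdadm --grow /dev/md90 --bitmap=internal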

Initially it happened on this configuration:
1 RAID0 of 2x1TB drives + 1 RAID0 of 2x1TB drives + 1x2TB drive.
Each RAID0 had one HDD on a PCI SATA (Silicon Image) controller and one
on the internal ICH7 (Intel G41 chipset); the 2TB drive was on the PCI
SATA controller.

Later I successfully reproduced the same bug in a test setup with all
partitions on a single drive.
The following steps were taken (a rough command sketch follows the list):
1. Created 6 identical blank partitions (I made 6x1GB partitions on a
single HDD).
2. Created 3 RAID0 arrays from these partitions: 0&1, 2&3, 4&5.
3. Created an MBR and a blank primary partition on each of these arrays
using fdisk (this step is probably optional).
4. Created 1 RAID5 array from these 3 partitions with --bitmap=internal,
everything else default.
5. Created an ext4 filesystem on the RAID5 with default options, mounted
with default options (barriers are enabled by default).
6. Tried to rsync several hundred ~1-20 MB files to the mounted directory.
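
Roughly, as a shell sketch (device names match the fdisk listing in the
attachments; any six blank ~1GB partitions will do):

  # 2. three RAID0 pairs from six blank partitions
  mdadm --create /dev/md910 --level=0 --raid-devices=2 /dev/sdb7 /dev/sdb8
  mdadm --create /dev/md911 --level=0 --raid-devices=2 /dev/sdb9 /dev/sdb10
  mdadm --create /dev/md912 --level=0 --raid-devices=2 /dev/sdb11 /dev/sdb12
  # 3. MBR + one blank primary partition on each array (fdisk /dev/md910 etc.)
  # 4. RAID5 over the three partitions, with an internal bitmap
  mdadm --create /dev/md90 --level=5 --raid-devices=3 --bitmap=internal \
        /dev/md910p1 /dev/md911p1 /dev/md912p1
  # 5. default ext4, default mount options (barriers enabled)
  mkfs.ext4 /dev/md90
  mount /dev/md90 /mnt/test-raid
  # 6. write several hundred 1-20 MB files
  rsync -av /somewhere /mnt/test-raid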

/var/log/messages at the moment of lock up:

kernel: INFO: task md90_raid5:13736 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_
timeout_secs" disables this message.
kernel: md90_raid5    D 0000000000000000     0 13736      2 0x00000080
kernel: ffff88003e0b7ab0 0000000000000046 ffff88006d273e60 ffff8800716b8240
kernel: ffff88003e0b7a50 ffffffff8123a553 ffff880072ea4800 0000000000000810
kernel: ffff88006f50ba98 ffff88003e0b7fd8 0000000000010518 ffff88006f50ba98
kernel: Call Trace:
kernel: [<ffffffff8123a553>] ? elv_insert+0x133/0x1f0
kernel: [<ffffffff810920ce>] ? prepare_to_wait+0x4e/0x80
kernel: [<ffffffff813d0535>] md_make_request+0x85/0x230
kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff81241652>] ? generic_make_request+0x1b2/0x4f0
kernel: [<ffffffff81241652>] generic_make_request+0x1b2/0x4f0
kernel: [<ffffffff8106333a>] ? find_busiest_group+0x96a/0xb40
kernel: [<ffffffffa03d8d9d>] ops_run_io+0x22d/0x330 [raid456]
kernel: [<ffffffff813d1ef6>] ? md_super_write+0xd6/0xe0
kernel: [<ffffffffa03db9f5>] handle_stripe+0x4d5/0x22e0 [raid456]
kernel: [<ffffffff81059db2>] ? finish_task_switch+0x42/0xd0
kernel: [<ffffffffa03ddc9f>] raid5d+0x49f/0x690 [raid456]
kernel: [<ffffffff813d182c>] md_thread+0x5c/0x130
kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff813d17d0>] ? md_thread+0x0/0x130
kernel: [<ffffffff81091a76>] kthread+0x96/0xa0
kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
kernel: [<ffffffff810919e0>] ? kthread+0x0/0xa0
kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20

I have included all additional info and logs about the test setup as
separate attachments.
According to the logs it seems that the bug is in md/mdadm, but I'm not
sure, since I haven't found any similar reports anywhere.
It would be great if someone tried to reproduce this on their machine;
it shouldn't take long, maybe 20 minutes or so...

[-- Attachment #2: fdisk-info --]
[-- Type: application/octet-stream, Size: 1116 bytes --]

Disk /dev/sdb: 1000.2 GB, 1000203804160 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953523055 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0000c1b3

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *        2048     1050623      524288   83  Linux
/dev/sdb2         1050624    63965183    31457280   83  Linux
/dev/sdb3        63965184   126879743    31457280   83  Linux
/dev/sdb4       126879744  1953521663   913320960    5  Extended
/dev/sdb5       126883840   189798399    31457280   83  Linux
/dev/sdb6       189800448   198189055     4194304   82  Linux swap / Solaris
/dev/sdb7       198193968   200137769      971901   83  Linux
/dev/sdb8       200137833   202097699      979933+  83  Linux
/dev/sdb9       202097763   204057629      979933+  83  Linux
/dev/sdb10      204057693   206017559      979933+  83  Linux
/dev/sdb11      206017623   207977489      979933+  83  Linux
/dev/sdb12      207977553   209937419      979933+  83  Linux


[-- Attachment #3: messages --]
[-- Type: application/octet-stream, Size: 17016 bytes --]

Feb  8 19:04:24 store-el6 kernel: md: bind<md910p1>
Feb  8 19:04:24 store-el6 kernel: md: bind<md911p1>
Feb  8 19:04:24 store-el6 kernel: md: bind<md912p1>
Feb  8 19:04:24 store-el6 kernel: raid5: device md911p1 operational as raid disk 1
Feb  8 19:04:24 store-el6 kernel: raid5: device md910p1 operational as raid disk 0
Feb  8 19:04:24 store-el6 kernel: raid5: allocated 3230kB for md90
Feb  8 19:04:24 store-el6 kernel: 1: w=1 pa=0 pr=3 m=1 a=2 r=3 op1=0 op2=0
Feb  8 19:04:24 store-el6 kernel: 0: w=2 pa=0 pr=3 m=1 a=2 r=3 op1=0 op2=0
Feb  8 19:04:24 store-el6 kernel: raid5: raid level 5 set md90 active with 2 out of 3 devices, algorithm 2
Feb  8 19:04:24 store-el6 kernel: RAID5 conf printout:
Feb  8 19:04:24 store-el6 kernel: --- rd:3 wd:2
Feb  8 19:04:24 store-el6 kernel: disk 0, o:1, dev:md910p1
Feb  8 19:04:24 store-el6 kernel: disk 1, o:1, dev:md911p1
Feb  8 19:04:24 store-el6 kernel: md90: bitmap initialized from disk: read 1/1 pages, set 30 bits
Feb  8 19:04:24 store-el6 kernel: created bitmap (1 pages) for device md90
Feb  8 19:04:24 store-el6 kernel: md90: detected capacity change from 0 to 3986685952
Feb  8 19:04:24 store-el6 kernel: RAID5 conf printout:
Feb  8 19:04:24 store-el6 kernel: --- rd:3 wd:2
Feb  8 19:04:24 store-el6 kernel: disk 0, o:1, dev:md910p1
Feb  8 19:04:24 store-el6 kernel: disk 1, o:1, dev:md911p1
Feb  8 19:04:24 store-el6 kernel: disk 2, o:1, dev:md912p1
Feb  8 19:04:24 store-el6 kernel: md: recovery of RAID array md90
Feb  8 19:04:24 store-el6 kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Feb  8 19:04:24 store-el6 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Feb  8 19:04:24 store-el6 kernel: md: using 128k window, over a total of 1946624 blocks.
Feb  8 19:04:25 store-el6 kernel: md90: unknown partition table
Feb  8 19:05:50 store-el6 kernel: EXT4-fs (md90): mounted filesystem with ordered data mode
Feb  8 19:06:31 store-el6 kernel: md: md90: recovery done.
Feb  8 19:06:32 store-el6 kernel: RAID5 conf printout:
Feb  8 19:06:32 store-el6 kernel: --- rd:3 wd:3
Feb  8 19:06:32 store-el6 kernel: disk 0, o:1, dev:md910p1
Feb  8 19:06:32 store-el6 kernel: disk 1, o:1, dev:md911p1
Feb  8 19:06:32 store-el6 kernel: disk 2, o:1, dev:md912p1
Feb  8 19:06:35 store-el6 kernel: EXT4-fs (md90): mounted filesystem with ordered data mode

#
# here I started "rsync -av /somewhere /mnt/test-raid" in gnome-terminal
#

Feb  8 19:09:39 store-el6 kernel: INFO: task gnome-terminal:5463 blocked for more than 120 seconds.
Feb  8 19:09:39 store-el6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  8 19:09:39 store-el6 kernel: gnome-termina D ffff8800413ed998     0  5463   5459 0x00000080
Feb  8 19:09:39 store-el6 kernel: ffff8800413ed8b8 0000000000000086 ffff88004bd2e138 ffff88004bd2e138
Feb  8 19:09:39 store-el6 kernel: ffff880001e96980 ffff880001e96980 ffff880066e48080 ffff880001e96980
Feb  8 19:09:39 store-el6 kernel: ffff880066e48638 ffff8800413edfd8 0000000000010518 ffff880066e48638
Feb  8 19:09:39 store-el6 kernel: Call Trace:
Feb  8 19:09:39 store-el6 kernel: [<ffffffff814c9ad5>] schedule_timeout+0x225/0x2f0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8101ae45>] ? native_sched_clock+0x15/0x70
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8101a4c9>] ? sched_clock+0x9/0x10
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81097f25>] ? sched_clock_local+0x25/0x90
Feb  8 19:09:39 store-el6 kernel: [<ffffffff814c9743>] wait_for_common+0x123/0x180
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8105c530>] ? default_wake_function+0x0/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff814c985d>] wait_for_completion+0x1d/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8108d107>] flush_work+0x77/0xc0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8108cb90>] ? wq_barrier_func+0x0/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8108d324>] flush_delayed_work+0x54/0x70
Feb  8 19:09:39 store-el6 kernel: [<ffffffff812fd5d5>] tty_flush_to_ldisc+0x15/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff812f8117>] n_tty_poll+0x67/0x1d0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff812f3baa>] tty_poll+0x8a/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811824ab>] do_sys_poll+0x29b/0x520
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811820c0>] ? __pollwait+0x0/0xf0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811821b0>] ? pollwake+0x0/0x60
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8120c6cf>] ? selinux_file_permission+0xbf/0x150
Feb  8 19:09:39 store-el6 kernel: [<ffffffff811ffb76>] ? security_file_permission+0x16/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8116d9d1>] ? vfs_read+0x181/0x1a0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8118292c>] sys_poll+0x7c/0x110
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Feb  8 19:09:39 store-el6 kernel: INFO: task md90_raid5:13736 blocked for more than 120 seconds.
Feb  8 19:09:39 store-el6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  8 19:09:39 store-el6 kernel: md90_raid5    D 0000000000000000     0 13736      2 0x00000080
Feb  8 19:09:39 store-el6 kernel: ffff88003e0b7ab0 0000000000000046 ffff88006d273e60 ffff8800716b8240
Feb  8 19:09:39 store-el6 kernel: ffff88003e0b7a50 ffffffff8123a553 ffff880072ea4800 0000000000000810
Feb  8 19:09:39 store-el6 kernel: ffff88006f50ba98 ffff88003e0b7fd8 0000000000010518 ffff88006f50ba98
Feb  8 19:09:39 store-el6 kernel: Call Trace:
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8123a553>] ? elv_insert+0x133/0x1f0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810920ce>] ? prepare_to_wait+0x4e/0x80
Feb  8 19:09:39 store-el6 kernel: [<ffffffff813d0535>] md_make_request+0x85/0x230
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81241652>] ? generic_make_request+0x1b2/0x4f0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81241652>] generic_make_request+0x1b2/0x4f0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8106333a>] ? find_busiest_group+0x96a/0xb40
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa03d8d9d>] ops_run_io+0x22d/0x330 [raid456]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff813d1ef6>] ? md_super_write+0xd6/0xe0
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa03db9f5>] handle_stripe+0x4d5/0x22e0 [raid456]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81059db2>] ? finish_task_switch+0x42/0xd0
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa03ddc9f>] raid5d+0x49f/0x690 [raid456]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff813d182c>] md_thread+0x5c/0x130
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffff813d17d0>] ? md_thread+0x0/0x130
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091a76>] kthread+0x96/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810919e0>] ? kthread+0x0/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Feb  8 19:09:39 store-el6 kernel: INFO: task flush-9:90:13883 blocked for more than 120 seconds.
Feb  8 19:09:39 store-el6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  8 19:09:39 store-el6 kernel: flush-9:90    D 0000000000000001     0 13883      2 0x00000080
Feb  8 19:09:39 store-el6 kernel: ffff88003e16b6d0 0000000000000046 ffff88003e16b6a0 ffff880075384dd8
Feb  8 19:09:39 store-el6 kernel: ffff880070efa618 ffffffffa01e5a00 ffff880073744610 0000000000000000
Feb  8 19:09:39 store-el6 kernel: ffff88005023d0e8 ffff88003e16bfd8 0000000000010518 ffff88005023d0e8
Feb  8 19:09:39 store-el6 kernel: Call Trace:
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810920ce>] ? prepare_to_wait+0x4e/0x80
Feb  8 19:09:39 store-el6 kernel: [<ffffffff813d0535>] md_make_request+0x85/0x230
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81241652>] generic_make_request+0x1b2/0x4f0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8110e865>] ? mempool_alloc_slab+0x15/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091dcf>] ? wake_up_bit+0x2f/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81241a1f>] submit_bio+0x8f/0x120
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119d324>] submit_bh+0xf4/0x140
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119f320>] __block_write_full_page+0x1e0/0x3b0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119e9b0>] ? end_buffer_async_write+0x0/0x190
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01b4c00>] ? noalloc_get_block_write+0x0/0x60 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01b4c00>] ? noalloc_get_block_write+0x0/0x60 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119fc20>] block_write_full_page_endio+0xe0/0x120
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01af960>] ? ext4_bh_delay_or_unwritten+0x0/0x30 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119fc75>] block_write_full_page+0x15/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01b0ea6>] ext4_writepage+0xd6/0x3a0 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01b0d2c>] mpage_da_submit_io+0x14c/0x1d0 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01b6149>] ext4_da_writepages+0x3d9/0x600 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81120bc1>] do_writepages+0x21/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119656d>] writeback_single_inode+0xdd/0x2c0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119696e>] writeback_sb_inodes+0xce/0x180
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81196ac3>] writeback_inodes_wb+0xa3/0x1a0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81196e5b>] wb_writeback+0x29b/0x3f0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff814c8d96>] ? thread_return+0x4e/0x778
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81197149>] wb_do_writeback+0x199/0x240
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81197253>] bdi_writeback_task+0x63/0x1b0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091ca7>] ? bit_waitqueue+0x17/0xd0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8112ef70>] ? bdi_start_fn+0x0/0x100
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8112eff6>] bdi_start_fn+0x86/0x100
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8112ef70>] ? bdi_start_fn+0x0/0x100
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091a76>] kthread+0x96/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810919e0>] ? kthread+0x0/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Feb  8 19:09:39 store-el6 kernel: INFO: task jbd2/md90-8:17803 blocked for more than 120 seconds.
Feb  8 19:09:39 store-el6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  8 19:09:39 store-el6 kernel: jbd2/md90-8   D 0000000000000002     0 17803      2 0x00000080
Feb  8 19:09:39 store-el6 kernel: ffff8800741d5c10 0000000000000046 ffff8800741d5bb0 ffffffffa03d92e8
Feb  8 19:09:39 store-el6 kernel: ffff880047bc9c00 ffff880047bc9da0 ffff880047bc9c00 ffff880072044200
Feb  8 19:09:39 store-el6 kernel: ffff88003e183a98 ffff8800741d5fd8 0000000000010518 ffff88003e183a98
Feb  8 19:09:39 store-el6 kernel: Call Trace:
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa03d92e8>] ? unplug_slaves+0x98/0xe0 [raid456]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8109bae9>] ? ktime_get_ts+0xa9/0xe0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119e580>] ? sync_buffer+0x0/0x50
Feb  8 19:09:39 store-el6 kernel: [<ffffffff814c9533>] io_schedule+0x73/0xc0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119e5c0>] sync_buffer+0x40/0x50
Feb  8 19:09:39 store-el6 kernel: [<ffffffff814c9daf>] __wait_on_bit+0x5f/0x90
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119e580>] ? sync_buffer+0x0/0x50
Feb  8 19:09:39 store-el6 kernel: [<ffffffff814c9e58>] out_of_line_wait_on_bit+0x78/0x90
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091e20>] ? wake_bit_function+0x0/0x50
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119e576>] __wait_on_buffer+0x26/0x30
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa0184801>] jbd2_journal_commit_transaction+0x10d1/0x14e0 [jbd2]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa018a0b8>] kjournald2+0xb8/0x220 [jbd2]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa018a000>] ? kjournald2+0x0/0x220 [jbd2]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091a76>] kthread+0x96/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810919e0>] ? kthread+0x0/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Feb  8 19:09:39 store-el6 kernel: INFO: task rsync:17880 blocked for more than 120 seconds.
Feb  8 19:09:39 store-el6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  8 19:09:39 store-el6 kernel: rsync         D 0000000000000100     0 17880  17879 0x00000080
Feb  8 19:09:39 store-el6 kernel: ffff880012625918 0000000000000086 ffff880012625968 ffffffff8111e411
Feb  8 19:09:39 store-el6 kernel: ffff8800000126c0 ffffea0000697188 ffff8800772f5118 0000000000001000
Feb  8 19:09:39 store-el6 kernel: ffff880037db5a58 ffff880012625fd8 0000000000010518 ffff880037db5a58
Feb  8 19:09:39 store-el6 kernel: Call Trace:
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8111e411>] ? __alloc_pages_nodemask+0x111/0x850
Feb  8 19:09:39 store-el6 kernel: [<ffffffff810920ce>] ? prepare_to_wait+0x4e/0x80
Feb  8 19:09:39 store-el6 kernel: [<ffffffff813d0535>] md_make_request+0x85/0x230
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119d5ef>] ? __find_get_block_slow+0xaf/0x130
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81241652>] generic_make_request+0x1b2/0x4f0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8110e865>] ? mempool_alloc_slab+0x15/0x20
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119dcfc>] ? __getblk+0x9c/0x2e0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81241a1f>] submit_bio+0x8f/0x120
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119d324>] submit_bh+0xf4/0x140
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8119ec83>] ll_rw_block+0x143/0x150
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01baa59>] ext4_find_entry+0x179/0x4c0 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffffa01baded>] ext4_lookup+0x4d/0x140 [ext4]
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8117ae0b>] do_lookup+0x18b/0x220
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8117b9c5>] __link_path_walk+0x6f5/0x1040
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8117c59a>] path_walk+0x6a/0xe0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8117c76b>] do_path_lookup+0x5b/0xa0
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8116e6a1>] ? get_empty_filp+0xa1/0x170
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8117d6a6>] do_filp_open+0x106/0xd50
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8120c6cf>] ? selinux_file_permission+0xbf/0x150
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81189b82>] ? alloc_fd+0x92/0x160
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8116a1a9>] do_sys_open+0x69/0x140
Feb  8 19:09:39 store-el6 kernel: [<ffffffff8116a2c0>] sys_open+0x20/0x30
Feb  8 19:09:39 store-el6 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Feb  8 19:09:50 store-el6 ntpd[2024]: synchronized to 92.43.184.44, stratum 2

#
# here I was forced to do a hard reset; neither the remote X session nor the local ttys were responding
#

Feb  8 19:11:35 store-el6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Feb  8 19:11:35 store-el6 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="1670" x-info="http://www.rsyslog.com"] (re)start
Feb  8 19:11:35 store-el6 kernel: Initializing cgroup subsys cpuset
Feb  8 19:11:35 store-el6 kernel: Initializing cgroup subsys cpu
Feb  8 19:11:35 store-el6 kernel: Linux version 2.6.32-71.14.1.el6.x86_64 (mockbuild@ls20-bc2-14.build.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Wed Jan 5 17:01:01 EST 2011
Feb  8 19:11:35 store-el6 kernel: Command line: ro root=UUID=a59ea22b-6adf-497b-9409-4cf9062c0b99 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto rhgb quiet


[-- Attachment #4: raid-conf --]
[-- Type: application/octet-stream, Size: 651 bytes --]

DEVICE partitions
MAILADDR root

ARRAY /dev/md910 level=raid0 metadata=1.2 num-devices=2 
	UUID=5ebddd81:9a2d0aeb:d55f3c13:e1c532cf name=store-el6:test1
	devices=/dev/sdb8,/dev/sdb7
ARRAY /dev/md911 level=raid0 metadata=1.2 num-devices=2 
	UUID=6a06b35a:b39e97e4:77b26ada:f324e8fa name=store-el6:test2
	devices=/dev/sdb10,/dev/sdb9
ARRAY /dev/md912 level=raid0 metadata=1.2 num-devices=2 
	UUID=65c9e5f3:1594e59e:b41d1d8e:6f2671de name=store-el6:test3
	devices=/dev/sdb12,/dev/sdb11
ARRAY /dev/md90 level=raid5 metadata=1.2 num-devices=3 
	UUID=ae8ba75d:d0415f07:5d60eedd:394f28eb name=store-el6:test
	devices=/dev/md910p1,/dev/md911p1,/dev/md912p1



[-- Attachment #5: raid-details --]
[-- Type: application/octet-stream, Size: 898 bytes --]

/dev/md90:
        Version : 1.2
  Creation Time : Tue Feb  8 19:04:24 2011
     Raid Level : raid5
     Array Size : 3893248 (3.71 GiB 3.99 GB)
  Used Dev Size : 1946624 (1901.32 MiB 1993.34 MB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Feb  8 19:32:28 2011
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : store-el6:test  (local to host store-el6)
           UUID : 86914486:82756172:cf377f53:95c77f0a
         Events : 46

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/md910p1
       1     259        3        1      active sync   /dev/md911p1
       3     259        4        2      active sync   /dev/md912p1


[-- Attachment #6: raid-mdstat --]
[-- Type: application/octet-stream, Size: 482 bytes --]

Personalities : [raid0] [raid6] [raid5] [raid4] 
md90 : active raid5 md910p1[0] md912p1[3] md911p1[1]
      3893248 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md912 : active raid0 sdb12[1] sdb11[0]
      1956864 blocks super 1.2 512k chunks
      
md910 : active raid0 sdb7[0] sdb8[1]
      1949184 blocks super 1.2 512k chunks
      
md911 : active raid0 sdb9[0] sdb10[1]
      1956864 blocks super 1.2 512k chunks


* Re: Incompatibility of internal bitmap with ext4 barriers?
From: Roman Mamedov @ 2011-05-17 13:17 UTC
  To: Jason Tinker; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 934 bytes --]

On Tue, 17 May 2011 16:10:07 +0400
Jason Tinker <jsntinker@gmail.com> wrote:

> I have encountered a weird bug when trying to use an ext4 partition on
> top of an mdadm RAID5 array. The array has an internal bitmap, and ext4
> is mounted with default options - which means that barriers are enabled.

The same combination works perfectly for me on 2.6.38.

> When trying to write a large enough number of files, the system just
> locks up indefinitely; only a hard reset helps. This seems to happen at
> random times, yet consistently after several minutes of usage.
> I tried different configuration options, and it seems to happen only
> when both ext4 barriers and mdadm's internal bitmap are enabled.
> After disabling ext4 barriers for good (it seemed like the lesser evil),
> no lockups have happened for 3 months.
> mdadm is version 3.1.3, the kernel is 2.6.32 (RHEL 6).

How about a newer kernel, maybe?

-- 
With respect,
Roman

* Re: Incompatibility of internal bitmap with ext4 barriers?
From: Jason Tinker @ 2011-05-17 18:43 UTC
  To: Roman Mamedov; +Cc: linux-raid

2011/5/17 Roman Mamedov <rm@romanrm.ru>:
> On Tue, 17 May 2011 16:10:07 +0400
> Jason Tinker <jsntinker@gmail.com> wrote:
>
> The same combination works perfectly for me on 2.6.38.

Do you have RAID5 over RAID0s, or an ordinary RAID5? On my box an
ordinary RAID5 works perfectly too.

>
> How about a newer kernel, maybe?

This is out of the question, unfortunately. Compiling my own kernel on
top of RHEL would kind of defeat the purpose of having a safe,
low-maintenance distro. And both stable Debian and Ubuntu LTS have the
same kernel.

If your configuration is also RAID5 over RAID0 and it works fine on
2.6.38, then I'll just wait for the next Ubuntu LTS...

* Re: Incompatibility of internal bitmap with ext4 barriers?
From: Roman Mamedov @ 2011-05-17 19:00 UTC
  To: Jason Tinker; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 980 bytes --]

On Tue, 17 May 2011 22:43:48 +0400
Jason Tinker <jsntinker@gmail.com> wrote:

> Do you have RAID5 over RAID0s, or an ordinary RAID5? On my box an
> ordinary RAID5 works perfectly too.

Yes, I ran a RAID5 of 4x2TB + (1+1TB RAID0) + (1.5+0.5TB RAID0); I
recently changed to RAID6, though, and also replaced one of the RAID0s
with another 2TB drive.

> If your configuration is also RAID5 over RAID0 and it works fine on
> 2.6.38, then I'll just wait for the next Ubuntu LTS...

I am not saying you should just give up. The logs you posted look like
they could be very helpful in tracking this down. But if it is indeed a
kernel issue, how do you fix or even debug it without replacing or
compiling a new kernel, which is exactly what you do not want to do? And
if you're already replacing the kernel, why not try 2.6.38 right away,
to check whether your issue is already solved there (after all, 2.6.32
and 2.6.38 are eons apart BOTH mdadm-wise and ext4-wise)?

-- 
With respect,
Roman

* Re: Incompatibility of internal bitmap with ext4 barriers?
From: John Robinson @ 2011-05-17 20:20 UTC
  To: Roman Mamedov; +Cc: Jason Tinker, linux-raid

On 17/05/2011 20:00, Roman Mamedov wrote:
> On Tue, 17 May 2011 22:43:48 +0400
> Jason Tinker <jsntinker@gmail.com> wrote:
>
>> Do you have RAID5 over RAID0s, or an ordinary RAID5? On my box an
>> ordinary RAID5 works perfectly too.
>
> Yes, I ran a RAID5 of 4x2TB + (1+1TB RAID0) + (1.5+0.5TB RAID0); I
> recently changed to RAID6, though, and also replaced one of the RAID0s
> with another 2TB drive.
>
>> If your configuration is also RAID5 over RAID0 and it works fine on
>> 2.6.38, then I'll just wait for the next Ubuntu LTS...
>
> I am not saying you should just give up. The logs you posted look like
> they could be very helpful in tracking this down. But if it is indeed a
> kernel issue, how do you fix or even debug it without replacing or
> compiling a new kernel, which is exactly what you do not want to do? And
> if you're already replacing the kernel, why not try 2.6.38 right away,
> to check whether your issue is already solved there (after all, 2.6.32
> and 2.6.38 are eons apart BOTH mdadm-wise and ext4-wise)?

I've said it before and no doubt I'll say it again, but the RHEL kernels
are often a lot closer to the most recent than their version numbers
suggest. It's a little more difficult to tell with EL6, because Red Hat
don't ship vanilla+patches sources any more, but their .32 series
includes lots of backported fixes from .38 etc.

Jason, it's probably worth testing with a vanilla .38 to see if that
does fix your problem, then posting to Red Hat Bugzilla saying so,
especially as it looks to me like the configuration you're trying to use
ought to be a supported one - then a future RHEL .32 might well include
the fix.

Cheers,

John.
(Not a spokesman for Red Hat, nor anyone else other than myself, and 
sometimes not even that ;-)

