* Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D
@ 2014-10-17 17:50 ` Chris Friesen
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Friesen @ 2014-10-17 17:50 UTC (permalink / raw)
  To: linux-nfs, J. Bruce Fields, linux-ext4, tytso, adilger.kernel,
	linux-rt-users

Hi,

We've run into an odd bug and I'm looking for some advice on how to 
proceed.  Please excuse the cross-posting; I wasn't sure where this 
belongs.  Any guidance would be greatly appreciated.

The problem we're seeing is that by copying a large (10GB) file to an 
nfs-mounted directory we can cause all the nfsd threads to hang in D 
state on the server.

Background:

We've got a couple of systems running 3.4.97, with the "rt120" patch for 
CONFIG_PREEMPT_RT.  On the server we've got an nfs-exported ext4 
filesystem, on top of drbd (used for mirroring over the network via 
protocol C), on top of LVM (simple vg on top of pv, nothing fancy), on 
top of a SATA SSD.

Stuff we've ruled out:

With the above system we can cause a hang within 1-3 iterations of 
copying the file.  If we switch to spinning SATA disks we are unable to 
trigger the problem.  However, if we run within virtualbox with virtual 
PIIX4 IDE disks then we can reproduce the problem.  This seems to point 
to a timing issue as opposed to a hardware issue.

We've confirmed that we can reproduce the problem with drbd out of the 
picture (NFS-exported ext4 on LVM on virtual disks in virtualbox) but it 
takes longer than with the drbd disks.

We've confirmed that we can reproduce the problem with both "deadline" 
and "noop" I/O schedulers.


Stuff still to try:

We're working on setting up a minimal test-case with both the RT and 
standard kernels.  As it stands there are a lot of other components 
running that make it harder to rebuild with different kernel options.


Details:

These are the processes we saw stuck in disk sleep:

$ ps aux | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1797  4.2  0.0      0     0 ?        D    19:46   9:51 [nfsd]
root      1798  4.2  0.0      0     0 ?        D    19:46   9:54 [nfsd]
root      1799  4.2  0.0      0     0 ?        D    19:46   9:54 [nfsd]
root      1800  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
root      1801  4.2  0.0      0     0 ?        D    19:46   9:53 [nfsd]
root      1802  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
root      1803  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
root      1804  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
root      5305  0.0  0.0      0     0 ?        D    19:49   0:03 [jbd2/drbd3-8]
root      5634  0.0  0.0      0     0 ?        D    19:49   0:05 [flush-147:3]

Most of the nfsd threads are stuck waiting for the mutex on an inode in 
order to do a write to the disk. For example, the traceback for pid 1797 
[nfsd]:

[<ffffffff8110fd46>] generic_file_aio_write+0x56/0xe0
[<ffffffff8120591f>] ext4_file_write+0xbf/0x250
[<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
[<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
[<ffffffff81168dcc>] vfs_writev+0x3c/0x50
[<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
[<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
[<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
[<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
[<ffffffff81871679>] svc_process+0x499/0x790
[<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
[<ffffffff810609bb>] kthread+0xdb/0xe0
[<ffffffff8193b904>] kernel_thread_helper+0x4/0x10

Where generic_file_aio_write() is calling mutex_lock(&inode->i_mutex).

Looking through the other tracebacks, it appears that pid 1803 [nfsd] 
probably holds this mutex.  Also, looking at the block_start in 
/proc/1803/sched, it appears that this was the first process to block:

se.statistics.block_start          :      13948189.066634

Its traceback looks like this:
[<ffffffff8125dc55>] jbd2_log_wait_commit+0xc5/0x150
[<ffffffff8125f628>] jbd2_complete_transaction+0xb8/0xd0
[<ffffffff81205dcd>] ext4_sync_file+0x1fd/0x4b0
[<ffffffff81197ad5>] generic_write_sync+0x55/0x70
[<ffffffff8110fdb6>] generic_file_aio_write+0xc6/0xe0
[<ffffffff8120591f>] ext4_file_write+0xbf/0x250
[<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
[<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
[<ffffffff81168dcc>] vfs_writev+0x3c/0x50
[<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
[<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
[<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
[<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
[<ffffffff81871679>] svc_process+0x499/0x790
[<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
[<ffffffff810609bb>] kthread+0xdb/0xe0
[<ffffffff8193b904>] kernel_thread_helper+0x4/0x10

The inode->i_mutex seems to be taken in ext4_sync_file() before the call 
to jbd2_complete_transaction().

It looks like jbd2_log_wait_commit() has just called schedule() in 
wait_event() in the code below:

while (tid_gt(tid, journal->j_commit_sequence)) {
	jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
		tid, journal->j_commit_sequence);
	wake_up(&journal->j_wait_commit);
	read_unlock(&journal->j_state_lock);
	wait_event(journal->j_wait_done_commit,
		!tid_gt(tid, journal->j_commit_sequence));
	read_lock(&journal->j_state_lock);
}


The kjournald2 thread (pid 5305) blocks significantly later:
se.statistics.block_start          :      13953712.751778

Here is the traceback:
[<ffffffff8110e3ae>] sleep_on_page+0xe/0x20
[<ffffffff8110e5b8>] wait_on_page_bit+0x78/0x80
[<ffffffff8110e81c>] filemap_fdatawait_range+0x10c/0x1b0
[<ffffffff8110e8eb>] filemap_fdatawait+0x2b/0x30
[<ffffffff8125832c>] jbd2_journal_commit_transaction+0x8dc/0x1a70
	This calls journal_finish_inode_data_buffers which calls
	filemap_fdatawait()
[<ffffffff8125ddaf>] kjournald2+0xbf/0x280
[<ffffffff810609bb>] kthread+0xdb/0xe0
[<ffffffff8193b904>] kernel_thread_helper+0x4/0x10

It is stuck in jbd2_journal_commit_transaction, apparently waiting for 
inodes to be written to disk? I'm not sure what would be preventing that 
from happening.
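
That wait only ends once something clears PG_writeback on the page, 
which is normally end_page_writeback() called from the I/O completion 
path.  Purely as a userspace illustration of the shape of that wait 
(this is not kernel code, and every name in it is made up for the 
sketch), the commit side is just a waiter that can only be released by a 
separate completion routine; if that routine never runs, the waiter 
never wakes up, which is what kjournald2 looks like above:

/*
 * Userspace illustration only.  One thread waits for a "writeback" flag
 * to clear, the way filemap_fdatawait() ends up waiting on PG_writeback,
 * and only a separate completion routine clears it, the way
 * end_page_writeback() does.  Comment out the io_completion() call and
 * the waiter sleeps forever.  Build with: gcc -pthread model.c
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int writeback = 1;               /* stands in for PG_writeback */

static void *commit_waiter(void *arg)   /* stands in for the jbd2 commit */
{
	pthread_mutex_lock(&lock);
	while (writeback)               /* wait_on_page_bit(PG_writeback) */
		pthread_cond_wait(&done, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void io_completion(void)         /* stands in for end_page_writeback() */
{
	pthread_mutex_lock(&lock);
	writeback = 0;
	pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, commit_waiter, NULL);
	io_completion();
	pthread_join(t, NULL);
	printf("writeback complete, commit can continue\n");
	return 0;
}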


One of the nfsd processes is stuck in a different spot:
root      1804     2  3 Oct15 ?        00:09:52 [nfsd]

This is the second process to block:
se.statistics.block_start          :      13948189.759783

And the traceback:
[<ffffffff81256f85>] do_get_write_access+0x2b5/0x4e0
[<ffffffff81257300>] jbd2_journal_get_write_access+0x30/0x50
[<ffffffff8123848e>] __ext4_journal_get_write_access+0x3e/0x80
[<ffffffff8123efaf>] ext4_mb_mark_diskspace_used+0xff/0x530
[<ffffffff8124074f>] ext4_mb_new_blocks+0x1ef/0x5f0
[<ffffffff812361d5>] ext4_ext_map_blocks+0xc05/0x1e60
[<ffffffff8120ae15>] ext4_map_blocks+0x1c5/0x270
[<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
[<ffffffff8120fd5e>] ext4_da_writepages+0x36e/0x720
[<ffffffff8111a5f4>] do_writepages+0x24/0x40
[<ffffffff8110f5db>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8110f6ea>] filemap_write_and_wait_range+0x5a/0x80
[<ffffffff81205c48>] ext4_sync_file+0x78/0x4b0
[<ffffffff81197ad5>] generic_write_sync+0x55/0x70
[<ffffffff8110fdb6>] generic_file_aio_write+0xc6/0xe0
[<ffffffff8120591f>] ext4_file_write+0xbf/0x250
[<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
[<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
[<ffffffff81168dcc>] vfs_writev+0x3c/0x50
[<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
[<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
[<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
[<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
[<ffffffff81871679>] svc_process+0x499/0x790
[<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
[<ffffffff810609bb>] kthread+0xdb/0xe0
[<ffffffff8193b904>] kernel_thread_helper+0x4/0x10

I'm pretty sure it called schedule() below in the code from 
do_get_write_access():
if (jh->b_jlist == BJ_Shadow) {
	DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
	wait_queue_head_t *wqh;
	wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
	JBUFFER_TRACE(jh, "on shadow: sleep");
	jbd_unlock_bh_state(bh);
	/* commit wakes up all shadow buffers after IO */
	for ( ; ; ) {
		prepare_to_wait(wqh, &wait.wait,
				TASK_UNINTERRUPTIBLE);
		if (jh->b_jlist != BJ_Shadow)
			break;
		schedule();
	}
	finish_wait(wqh, &wait.wait);
	goto repeat;
}


And finally, there is a flush thread stuck as well:
root      5634     2  0 Oct15 ?        00:00:05 [flush-147:3]

This is the third process to block:
se.statistics.block_start          :      13948189.814929

And the traceback:
[<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
[<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
[<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
[<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
[<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
[<ffffffff8111a5f4>] do_writepages+0x24/0x40
[<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
[<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
[<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff811928e3>] wb_writeback+0x223/0x3f0
[<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
[<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
[<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
[<ffffffff810609bb>] kthread+0xdb/0xe0
[<ffffffff8193b904>] kernel_thread_helper+0x4/0x10

* Re: Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D
  2014-10-17 17:50 ` Chris Friesen
  (?)
@ 2014-10-17 18:01 ` Pavel Vasilyev
       [not found]   ` <CANGgnMbQmsdMDJUx7Bop9Xs=jQMmAJgWRjhXVFUGx-DwF=inYw@mail.gmail.com>
  -1 siblings, 1 reply; 28+ messages in thread
From: Pavel Vasilyev @ 2014-10-17 18:01 UTC (permalink / raw)
  To: Chris Friesen, linux-nfs, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, linux-rt-users

Ext4 is a kiddy fs, use XFS, Luke!

* Re: Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D
  2014-10-17 17:50 ` Chris Friesen
@ 2014-10-17 18:58     ` Austin Schuh
  -1 siblings, 0 replies; 28+ messages in thread
From: Austin Schuh @ 2014-10-17 18:58 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-nfs, J. Bruce Fields, linux-ext4, tytso, adilger.kernel, rt-users

On Fri, Oct 17, 2014 at 10:50 AM, Chris Friesen
<chris.friesen@windriver.com> wrote:
> Where generic_file_aio_write() is calling mutex_lock(&inode->i_mutex).


Use the rt121 patch.  This sounds very similar to the issue that I
helped debug with XFS.  There ended up being a deadlock due to a bug
in the kernel work queues.  You can search the RT archives for more
info.

* Re: Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D
  2014-10-17 17:50 ` Chris Friesen
@ 2014-10-17 19:12     ` Dmitry Monakhov
  -1 siblings, 0 replies; 28+ messages in thread
From: Dmitry Monakhov @ 2014-10-17 19:12 UTC (permalink / raw)
  To: Chris Friesen, linux-nfs, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, linux-rt-users

Chris Friesen <chris.friesen@windriver.com> writes:

> Hi,
>
> We've run into an odd bug and I'm looking for some advice on how to 
> proceed.  Please excuse the cross-posting, I wasn't sure where this 
> belongs.  Any guidance would be greatly appreciated.
>
> The problem we're seeing is that by copying a large (10GB) file to an 
> nfs-mounted directory we can cause all the nfsd threads to hang in D 
> state on the server.
>
> Background:
>
> We've got a couple of systems running 3.4.97, with the "rt120" patch for 
> CONFIG_PREEMPT_RT.  On the server we've got an nfs-exported ext4 
> filesystem, on top of drbd (used for mirroring over the network via 
> protocol C), on top of LVM (simple vg on top of pv, nothing fancy), on 
> top of a SATA SSD.
>
> Stuff we've ruled out:
>
> With the above system within 1-3 iterations of copying the file we can 
> cause a hang.  If we switch to spinning SATA disks we were unable to 
> trigger the problem.  However, if we run within virtualbox with virtual 
> PIIX4 IDE disks then we can reproduce the problem.  This seems to point 
> to a timing issue as opposed to a hardware issue.
>
> We've confirmed that we can reproduce the problem with drbd out of the 
> picture (NFS-exported ext4 on LVM on virtual disks in virtualbox) but it 
> takes longer than with the drbd disks.
>
> We've confirmed that we can reproduce the problem with both "deadline" 
> and "noop" I/O schedulers.
>
>
> Stuff still to try:
>
> We're working on setting up a minimal test-case with both the RT and 
> standard kernels.  As it stands there are a lot of other components 
> running that make it harder to rebuild with different kernel options.
>
>
> Details:
>
> These are the processes we saw stuck in disk sleep:
>
> $ ps aux | grep D
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root      1797  4.2  0.0      0     0 ?        D    19:46   9:51 [nfsd]
> root      1798  4.2  0.0      0     0 ?        D    19:46   9:54 [nfsd]
> root      1799  4.2  0.0      0     0 ?        D    19:46   9:54 [nfsd]
> root      1800  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      1801  4.2  0.0      0     0 ?        D    19:46   9:53 [nfsd]
> root      1802  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      1803  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      1804  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      5305  0.0  0.0      0     0 ?        D    19:49   0:03 
> [jbd2/drbd3-8]
> root      5634  0.0  0.0      0     0 ?        D    19:49   0:05 
> [flush-147:3]
>
> Most of the nfsd threads are stuck waiting for the mutex on an inode in 
> order to do a write to the disk. For example, the traceback for pid 1797 
> [nfsd]:
>
> [<ffffffff8110fd46>] generic_file_aio_write+0x56/0xe0
> [<ffffffff8120591f>] ext4_file_write+0xbf/0x250
> [<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
> [<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
> [<ffffffff81168dcc>] vfs_writev+0x3c/0x50
> [<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
> [<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
> [<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
> [<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
> [<ffffffff81871679>] svc_process+0x499/0x790
> [<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> Where generic_file_aio_write() is calling mutex_lock(&inode->i_mutex).
>
> Looking through the other tracebacks, it appears that pid 1803 [nfsd] 
> probably has this mutex. Also, looking at the block_start in 
> /proc/1803/sched (, it appears that this was the first process to block:
>
> se.statistics.block_start          :      13948189.066634
>
> Its traceback looks like this:
> [<ffffffff8125dc55>] jbd2_log_wait_commit+0xc5/0x150
> [<ffffffff8125f628>] jbd2_complete_transaction+0xb8/0xd0
> [<ffffffff81205dcd>] ext4_sync_file+0x1fd/0x4b0
> [<ffffffff81197ad5>] generic_write_sync+0x55/0x70
> [<ffffffff8110fdb6>] generic_file_aio_write+0xc6/0xe0
> [<ffffffff8120591f>] ext4_file_write+0xbf/0x250
> [<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
> [<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
> [<ffffffff81168dcc>] vfs_writev+0x3c/0x50
> [<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
> [<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
> [<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
> [<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
> [<ffffffff81871679>] svc_process+0x499/0x790
> [<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> The inode-i_mutex seems to be taken in ext4_sync_file() before the call 
> to jbd2_complete_transaction().
>
> It looks like jbd2_log_wait_commit() has just called schedule() in 
> wait_event() in the code below:
>
> while (tid_gt(tid, journal->j_commit_sequence)) {
> 	jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
> 		tid, journal->j_commit_sequence);
> 	wake_up(&journal->j_wait_commit);
> 	read_unlock(&journal->j_state_lock);
> 	wait_event(journal->j_wait_done_commit,
> 		!tid_gt(tid, journal->j_commit_sequence));
> 	read_lock(&journal->j_state_lock);
> }
>
>
> The kjournald2 thread (pid 5305) blocks significantly later:
> se.statistics.block_start          :      13953712.751778
>
> Here is the traceback:
> [<ffffffff8110e3ae>] sleep_on_page+0xe/0x20
> [<ffffffff8110e5b8>] wait_on_page_bit+0x78/0x80
> [<ffffffff8110e81c>] filemap_fdatawait_range+0x10c/0x1b0
> [<ffffffff8110e8eb>] filemap_fdatawait+0x2b/0x30
> [<ffffffff8125832c>] jbd2_journal_commit_transaction+0x8dc/0x1a70
> 	This calls journal_finish_inode_data_buffers which calls
> 	filemap_fdatawait()
> [<ffffffff8125ddaf>] kjournald2+0xbf/0x280
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> It is stuck in jbd2_journal_commit_transaction, apparently waiting for 
> inodes to be written to disk? I'm not sure what would be preventing that 
> from happening.
>
>
> One of the nfsd processes is stuck in a different spot:
> root      1804     2  3 Oct15 ?        00:09:52 [nfsd]
>
> This is the second process to block:
> se.statistics.block_start          :      13948189.759783
>
> And the traceback:
> [<ffffffff81256f85>] do_get_write_access+0x2b5/0x4e0
> [<ffffffff81257300>] jbd2_journal_get_write_access+0x30/0x50
> [<ffffffff8123848e>] __ext4_journal_get_write_access+0x3e/0x80
> [<ffffffff8123efaf>] ext4_mb_mark_diskspace_used+0xff/0x530
> [<ffffffff8124074f>] ext4_mb_new_blocks+0x1ef/0x5f0
> [<ffffffff812361d5>] ext4_ext_map_blocks+0xc05/0x1e60
> [<ffffffff8120ae15>] ext4_map_blocks+0x1c5/0x270
> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> [<ffffffff8120fd5e>] ext4_da_writepages+0x36e/0x720
> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> [<ffffffff8110f5db>] __filemap_fdatawrite_range+0x5b/0x60
> [<ffffffff8110f6ea>] filemap_write_and_wait_range+0x5a/0x80
> [<ffffffff81205c48>] ext4_sync_file+0x78/0x4b0
> [<ffffffff81197ad5>] generic_write_sync+0x55/0x70
> [<ffffffff8110fdb6>] generic_file_aio_write+0xc6/0xe0
> [<ffffffff8120591f>] ext4_file_write+0xbf/0x250
> [<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
> [<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
> [<ffffffff81168dcc>] vfs_writev+0x3c/0x50
> [<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
> [<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
> [<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
> [<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
> [<ffffffff81871679>] svc_process+0x499/0x790
> [<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> I'm pretty sure it called schedule() below in the code from 
> do_get_write_access():
> if (jh->b_jlist == BJ_Shadow) {
> 	DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
> 	wait_queue_head_t *wqh;
> 	wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
> 	JBUFFER_TRACE(jh, "on shadow: sleep");
> 	jbd_unlock_bh_state(bh);
> 	/* commit wakes up all shadow buffers after IO */
> 	for ( ; ; ) {
> 		prepare_to_wait(wqh, &wait.wait,
> 		TASK_UNINTERRUPTIBLE);
> 		if (jh->b_jlist != BJ_Shadow)
> 			break;
> 		schedule();
> 	}
> 	finish_wait(wqh, &wait.wait);
> 	goto repeat;
> }
>
>
> And finally, there is a flush stuck as well:
> root      5634     2  0 Oct15 ?        00:00:05 [flush-147:3]
Please check that you have this patch in your tree:
https://lkml.org/lkml/2014/3/7/612
There is also another bug, fixed a long time ago, that was related to
incorrect barrier handling.  Please disable barriers temporarily
(mount with "-o nobarrier") and try to reproduce the issue.
>
> This is the third process to block:
> se.statistics.block_start          :      13948189.814929
>
> And the traceback:
> [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
> [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
> [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
> [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
> [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
> [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
> [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
> [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>

* Re: Hang writing to nfs-mounted filesystem from client -- expected code path?
  2014-10-17 17:50 ` Chris Friesen
@ 2014-10-18 17:05     ` Chris Friesen
  -1 siblings, 0 replies; 28+ messages in thread
From: Chris Friesen @ 2014-10-18 17:05 UTC (permalink / raw)
  To: linux-nfs, J. Bruce Fields, linux-ext4, tytso, adilger.kernel,
	linux-rt-users

On 10/17/2014 11:50 AM, Chris Friesen wrote:
> Looking through the other tracebacks, it appears that pid 1803 [nfsd]
> probably has this mutex. Also, looking at the block_start in
> /proc/1803/sched (, it appears that this was the first process to block:
>
> se.statistics.block_start          :      13948189.066634
>
> Its traceback looks like this:
> [<ffffffff8125dc55>] jbd2_log_wait_commit+0xc5/0x150
> [<ffffffff8125f628>] jbd2_complete_transaction+0xb8/0xd0
> [<ffffffff81205dcd>] ext4_sync_file+0x1fd/0x4b0
> [<ffffffff81197ad5>] generic_write_sync+0x55/0x70
> [<ffffffff8110fdb6>] generic_file_aio_write+0xc6/0xe0
> [<ffffffff8120591f>] ext4_file_write+0xbf/0x250
> [<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
> [<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
> [<ffffffff81168dcc>] vfs_writev+0x3c/0x50
> [<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
> [<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
> [<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
> [<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
> [<ffffffff81871679>] svc_process+0x499/0x790
> [<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> The inode-i_mutex seems to be taken in ext4_sync_file() before the call
> to jbd2_complete_transaction().
>
> It looks like jbd2_log_wait_commit() has just called schedule() in
> wait_event() in the code below:
>
> while (tid_gt(tid, journal->j_commit_sequence)) {
>      jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
>          tid, journal->j_commit_sequence);
>      wake_up(&journal->j_wait_commit);
>      read_unlock(&journal->j_state_lock);
>      wait_event(journal->j_wait_done_commit,
>          !tid_gt(tid, journal->j_commit_sequence));
>      read_lock(&journal->j_state_lock);
> }
>
>
> The kjournald2 thread (pid 5305) blocks significantly later:
> se.statistics.block_start          :      13953712.751778
>
> Here is the traceback:
> [<ffffffff8110e3ae>] sleep_on_page+0xe/0x20
> [<ffffffff8110e5b8>] wait_on_page_bit+0x78/0x80
> [<ffffffff8110e81c>] filemap_fdatawait_range+0x10c/0x1b0
> [<ffffffff8110e8eb>] filemap_fdatawait+0x2b/0x30
> [<ffffffff8125832c>] jbd2_journal_commit_transaction+0x8dc/0x1a70
>      This calls journal_finish_inode_data_buffers which calls
>      filemap_fdatawait()
> [<ffffffff8125ddaf>] kjournald2+0xbf/0x280
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> It is stuck in jbd2_journal_commit_transaction, apparently waiting for
> inodes to be written to disk? I'm not sure what would be preventing that
> from happening.

Given the above traces, it appears that nfsd is waiting for jbd2 to 
commit a transaction to the journal, and jbd2 is waiting for a page to 
finish writing to disk.

Can someone point me to the code path that actually starts the writing 
process, and also the code path we would expect to follow when the page 
has been written?  I assume we're going to be calling 
set_page_writeback/end_page_writeback, but I don't know how we're going 
to get there given the above tracebacks.

Thanks,
Chris



* RT/ext4/jbd2 circular dependency (was: Re: Hang writing to nfs-mounted filesystem from client)
       [not found]   ` <CANGgnMbQmsdMDJUx7Bop9Xs=jQMmAJgWRjhXVFUGx-DwF=inYw@mail.gmail.com>
@ 2014-10-23 17:54     ` Chris Friesen
  2014-10-26 14:25       ` Thomas Gleixner
  2014-10-27 19:57       ` Chris Friesen
  0 siblings, 2 replies; 28+ messages in thread
From: Chris Friesen @ 2014-10-23 17:54 UTC (permalink / raw)
  To: Austin Schuh, pavel
  Cc: J. Bruce Fields, linux-ext4, tytso, adilger.kernel, rt-users

On 10/17/2014 12:55 PM, Austin Schuh wrote:
> Use the 121 patch.  This sounds very similar to the issue that I helped
> debug with XFS.  There ended up being a deadlock due to a bug in the
> kernel work queues.  You can search the RT archives for more info.

I can confirm that the problem still shows up with the rt121 patch. (And
also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)

We added some instrumentation and it looks like we've tracked down the problem.
Figuring out how to fix it is proving to be tricky.

Basically it looks like we have a circular dependency involving the
inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list.  It
goes something like this:

jbd2_journal_commit_transaction:
1) set page for writeback (set PG_writeback bit)
2) put jbd2 journal head on BJ_Shadow list
3) sleep on PG_writeback bit waiting for page writeback complete

ext4_da_writepages:
1) ext4_map_blocks() acquires inode->i_data_sem for writing
2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
the BJ_Shadow list

At this point the flush code can't run because it can't acquire
inode->i_data_sem for reading, so the page will never get written out.
Deadlock.
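
To make the cycle easier to see, below is a minimal userspace model of 
it.  This is an illustration only, not kernel code: the thread names, 
the two flags, and the pthread rwlock standing in for i_data_sem are all 
inventions of the sketch.  Run it and the three threads end up blocked 
on each other, much like the D-state tasks in the original report:

/*
 * Toy model of the circular dependency described above.
 * Build with: gcc -pthread cycle.c
 * After ~2 seconds main() reports that all three threads are still stuck.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t i_data_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int page_writeback;	/* stands in for PG_writeback             */
static int on_shadow_list;	/* stands in for a jh on BJ_Shadow        */

static void *commit(void *arg)	/* jbd2_journal_commit_transaction side   */
{
	pthread_mutex_lock(&lock);
	page_writeback = 1;	/* 1) page marked for writeback           */
	on_shadow_list = 1;	/* 2) jh filed on BJ_Shadow               */
	while (page_writeback)	/* 3) wait for writeback to complete      */
		pthread_cond_wait(&cond, &lock);
	on_shadow_list = 0;	/* would follow the wait; never reached   */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *writer(void *arg)	/* nfsd in the ext4_da_writepages path    */
{
	usleep(100000);		/* the sleeps only order the illustration */
	pthread_rwlock_wrlock(&i_data_sem);	/* 1) ext4_map_blocks()   */
	pthread_mutex_lock(&lock);
	while (on_shadow_list)	/* 2) do_get_write_access() waits         */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	pthread_rwlock_unlock(&i_data_sem);
	return NULL;
}

static void *flusher(void *arg)	/* whatever would get the page written    */
{
	usleep(200000);
	pthread_rwlock_rdlock(&i_data_sem);	/* blocked behind writer  */
	pthread_mutex_lock(&lock);
	page_writeback = 0;	/* would be the end_page_writeback()      */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	pthread_rwlock_unlock(&i_data_sem);
	return NULL;
}

int main(void)
{
	pthread_t a, b, c;

	pthread_create(&a, NULL, commit, NULL);
	pthread_create(&b, NULL, writer, NULL);
	pthread_create(&c, NULL, flusher, NULL);
	sleep(2);
	pthread_mutex_lock(&lock);
	printf("still blocked: writeback=%d shadow=%d\n",
	       page_writeback, on_shadow_list);
	pthread_mutex_unlock(&lock);
	return 0;	/* process exit tears down the stuck threads */
}

In the model the flusher plays the role of whatever would actually get 
the page written out and its writeback bit cleared; as described above, 
it can never even start because the writer is holding i_data_sem.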


The following is a more detailed timeline with information from added trace
events:

 nfsd-2012  [003] ....1..  8612.903541: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903546: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903559: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903565: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903611: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903616: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-5960  [004] ....1..  8612.903628: jbd2_submit_inode_data: dev 147,3 ino 2097160
<...>-5960  [004] ....111  8612.903651: jbd2_list_add_bjshadow: adding jh ffff880415350000 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903653: jbd2_list_add_bjshadow: adding jh ffff8803eb08dbd0 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903655: jbd2_list_add_bjshadow: adding jh ffff8803eb08d150 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903656: jbd2_list_add_bjshadow: adding jh ffff8803eb08d0e0 to transaction ffff880415391180 BJ_Shadow list
<...>-5960  [004] ....111  8612.903657: jbd2_list_add_bjshadow: adding jh ffff88031c9449a0 to transaction ffff880415391180 BJ_Shadow list
 nfsd-2012  [003] ....1..  8612.903658: ext4_map_blocks_down_write: dev 147,3 ino 2097161
<...>-5960  [004] ....111  8612.903658: jbd2_list_add_bjshadow: adding jh ffff88031c944310 to transaction ffff880415391180 BJ_Shadow list
 nfsd-2012  [003] ....1..  8612.903665: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-5960  [004] ....1..  8612.903696: jbd2_finish_inode_data: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903706: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903714: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903802: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903814: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903960: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.903983: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904311: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904318: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904331: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904337: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904399: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904408: ext4_map_blocks_up_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904678: ext4_map_blocks_down_write: dev 147,3 ino 2097161
 nfsd-2012  [003] ....1..  8612.904772: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-2015  [007] ....1..  8612.934515: ext4_map_blocks_down_write: dev 147,3 ino 2097161
<...>-2015  [007] ....1..  8612.934525: jbd2_list_sleep_bjshadow: waiting for jh ffff8803eb08dbd0 from transaction ffff880415391180 to be removed from BJ_Shadow list



Timeline:

pid 5960 is [jbd2/drbd3-8]
pid 2015 is [nfsd]
pid 2012 is [nfsd]

pid 5960:
8612.903628: jbd2_submit_inode_data for inode 2097160.
	     This is right before calling journal_submit_inode_data_buffers(), which
             ends up calling set_page_writeback().
8612.903653: Add jh ffff8803eb08dbd0 to BJ_Shadow list.
	     This is in jbd2_journal_write_metadata_buffer() right before calling
             __jbd2_journal_file_buffer().
8612.903696: This is in journal_finish_inode_data_buffers(), right before calling
             filemap_fdatawait() which ends up calling wait_on_page_bit(page,
             PG_writeback)
<we see no more logs for pid 5960 after this>


pid 2015:
8612.934515: takes write lock on inode->i_data_sem for inode 2097161
8612.934525: goes to sleep waiting for jh ffff8803eb08dbd0 to be removed from
             BJ_Shadow list
<we see no more logs for pid 2015 after this>

pid 2012:
8617.963896: hits 5-sec retry limit and stops the trace.  This means it blocked
             trying to get a read lock on inode->i_data_sem for inode 2097161
             at time 8612.963.


Chris

* Re: RT/ext4/jbd2 circular dependency (was: Re: Hang writing to nfs-mounted filesystem from client)
  2014-10-23 17:54     ` RT/ext4/jbd2 circular dependency (was: Re: Hang writing to nfs-mounted filesystem from client) Chris Friesen
@ 2014-10-26 14:25       ` Thomas Gleixner
  2014-10-27 16:22         ` RT/ext4/jbd2 circular dependency Chris Friesen
  2014-10-27 19:57       ` Chris Friesen
  1 sibling, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2014-10-26 14:25 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, rt-users

On Thu, 23 Oct 2014, Chris Friesen wrote:
> On 10/17/2014 12:55 PM, Austin Schuh wrote:
> > Use the 121 patch.  This sounds very similar to the issue that I helped
> > debug with XFS.  There ended up being a deadlock due to a bug in the
> > kernel work queues.  You can search the RT archives for more info.
> 
> I can confirm that the problem still shows up with the rt121 patch. (And
> also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)
 
> We added some instrumentation and it looks like we've tracked down the problem.
> Figuring out how to fix it is proving to be tricky.
> 
> Basically it looks like we have a circular dependency involving the
> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list.  It
> goes something like this:
> 
> jbd2_journal_commit_transaction:
> 1) set page for writeback (set PG_writeback bit)
> 2) put jbd2 journal head on BJ_Shadow list
> 3) sleep on PG_writeback bit waiting for page writeback complete
> 
> ext4_da_writepages:
> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
> the BJ_Shadow list
> 
> At this point the flush code can't run because it can't acquire
> inode->i_data_sem for reading, so the page will never get written out.
> Deadlock.

Sorry, I really cannot map that sparse description to any code
flow. Proper callchains for the involved parts might help to actually
understand what you are looking for.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-26 14:25       ` Thomas Gleixner
@ 2014-10-27 16:22         ` Chris Friesen
  2014-10-29 18:05           ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Friesen @ 2014-10-27 16:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, rt-users

On 10/26/2014 08:25 AM, Thomas Gleixner wrote:
> On Thu, 23 Oct 2014, Chris Friesen wrote:
>> On 10/17/2014 12:55 PM, Austin Schuh wrote:
>>> Use the 121 patch.  This sounds very similar to the issue that I helped
>>> debug with XFS.  There ended up being a deadlock due to a bug in the
>>> kernel work queues.  You can search the RT archives for more info.
>>
>> I can confirm that the problem still shows up with the rt121 patch. (And
>> also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)
>
>> We added some instrumentation and it looks like we've tracked down the problem.
>> Figuring out how to fix it is proving to be tricky.
>>
>> Basically it looks like we have a circular dependency involving the
>> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list.  It
>> goes something like this:
>>
>> jbd2_journal_commit_transaction:
>> 1) set page for writeback (set PG_writeback bit)
>> 2) put jbd2 journal head on BJ_Shadow list
>> 3) sleep on PG_writeback bit waiting for page writeback complete
>>
>> ext4_da_writepages:
>> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
>> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
>> the BJ_Shadow list
>>
>> At this point the flush code can't run because it can't acquire
>> inode->i_data_sem for reading, so the page will never get written out.
>> Deadlock.
>
> Sorry, I really cannot map that sparse description to any code
> flow. Proper callchains for the involved parts might help to actually
> understand what you are looking for.

There are details (stack traces, etc.) in the first message in the thread:
http://www.spinics.net/lists/linux-rt-users/msg12261.html


Originally we had thought that nfsd might have been implicated somehow, 
but it seems like it was just a trigger (possibly by increasing the rate 
of sync I/O).

In the interest of full disclosure I should point out that we're using a 
modified kernel so there is a chance that we have introduced the problem 
ourselves.  That said, we have not made significant changes to either 
ext4 or jbd2.  (Just a couple of minor cherry-picked bugfixes.)


The relevant code paths are:

Journal commit.  The important thing here is that we set the 
PG_writeback on a page, put the jbd2 journal head on BJ_Shadow list, 
then sleep waiting for page writeback complete.  If the page writeback 
never completes, then the journal head never comes off the BJ_Shadow list.


jbd2_journal_commit_transaction
     journal_submit_data_buffers
         journal_submit_inode_data_buffers
             generic_writepages
                 set_page_writeback(page) [PG_writeback]
     jbd2_journal_write_metadata_buffer
         __jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);

     journal_finish_inode_data_buffers
         filemap_fdatawait
             filemap_fdatawait_range
                 wait_on_page_writeback(page)
                     wait_on_page_bit(page, PG_writeback) <--stuck here
     jbd2_journal_unfile_buffer(journal, jh) [delete from BJ_Shadow list]



We can get to the code path below a couple of different ways (see 
further down).  The important stuff here is:
1) There is a code path that takes i_data_sem and then goes to sleep 
waiting for the jbd2 journal head to be removed from the BJ_Shadow list. 
  If the journal head never comes off the list, the sema will never be 
released.
2) ext4_map_blocks() always takes a read lock on i_data_sem.  If the 
sema is held by someone waiting for the journal head to come off the 
list, it will block.

ext4_da_writepages
     write_cache_pages_da
         mpage_da_map_and_submit
             ext4_map_blocks
                 down_read((&EXT4_I(inode)->i_data_sem))
                 up_read((&EXT4_I(inode)->i_data_sem))
                 down_write((&EXT4_I(inode)->i_data_sem))
                 ext4_ext_map_blocks
                     ext4_mb_new_blocks
                         ext4_mb_mark_diskspace_used
                             __ext4_journal_get_write_access
                                 jbd2_journal_get_write_access
                                     do_get_write_access
                                         wait on BJ_Shadow list



One of the ways we end up at ext4_da_writepages() is via the page 
writeback thread.  If i_data_sem is already held by someone that is 
sleeping, this can result in pages not getting written out.

bdi_writeback_thread
     wb_do_writeback
         wb_check_old_data_flush
             wb_writeback
                 __writeback_inodes_wb
                     writeback_sb_inodes
                         writeback_single_inode
                             do_writepages
                                 ext4_da_writepages


Another way to end up at ext4_da_writepages() is via sync writev() 
calls.  In the traces from my original report this ended up taking the 
sema and then going to sleep waiting for the journal head to get removed 
from the BJ_Shadow list.

sys_writev
     vfs_writev
         do_readv_writev
             do_sync_readv_writev
                 ext4_file_write
                     generic_file_aio_write
                         generic_write_sync
                             ext4_sync_file
                                 filemap_write_and_wait_range
                                      __filemap_fdatawrite_range
                                          do_writepages
                                              ext4_da_writepages
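
Putting the three paths above together, the cycle we think we are seeing
is (just restating the above in one place):

     jbd2 commit:    jh on BJ_Shadow, PG_writeback set on the data page,
                     sleeps until PG_writeback clears
     sync writer:    holds i_data_sem for writing in ext4_map_blocks(),
                     sleeps until the jh leaves BJ_Shadow
     flush thread:   blocked taking i_data_sem for reading in
                     ext4_map_blocks(), so (we suspect) the page never gets
                     written out and PG_writeback never clears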


Chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-23 17:54     ` RT/ext4/jbd2 circular dependency (was: Re: Hang writing to nfs-mounted filesystem from client) Chris Friesen
  2014-10-26 14:25       ` Thomas Gleixner
@ 2014-10-27 19:57       ` Chris Friesen
  1 sibling, 0 replies; 28+ messages in thread
From: Chris Friesen @ 2014-10-27 19:57 UTC (permalink / raw)
  To: Austin Schuh, pavel
  Cc: J. Bruce Fields, linux-ext4, tytso, adilger.kernel, rt-users

On 10/23/2014 11:54 AM, Chris Friesen wrote:

> Basically it looks like we have a circular dependency involving the
> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list.  It
> goes something like this:
>
> jbd2_journal_commit_transaction:
> 1) set page for writeback (set PG_writeback bit)
> 2) put jbd2 journal head on BJ_Shadow list
> 3) sleep on PG_writeback bit waiting for page writeback complete
>
> ext4_da_writepages:
> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
> the BJ_Shadow list
>
> At this point the flush code can't run because it can't acquire
> inode->i_data_sem for reading, so the page will never get written out.
> Deadlock.

Just curious...would we expect lockdep to detect this sort of thing?  I 
wasn't sure if the introduction of the two wait queues would cause 
complications.
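
My (possibly wrong) understanding is that lockdep only learns about
lock-class acquire/release, so the PG_writeback and BJ_Shadow waits would
be invisible to it unless they were annotated as pseudo-locks, the way the
workqueue code annotates work items.  A purely hypothetical sketch of what
I mean (names made up, this is not existing jbd2 code):

static struct lock_class_key shadow_wait_key;
static struct lockdep_map shadow_wait_map =
	STATIC_LOCKDEP_MAP_INIT("jbd2_shadow_wait", &shadow_wait_key);

	/* on the side that is about to wait for the jh to leave BJ_Shadow */
	lock_map_acquire(&shadow_wait_map);
	lock_map_release(&shadow_wait_map);
	/* ...then sleep until the buffer leaves BJ_Shadow... */

	/* in the commit thread, for as long as the jh sits on BJ_Shadow */
	lock_map_acquire(&shadow_wait_map);
	/* ...submit and wait for the data writeback... */
	lock_map_release(&shadow_wait_map);

Without something like that (plus a similar annotation around the
page-writeback wait) I wouldn't expect lockdep to see the cycle, but I may
well be missing something.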

Chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-27 16:22         ` RT/ext4/jbd2 circular dependency Chris Friesen
@ 2014-10-29 18:05           ` Thomas Gleixner
  2014-10-29 19:11             ` Chris Friesen
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2014-10-29 18:05 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, rt-users

On Mon, 27 Oct 2014, Chris Friesen wrote:
> There are details (stack traces, etc.) in the first message in the thread:
> http://www.spinics.net/lists/linux-rt-users/msg12261.html
> 
> 
> Originally we had thought that nfsd might have been implicated somehow, but it
> seems like it was just a trigger (possibly by increasing the rate of sync
> I/O).
> 
> In the interest of full disclosure I should point out that we're using a
> modified kernel so there is a chance that we have introduced the problem
> ourselves.  That said, we have not made significant changes to either ext4 or
> jbd2.  (Just a couple of minor cherry-picked bugfixes.)

I don't think it's an ext4/jdb2 problem.
 
> The relevant code paths are:
> 
> Journal commit.  The important thing here is that we set the PG_writeback on a
> page, put the jbd2 journal head on BJ_Shadow list, then sleep waiting for page
> writeback complete.  If the page writeback never completes, then the journal
> head never comes off the BJ_Shadow list.

And that's what you need to investigate.

The rest of the threads being stuck waiting for the journal writeback
or inode->sem are just the consequence of it and have nothing to do
with the root cause of the problem.

ftrace with the block/writeback/jbd2/ext4/sched tracepoints enabled
should provide a first insight into the issue.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 18:05           ` Thomas Gleixner
@ 2014-10-29 19:11             ` Chris Friesen
  2014-10-29 19:26               ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Friesen @ 2014-10-29 19:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, rt-users

On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
> On Mon, 27 Oct 2014, Chris Friesen wrote:
>> There are details (stack traces, etc.) in the first message in the thread:
>> http://www.spinics.net/lists/linux-rt-users/msg12261.html
>>
>>
>> Originally we had thought that nfsd might have been implicated somehow, but it
>> seems like it was just a trigger (possibly by increasing the rate of sync
>> I/O).
>>
>> In the interest of full disclosure I should point out that we're using a
>> modified kernel so there is a chance that we have introduced the problem
>> ourselves.  That said, we have not made significant changes to either ext4 or
>> jbd2.  (Just a couple of minor cherry-picked bugfixes.)
>
> I don't think it's an ext4/jdb2 problem.

If we turn off journalling in ext4 we can't reproduce the problem.  Not 
conclusive, I'll admit...but interesting.
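
In case it's useful to anyone trying to reproduce this: disabling
journalling for a test like ours can be done along these lines (device and
mount point names made up):

  umount /export/test
  tune2fs -O ^has_journal /dev/mapper/vg0-test       # remove the journal
  dumpe2fs -h /dev/mapper/vg0-test | grep features   # check has_journal is gone
  mount /dev/mapper/vg0-test /export/test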

>> The relevant code paths are:
>>
>> Journal commit.  The important thing here is that we set the PG_writeback on a
>> page, put the jbd2 journal head on BJ_Shadow list, then sleep waiting for page
>> writeback complete.  If the page writeback never completes, then the journal
>> head never comes off the BJ_Shadow list.
>
> And that's what you need to investigate.
>
> The rest of the threads being stuck waiting for the journal writeback
> or inode->sem are just the consequence of it and have nothing to do
> with the root cause of the problem.
>
> ftrace with the block/writeback/jbd2/ext4/sched tracepoints enabled
> should provide a first insight into the issue.

It seems plausible that the reason why page writeback never completes is 
that it's blocking trying to take inode->i_data_sem for reading, as seen 
in the following stack trace (from a hung system):

[<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
[<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
[<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
[<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
[<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
[<ffffffff8111a5f4>] do_writepages+0x24/0x40
[<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
[<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
[<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff811928e3>] wb_writeback+0x223/0x3f0
[<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
[<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
[<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320

I have ftrace logs for two of the three components that we think are 
involved.  I don't have ftrace logs for the above writeback case.  My 
instrumentation was set up to end tracing when someone blocked for 5 
seconds trying to get inode->i_data_sem, and it happened to be an nfsd 
task instead of the page writeback code.  I could conceivably modify the 
instrumentation to only get triggered by page writeback blocking.
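
For completeness, the tracepoint setup suggested above boils down to
something like this (standard tracing debugfs paths; the trace is then
frozen by the tracing_off() call in our instrumentation):

  mount -t debugfs none /sys/kernel/debug 2>/dev/null
  cd /sys/kernel/debug/tracing
  for sub in block writeback jbd2 ext4 sched; do
      echo 1 > events/$sub/enable
  done
  echo 1 > tracing_on
  # ...reproduce the hang...
  cat trace > /tmp/hang-trace.txt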


For what it's worth, I'm currently testing a backport of commit b34090e 
from mainline (which in turn required backporting commits e5a120a and 
f5113ef).  It switches from using the BJ_Shadow list to using the 
BH_Shadow flag on the buffer head.  More interestingly, waiters now get 
woken up from journal_end_buffer_io_sync() instead of from 
jbd2_journal_commit_transaction().

So far this seems to be helping a lot.  It's lasted about 15x as long 
under stress as without the patches.
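
In case anyone wants to repeat the experiment, the backport was basically a
cherry-pick of the three commits in mainline order onto our 3.4 tree
(branch name made up; the picks may need some hand-resolution of conflicts
on 3.4):

  git checkout -b jbd2-shadow-backport v3.4.103
  git cherry-pick f5113ef   # jbd2: don't create journal_head for temporary journal buffers
  git cherry-pick e5a120a   # jbd2: remove journal_head from descriptor buffers
  git cherry-pick b34090e   # jbd2: refine waiting for shadow buffers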

Chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 19:11             ` Chris Friesen
@ 2014-10-29 19:26               ` Thomas Gleixner
  2014-10-29 20:17                 ` Chris Friesen
  2014-10-29 23:19                 ` Theodore Ts'o
  0 siblings, 2 replies; 28+ messages in thread
From: Thomas Gleixner @ 2014-10-29 19:26 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, rt-users

On Wed, 29 Oct 2014, Chris Friesen wrote:
> On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
> It seems plausible that the reason why page writeback never completes is that
> it's blocking trying to take inode->i_data_sem for reading, as seen in the
> following stack trace (from a hung system):
> 
> [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
> [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
> [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
> [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
> [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
> [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
> [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
> [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320

Well, the point is that the JBD write out is not completed. The above
is just the consequence. So really looking at ext4 inode write backs
and something stuck on BJ_Shadow or the inode sem is the wrong
place. It's all just caused by the JBD writeout not being completed
for whatever reason.
 
> For what it's worth, I'm currently testing a backport of commit b34090e from
> mainline (which in turn required backporting commits e5a120a and f5113ef).  It
> switches from using the BJ_Shadow list to using the BH_Shadow flag on the
> buffer head.  More interestingly, waiters now get woken up from
> journal_end_buffer_io_sync() instead of from
> jbd2_journal_commit_transaction().
> 
> So far this seems to be helping a lot.  It's lasted about 15x as long under
> stress as without the patches.

I fear that this is just papering over the problem, but you have to
talk to the jbd2 folks about that.

I personally prefer a reasonable explanation for the current behaviour
rather than a magic "solution" to the problem. But that's up to you.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 19:26               ` Thomas Gleixner
@ 2014-10-29 20:17                 ` Chris Friesen
  2014-10-29 20:31                   ` Thomas Gleixner
  2014-10-29 23:19                 ` Theodore Ts'o
  1 sibling, 1 reply; 28+ messages in thread
From: Chris Friesen @ 2014-10-29 20:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, rt-users

On 10/29/2014 01:26 PM, Thomas Gleixner wrote:
> On Wed, 29 Oct 2014, Chris Friesen wrote:
>> On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
>> It seems plausible that the reason why page writeback never completes is that
>> it's blocking trying to take inode->i_data_sem for reading, as seen in the
>> following stack trace (from a hung system):
>>
>> [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
>> [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
>> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
>> [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
>> [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
>> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
>> [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
>> [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
>> [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
>> [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
>> [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
>> [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
>> [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
>
> Well, the point is that the JBD write out is not completed. The above
> is just the consequence. So really looking at ext4 inode write backs
> and something stuck on BJ_Shadow or the inode sem is the wrong
> place. It's all just caused by the JBD writeout not being completed
> for whatever reason.

I'll willingly confess my ignorance of filesystem code before I started 
looking at this issue.  I was under the impression that the above stack 
trace (for the "flush-147:3" task, in this case) was performing the 
write out of the page that had been flagged for writeback by JBD...is 
that not the case?  If not, then could you point me in the right direction?

>> For what it's worth, I'm currently testing a backport of commit b34090e from
>> mainline (which in turn required backporting commits e5a120a and f5113ef).  It
>> switches from using the BJ_Shadow list to using the BH_Shadow flag on the
>> buffer head.  More interestingly, waiters now get woken up from
>> journal_end_buffer_io_sync() instead of from
>> jbd2_journal_commit_transaction().
>>
>> So far this seems to be helping a lot.  It's lasted about 15x as long under
>> stress as without the patches.
>
> I fear that this is just papering over the problem, but you have to
> talk to the jbd2 folks about that.

They're on the CC list, hopefully they'll chime in...

> I personally prefer a reasonable explanation for the current behaviour
> rather than a magic "solution" to the problem. But that's up to you.

Well sure...but if nothing else it helps to point to a possible cause. 
I'm currently looking at the locking implications of the patch to see if 
it completely closes the race window or just narrows it.

Chris


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 20:17                 ` Chris Friesen
@ 2014-10-29 20:31                   ` Thomas Gleixner
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2014-10-29 20:31 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, tytso,
	adilger.kernel, rt-users

On Wed, 29 Oct 2014, Chris Friesen wrote:
> On 10/29/2014 01:26 PM, Thomas Gleixner wrote:
> > On Wed, 29 Oct 2014, Chris Friesen wrote:
> > > On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
> > > It seems plausible that the reason why page writeback never completes is
> > > that
> > > it's blocking trying to take inode->i_data_sem for reading, as seen in the
> > > following stack trace (from a hung system):
> > > 
> > > [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
> > > [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
> > > [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> > > [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
> > > [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
> > > [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> > > [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
> > > [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
> > > [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
> > > [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
> > > [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
> > > [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
> > > [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
> > 
> > Well, the point is that the JBD write out is not completed. The above
> > is just the consequence. So really looking at ext4 inode write backs
> > and something stuck on BJ_Shadow or the inode sem is the wrong
> > place. It's all just caused by the JBD writeout not being completed
> > for whatever reason.
> 
> I'll willingly confess my ignorance of filesystem code before I started
> looking at this issue.  I was under the impression that the above stack trace
> (for the "flush-147:3" task, in this case) was performing the write out of the
> page that had been flagged for writeback by JBD...is that not the case?  If
> not, then could you point me in the right direction?

jbd2_journal_commit_transaction
    journal_submit_data_buffers
        journal_submit_inode_data_buffers
            generic_writepages
                set_page_writeback(page) [PG_writeback]

generic_writepages() is issuing the writeout to the block layer and
the underlying physical device.

    journal_finish_inode_data_buffers
        filemap_fdatawait
            filemap_fdatawait_range
                wait_on_page_writeback(page)
                    wait_on_page_bit(page, PG_writeback) <--stuck here

That part waits for the IO completion. And because that thing is
stuck everything else gets stuck as well.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 19:26               ` Thomas Gleixner
  2014-10-29 20:17                 ` Chris Friesen
@ 2014-10-29 23:19                 ` Theodore Ts'o
  2014-10-29 23:37                   ` Chris Friesen
  2014-10-30 21:11                   ` Thomas Gleixner
  1 sibling, 2 replies; 28+ messages in thread
From: Theodore Ts'o @ 2014-10-29 23:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Friesen, Austin Schuh, pavel, J. Bruce Fields, linux-ext4,
	adilger.kernel, rt-users

On Wed, Oct 29, 2014 at 08:26:36PM +0100, Thomas Gleixner wrote:
> > For what it's worth, I'm currently testing a backport of commit b34090e from
> > mainline (which in turn required backporting commits e5a120a and f5113ef).  It
> > switches from using the BJ_Shadow list to using the BH_Shadow flag on the
> > buffer head.  More interestingly, waiters now get woken up from
> > journal_end_buffer_io_sync() instead of from
> > jbd2_journal_commit_transaction().
> > 
> > So far this seems to be helping a lot.  It's lasted about 15x as long under
> > stress as without the patches.
> 
> I fear that this is just papering over the problem, but you have to
> talk to the jbd2 folks about that.

No, it's a clean fix for the problem.  The main issue is that what the
jbd2 commit was doing was starting inode writeback for those blocks
needed to guarantee data=ordered mode (so this is what caused various
pages to have writeback page set) as well as starting metadata writes
to the commit (which is what caused the shadow bit to be set on the
metadata buffers). 

Now that we clear the shadow flag when the metadata writes are
complete, the writeback will eventually be allowed to complete and
this prevents the deadlock.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 23:19                 ` Theodore Ts'o
@ 2014-10-29 23:37                   ` Chris Friesen
  2014-10-30  1:44                     ` Theodore Ts'o
  2014-10-30 21:11                   ` Thomas Gleixner
  1 sibling, 1 reply; 28+ messages in thread
From: Chris Friesen @ 2014-10-29 23:37 UTC (permalink / raw)
  To: Theodore Ts'o, Thomas Gleixner
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, adilger.kernel,
	rt-users

On 10/29/2014 05:19 PM, Theodore Ts'o wrote:
> On Wed, Oct 29, 2014 at 08:26:36PM +0100, Thomas Gleixner wrote:
>>> For what it's worth, I'm currently testing a backport of commit b34090e from
>>> mainline (which in turn required backporting commits e5a120a and f5113ef).  It
>>> switches from using the BJ_Shadow list to using the BH_Shadow flag on the
>>> buffer head.  More interestingly, waiters now get woken up from
>>> journal_end_buffer_io_sync() instead of from
>>> jbd2_journal_commit_transaction().
>>>
>>> So far this seems to be helping a lot.  It's lasted about 15x as long under
>>> stress as without the patches.
>>
>> I fear that this is just papering over the problem, but you have to
>> talk to the jbd2 folks about that.
>
> No, it's a clean fix for the problem.  The main issue is that what the
> jbd2 commit was doing was starting inode writeback for those blocks
> needed to guarantee data=ordered mode (so this is what caused various
> pages to have writeback page set) as well as starting metadata writes
> to the commit (which is what caused the shadow bit to be set on the
> metadata buffers).
>
> Now that we clear the shadow flag when the metadata writes are
> complete, the writeback will eventually be allowed to complete and
> this prevents the deadlock.

Thanks for the explanation.

A few questions:

1) Is this something that could hit mainline as well, or just the RT kernel?

2) If it can hit mainline, is this something that should be considered 
for the various longterm-support kernels?  (3.10, maybe 3.4)

3) For 3.4, do you think that it's sufficient to backport the three 
commits I mentioned, or are you aware of others that I should be looking 
at as well?

Chris


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 23:37                   ` Chris Friesen
@ 2014-10-30  1:44                     ` Theodore Ts'o
  2014-10-30  8:15                       ` Kevin Liao
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Ts'o @ 2014-10-30  1:44 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Thomas Gleixner, Austin Schuh, pavel, J. Bruce Fields,
	linux-ext4, adilger.kernel, rt-users

On Wed, Oct 29, 2014 at 05:37:03PM -0600, Chris Friesen wrote:
> 
> 1) Is this something that could hit mainline as well, or just the RT kernel?

Yes, it could hit the mainline as well.  It's more about the workload
and how the files are written to and/or sync'ed.

> 2) If it can hit mainline, is this something that should be considered for
> the various longterm-support kernels?  (3.10, maybe 3.4)

Yes, probably.

> 3) For 3.4, do you think that it's sufficient to backport the three commits
> I mentioned, or are you aware of others that I should be looking at as well?

No, those three commits would be quite sufficient.  If you could
handle doing the backports, I would greatly appreciate it!  Sounds
like you've done so for 3.4 already, and it should be fairly
straightforward to get them into 3.10, since b34090e landed in 3.11.

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-30  1:44                     ` Theodore Ts'o
@ 2014-10-30  8:15                       ` Kevin Liao
  2014-10-30 12:24                         ` Theodore Ts'o
  0 siblings, 1 reply; 28+ messages in thread
From: Kevin Liao @ 2014-10-30  8:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Chris Friesen, Thomas Gleixner, Austin Schuh, pavel,
	J. Bruce Fields, Ext4 Developers List, adilger.kernel, rt-users

2014-10-30 9:44 GMT+08:00 Theodore Ts'o <tytso@mit.edu>:
> On Wed, Oct 29, 2014 at 05:37:03PM -0600, Chris Friesen wrote:
>> 3) For 3.4, do you think that it's sufficient to backport the three commits
>> I mentioned, or are you aware of others that I should be looking at as well?
>
> No, those three commits would be quite sufficient.  If you could
> handle doing the backports, I would greatly appreciate it!  Sounds
> like you've done so for 3.4 already, and it should be fairly
> straightforward to get them into 3.10, since b34090e landed in 3.11.
>

Hi Ted and Chris,

I also have a similar jbd2 hang problem with kernel 3.4. Could you kindly
specify the three commits that fix the issue? I cannot find them using
the keyword "commit b34090e e5a120a f5113ef".

Thanks a lot for your help.

Kevin

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-30  8:15                       ` Kevin Liao
@ 2014-10-30 12:24                         ` Theodore Ts'o
  0 siblings, 0 replies; 28+ messages in thread
From: Theodore Ts'o @ 2014-10-30 12:24 UTC (permalink / raw)
  To: Kevin Liao
  Cc: Chris Friesen, Thomas Gleixner, Austin Schuh, pavel,
	J. Bruce Fields, Ext4 Developers List, adilger.kernel, rt-users

On Thu, Oct 30, 2014 at 04:15:45PM +0800, Kevin Liao wrote:
> 
> I also have similar jbd2 hang problem for kernel 3.4. Could you kindly
> specify what are the three commits to fix the issue? Cause I can not
> find them using the keyword "commit b34090e e5a120a f5113ef".

Those *are* the three commits:

% git log -3 --oneline  b34090e 
b34090e jbd2: refine waiting for shadow buffers
e5a120a jbd2: remove journal_head from descriptor buffers
f5113ef jbd2: don't create journal_head for temporary journal buffers

Or if you need the full git commit ID's

% git log --pretty=oneline -3  b34090e 
b34090e5e22a02fba0e4473056cce9420ad9dd0b jbd2: refine waiting for shadow buffers
e5a120aeb57f40ae568a5ca1dd6ace53d0213582 jbd2: remove journal_head from descriptor buffers
f5113effc2a2ee6b86a4b345ce557353dcbcfffe jbd2: don't create journal_head for temporary journal buffers

Or the URL's:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b34090e5e22a02fba0e4473056cce9420ad9dd0b
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e5a120aeb57f40ae568a5ca1dd6ace53d0213582
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f5113effc2a2ee6b86a4b345ce557353dcbcfffe

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-29 23:19                 ` Theodore Ts'o
  2014-10-29 23:37                   ` Chris Friesen
@ 2014-10-30 21:11                   ` Thomas Gleixner
  2014-10-30 23:24                     ` Theodore Ts'o
  1 sibling, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2014-10-30 21:11 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Chris Friesen, Austin Schuh, pavel, J. Bruce Fields, linux-ext4,
	adilger.kernel, rt-users

On Wed, 29 Oct 2014, Theodore Ts'o wrote:
> On Wed, Oct 29, 2014 at 08:26:36PM +0100, Thomas Gleixner wrote:
> > > For what it's worth, I'm currently testing a backport of commit b34090e from
> > > mainline (which in turn required backporting commits e5a120a and f5113ef).  It
> > > switches from using the BJ_Shadow list to using the BH_Shadow flag on the
> > > buffer head.  More interestingly, waiters now get woken up from
> > > journal_end_buffer_io_sync() instead of from
> > > jbd2_journal_commit_transaction().
> > > 
> > > So far this seems to be helping a lot.  It's lasted about 15x as long under
> > > stress as without the patches.
> > 
> > I fear that this is just papering over the problem, but you have to
> > talk to the jbd2 folks about that.
> 
> No, it's a clean fix for the problem.  The main issue is that what the
> jbd2 commit was doing was starting inode writeback for those blocks
> needed to guarantee data=ordered mode (so this is what caused various
> pages to have writeback page set) as well as starting metadata writes
> to the commit (which is what caused the shadow bit to be set on the
> metadata buffers). 
> 
> > Now that we clear the shadow flag when the metadata writes are
> complete, the writeback will eventually be allowed to complete and
> this prevents the deadlock.

That's a way better explanation than what I saw in the commit logs and
it actually maps to the observed traces and stackdumps.

Thanks for the clarification! I'm just getting nervous when 'picked
some backports' magically 'fixes' an issue without a proper
explanation.

Thanks,

       tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-30 21:11                   ` Thomas Gleixner
@ 2014-10-30 23:24                     ` Theodore Ts'o
  2014-10-31  0:08                       ` Chris Friesen
                                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Theodore Ts'o @ 2014-10-30 23:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Friesen, Austin Schuh, pavel, J. Bruce Fields, linux-ext4,
	adilger.kernel, rt-users

On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
> 
> That's a way better explanation than what I saw in the commit logs and
> it actually maps to the observed traces and stackdumps.

I can't speak for Jan, but I suspect he didn't realize that there was
a problem.  The commit description in b34090e5e2 makes it clear that
the intent was a performance improvement, and not an attempt to fix a
potential deadlock bug.

Looking at the commit history, the problem was introduced in 2.6.27
(July 2008), in commit c851ed54017373, so this problem wasn't noticed
in the RHEL 6 and RHEL 7 enterprise linux QA runs, and it wasn't
noticed in all of the regression testing that we've been doing.

I've certainly seen this before.  Two years ago we found a bug that
was only noticed when we deployed ext4 in production at Google, and
stress tested it at Google scale with the appropriate monitoring
systems so we could find a bug that had existed since the very
beginning of ext3, and which had never been noticed in all of the
enterprise testing done by Red Hat, SuSE, IBM, HP, etc.  Actually, it
probably was noticed, but never in a reproducible way, and so it was
probably written off as some kind of flaky hardware induced
corruption.

The difference is that in this case, it seems that Chris and Kevin were
able to reproduce the problem reliably.  (It also might be that the RT
patch kit widens the race window and makes it much more likely to
trigger.)  Chris or Kevin, if you have time to try to create a
reliable repro that is small/simple enough that we could propose it as
a new test to add to xfstests, that would be great.  If you can't,
that's completely understandable.

In the case I described above, it was an extremely hard to hit race
that only happened under high memory pressure, so we were never able to
create a reliable repro.  Instead we had a theory that was consistent with
the pattern of metadata corruption we were seeing, deployed a kernel with
the fix, and after a few weeks were able to conclude we had finally
fixed the bug.  Welcome to file system debugging.  :-)

> Thanks for the clarification! I'm just getting nervous when 'picked
> some backports' magically 'fixes' an issue without a proper
> explanation.

Well, thanks to Chris for pointing out that b34090e5 seemed to make
the problem go away.  Once I looked at what that patch changed, it was
a lot more obvious what might have been going wrong.  It's always
helpful if you can peek at the answer key, even if it's only a
potential answer key.  :-)

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-30 23:24                     ` Theodore Ts'o
@ 2014-10-31  0:08                       ` Chris Friesen
  2014-10-31  0:16                       ` Thomas Gleixner
  2014-11-13 19:06                       ` Jan Kara
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Friesen @ 2014-10-31  0:08 UTC (permalink / raw)
  To: Theodore Ts'o, Thomas Gleixner
  Cc: Austin Schuh, pavel, J. Bruce Fields, linux-ext4, adilger.kernel,
	rt-users

On 10/30/2014 05:24 PM, Theodore Ts'o wrote:
> On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
>>
>> That's a way better explanation than what I saw in the commit logs and
>> it actually maps to the observed traces and stackdumps.
>
> I can't speak for Jan, but I suspect he didn't realize that there was
> a problem.  The commit description in b34090e5e2 makes it clear that
> the intent was a performance improvement, and not an attempt to fix a
> potential deadlock bug.
>
> Looking at the commit history, the problem was introduced in 2.6.27
> (July 2008), in commit c851ed54017373, so this problem wasn't noticed
> in the RHEL 6 and RHEL 7 enterprise linux QA runs, and it wasn't
> noticed in all of the regression testing that we've been doing.
>
> I've certainly seen this before.  Two years ago we found a bug that
> was only noticed when we deployed ext4 in production at Google, and
> stress tested it at Google scale with the appropriate monitoring
> systems so we could find a bug that had existed since the very
> beginning of ext3, and which had never been noticed in all of the
> enterprise testing done by Red Hat, SuSE, IBM, HP, etc.  Actually, it
> probably was noticed, but never in a reproducible way, and so it was
> probably written off as some kind of flaky hardware induced
> corruption.
>
> The difference is that in this case, it seems that Chris and Kevin were
> able to reproduce the problem reliably.  (It also might be that the RT
> patch kit widens the race window and makes it much more likely to
> trigger.)  Chris or Kevin, if you have time to try to create a
> reliable repro that is small/simple enough that we could propose it as
> a new test to add to xfstests, that would be great.  If you can't,
> that's completely understandable.

It appears that EXT4_I(inode)->i_data_sem is involved, so I wonder if it 
might have something to do with the fact that the RT patches modify the 
reader-writer semaphores so that the read-side is exclusive?
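
If I'm reading the -rt substitution correctly, the rwsem more or less turns
into the following (simplified from memory, not the exact -rt code):

	/* PREEMPT_RT_FULL variant of struct rw_semaphore */
	struct rw_semaphore {
		struct rt_mutex	lock;		/* one owner at a time, even for readers */
		int		read_depth;	/* read recursion for that single owner */
	};

	void rt_down_read(struct rw_semaphore *sem)
	{
		/* a second reader blocks here just as a writer would */
		rt_mutex_lock(&sem->lock);
		sem->read_depth++;
	}

so multiple tasks doing down_read() on i_data_sem serialize against each
other instead of proceeding in parallel, which presumably changes the
timing quite a bit compared to mainline.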

I suspect I won't have time to isolate a useful testcase, unfortunately.

For what it's worth, we initially discovered the problem when copying 
large (10GB) files from an NFS client onto an NFS-mounted ext4 
filesystem that was mounted with "noatime,nodiratime,data=ordered". 
Initially it failed quite reliably, then something in our environment 
changed and it became more intermittent (could take several hours of 
stress test to reproduce).

We discovered somewhat by accident that we could more reliably reproduce 
it running on a pair of VirtualBox VMs.  The server exported the 
filesystem as per above, and on the client I just used dd to copy from 
/dev/zero to the NFS-mounted filesystem.  Generally it would hang before 
copying 5GB of data.
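
Roughly, the reproducer on the VM pair looks like this (host name and
paths made up; the export is just the ext4-on-LVM filesystem mounted with
noatime,nodiratime,data=ordered as above):

  # on the NFS client VM
  mount -t nfs server:/export/test /mnt/test
  dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=10240

Usually the dd stalls, and the server's nfsd threads go into D state, well
before 5GB of the 10GB has been copied.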

Chris



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-30 23:24                     ` Theodore Ts'o
  2014-10-31  0:08                       ` Chris Friesen
@ 2014-10-31  0:16                       ` Thomas Gleixner
  2014-11-13 19:06                       ` Jan Kara
  2 siblings, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2014-10-31  0:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Chris Friesen, Austin Schuh, pavel, J. Bruce Fields, linux-ext4,
	adilger.kernel, rt-users

Ted,

On Thu, 30 Oct 2014, Theodore Ts'o wrote:
> On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
> > 
> > That's a way better explanation than what I saw in the commit logs and
> > it actually maps to the observed traces and stackdumps.
> 
> I can't speak for Jan, but I suspect he didn't realize that there was
> a problem.  The commit description in b34090e5e2 makes it clear that
> the intent was a performance improvement, and not an attempt to fix a
> potential deadlock bug.

Indeed. The changelog does not look like a bug fix. That's why I was
worried about Chris' 'magic resolution' report.
 
> The difference is that in this case, it seems that Chris and Kevin were
> able to reproduce the problem reliably.  (It also might be that the RT
> patch kit widens the race window and makes it much more likely to
> trigger.)

We've seen that before. RT indeed makes race conditions more prominent
by opening way more race windows than a mainline kernel ever does.

But there is a caveat. With the recent per-cpu assumptions RT might
very well run into some RT-specific pitfalls that are not relevant to
mainline at all. And that's becoming worse with every day that RT is out of
tree. But that's a totally different story.

Thanks again for your profound analysis of the issue at hand!

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RT/ext4/jbd2 circular dependency
  2014-10-30 23:24                     ` Theodore Ts'o
  2014-10-31  0:08                       ` Chris Friesen
  2014-10-31  0:16                       ` Thomas Gleixner
@ 2014-11-13 19:06                       ` Jan Kara
  2 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2014-11-13 19:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Thomas Gleixner, Chris Friesen, Austin Schuh, pavel,
	J. Bruce Fields, linux-ext4, adilger.kernel, rt-users

On Thu 30-10-14 19:24:37, Ted Tso wrote:
> On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
> > 
> > That's a way better explanation than what I saw in the commit logs and
> > it actually maps to the observed traces and stackdumps.
> 
> I can't speak for Jan, but I suspect he didn't realize that there was
> a problem.  The commit description in b34090e5e2 makes it clear that
> the intent was a performance improvement, and not an attempt to fix a
> potential deadlock bug.
  Yeah, correct. When I wrote b34090e5e2 I did that because I had realized
that journal reservations would create a similar deadlock opportunity to the
one Chris has observed. I hadn't realized the problem was already in the code :)

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-11-13 19:06 UTC | newest]

Thread overview: 28+ messages
2014-10-17 17:50 Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D Chris Friesen
2014-10-17 17:50 ` Chris Friesen
2014-10-17 18:01 ` Pavel Vasilyev
     [not found]   ` <CANGgnMbQmsdMDJUx7Bop9Xs=jQMmAJgWRjhXVFUGx-DwF=inYw@mail.gmail.com>
2014-10-23 17:54     ` RT/ext4/jbd2 circular dependency (was: Re: Hang writing to nfs-mounted filesystem from client) Chris Friesen
2014-10-26 14:25       ` Thomas Gleixner
2014-10-27 16:22         ` RT/ext4/jbd2 circular dependency Chris Friesen
2014-10-29 18:05           ` Thomas Gleixner
2014-10-29 19:11             ` Chris Friesen
2014-10-29 19:26               ` Thomas Gleixner
2014-10-29 20:17                 ` Chris Friesen
2014-10-29 20:31                   ` Thomas Gleixner
2014-10-29 23:19                 ` Theodore Ts'o
2014-10-29 23:37                   ` Chris Friesen
2014-10-30  1:44                     ` Theodore Ts'o
2014-10-30  8:15                       ` Kevin Liao
2014-10-30 12:24                         ` Theodore Ts'o
2014-10-30 21:11                   ` Thomas Gleixner
2014-10-30 23:24                     ` Theodore Ts'o
2014-10-31  0:08                       ` Chris Friesen
2014-10-31  0:16                       ` Thomas Gleixner
2014-11-13 19:06                       ` Jan Kara
2014-10-27 19:57       ` Chris Friesen
     [not found] ` <544156FE.7070905-CWA4WttNNZF54TAoqtyWWQ@public.gmane.org>
2014-10-17 18:58   ` Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D Austin Schuh
2014-10-17 18:58     ` Austin Schuh
2014-10-17 19:12   ` Dmitry Monakhov
2014-10-17 19:12     ` Dmitry Monakhov
2014-10-18 17:05   ` Hang writing to nfs-mounted filesystem from client -- expected code path? Chris Friesen
2014-10-18 17:05     ` Chris Friesen
