* bug #917 - deadlock on log recovery
@ 2012-03-22 17:34 Kirill Malkin
  2012-03-30 16:06 ` Christoph Hellwig
  0 siblings, 1 reply; 3+ messages in thread
From: Kirill Malkin @ 2012-03-22 17:34 UTC
  To: xfs, xfs-masters

Hi,

I am wondering if someone had a chance to look at the bug #917. I
filed it a couple of weeks ago, but haven’t seen any action. We are
running into it quite a lot, and the only way out of it is to reboot
the OS and drop the log. Below is another stack trace that is slightly
different from the one I filed, but apparently it is the same bug.

Please let me know if you need any other input.

Thanks!
Kirill

[1185916.684850] mount         D ffff8808edc989c0     0  6978      1 0x00000000
[1185916.684853]  ffff8802433632f8 0000000000000086 0000000000000000 0000000000000000
[1185916.684856]  000000000000e488 ffff880243363fd8 ffff880443636280 ffff880c3cdd8180
[1185916.684860]  ffff880443636608 000000063b49a400 0000000111a61ff6 ffff88065848e488
[1185916.684863] Call Trace:
[1185916.684866]  [<ffffffff8150d44b>] ? dm_any_congested+0x6b/0x90
[1185916.684869]  [<ffffffff816766ed>] schedule_timeout+0x1dd/0x260
[1185916.684871]  [<ffffffff8150bfea>] ? dm_get_live_table+0x4a/0x60
[1185916.684874]  [<ffffffff8167759e>] __down+0x6e/0xb0
[1185916.684877]  [<ffffffff8135aad5>] ? _xfs_buf_find+0x145/0x280
[1185916.684879]  [<ffffffff8108318c>] down+0x4c/0x50
[1185916.684882]  [<ffffffff813599d0>] xfs_buf_lock+0x60/0xd0
[1185916.684884]  [<ffffffff8135aad5>] _xfs_buf_find+0x145/0x280
[1185916.684887]  [<ffffffff8135ac71>] xfs_buf_get+0x61/0x1c0
[1185916.684890]  [<ffffffff8134fe6b>] xfs_trans_get_buf+0x13b/0x1c0
[1185916.684895]  [<ffffffff8131cf94>] xfs_btree_get_buf_block+0x54/0x80
[1185916.684898]  [<ffffffff813206a4>] xfs_btree_split+0x114/0x6a0
[1185916.684900]  [<ffffffff8131e995>] ? xfs_btree_rshift+0x75/0x530
[1185916.684903]  [<ffffffff8131d89d>] ? xfs_btree_lshift+0x7d/0x5f0
[1185916.684906]  [<ffffffff81321151>] xfs_btree_make_block_unfull+0x151/0x190
[1185916.684909]  [<ffffffff8132152c>] xfs_btree_insrec+0x39c/0x5b0
[1185916.684911]  [<ffffffff8131dec7>] ? xfs_btree_lookup_get_block+0xb7/0xf0
[1185916.684915]  [<ffffffff8131be72>] ? xfs_btree_rec_addr+0x12/0x20
[1185916.684917]  [<ffffffff8131c0d8>] ? xfs_lookup_get_search_key+0x58/0x60
[1185916.684920]  [<ffffffff813217c6>] xfs_btree_insert+0x86/0x180
[1185916.684925]  [<ffffffff81306d01>] xfs_free_ag_extent+0x4f1/0x7a0
[1185916.684928]  [<ffffffff81308850>] xfs_alloc_fix_freelist+0x120/0x490
[1185916.684931]  [<ffffffff81342306>] ? xlog_regrant_write_log_space+0x1e6/0x590
[1185916.684934]  [<ffffffff81308c3c>] xfs_free_extent+0x7c/0xc0
[1185916.684938]  [<ffffffff81312aa5>] xfs_bmap_finish+0x165/0x1b0
[1185916.684942]  [<ffffffff81339065>] xfs_itruncate_finish+0x195/0x370
[1185916.684945]  [<ffffffff8135526e>] xfs_inactive+0x3be/0x4e0
[1185916.684948]  [<ffffffff8134f9f7>] ? xfs_trans_read_buf+0x217/0x410
[1185916.684951]  [<ffffffff813616bd>] xfs_fs_clear_inode+0x9d/0xe0
[1185916.684954]  [<ffffffff8114553e>] clear_inode+0x7e/0x100
[1185916.684957]  [<ffffffff81145cc6>] generic_delete_inode+0x186/0x1c0
[1185916.684959]  [<ffffffff81145d65>] generic_drop_inode+0x65/0x90
[1185916.684961]  [<ffffffff81144892>] iput+0x62/0x70
[1185916.684964]  [<ffffffff813471c9>] xlog_recover_process_one_iunlink+0x169/0x180
[1185916.684967]  [<ffffffff810830ca>] ? up+0x3a/0x50
[1185916.684969]  [<ffffffff81347287>] xlog_recover_process_iunlinks+0xa7/0x130
[1185916.684972]  [<ffffffff81347354>] xlog_recover_finish+0x44/0xd0
[1185916.684975]  [<ffffffff813403fc>] xfs_log_mount_finish+0x2c/0x40
[1185916.684978]  [<ffffffff8134b03a>] xfs_mountfs+0x48a/0x6f0
[1185916.684981]  [<ffffffff81356003>] ? kmem_zalloc+0x33/0x50
[1185916.684984]  [<ffffffff8134badb>] ? xfs_mru_cache_create+0x13b/0x170
[1185916.684987]  [<ffffffff813631b5>] xfs_fs_fill_super+0x245/0x3a0
[1185916.684990]  [<ffffffff8112e31c>] get_sb_bdev+0x17c/0x1e0
[1185916.684992]  [<ffffffff810f9a61>] ? kstrdup+0x41/0x70
[1185916.684995]  [<ffffffff81362f70>] ? xfs_fs_fill_super+0x0/0x3a0
[1185916.684998]  [<ffffffff813612f8>] xfs_fs_get_sb+0x18/0x20
[1185916.685000]  [<ffffffff8112cc9c>] vfs_kern_mount+0x5c/0xf0
[1185916.685002]  [<ffffffff8112cda3>] do_kern_mount+0x53/0x120
[1185916.685005]  [<ffffffff8114b80a>] do_mount+0x26a/0x8c0
[1185916.685008]  [<ffffffff8114bf1b>] sys_mount+0xbb/0xf0
[1185916.685011]  [<ffffffff8100c15b>] system_call_fastpath+0x16/0x1b
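
When reading a trace like the one above, it helps to separate the frames the unwinder verified from the ones marked with "? ", which are only return addresses found on the stack and may be stale. The following is a minimal sketch of such a filter, assuming the line format shown above; frame_is_reliable is a hypothetical helper name, not an existing tool. A frame that the mail client wrapped onto two lines should be rejoined before it is passed in.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if a "Call Trace" line names a frame that is not
 * prefixed with "? ", i.e. one the unwinder considered reliable.
 * Lines that do not contain a "[<address>] symbol" pair return false.
 * (Hypothetical helper for illustration only.)
 */
static bool frame_is_reliable(const char *line)
{
    const char *addr_end;

    assert(line != NULL);

    /* The symbol name starts right after the "[<address>] " field. */
    addr_end = strstr(line, ">] ");
    if (addr_end == NULL)
        return false;            /* header, register dump, or wrapped line */

    /* Unverified frames have '?' as the first character after the field. */
    return addr_end[3] != '?';
}
```

Applied to this trace, the frames that remain form a single chain from sys_mount down through xlog_recover_process_iunlinks, xfs_alloc_fix_freelist and xfs_btree_split to xfs_buf_lock and down.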

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: bug #917 - deadlock on log recovery
  2012-03-22 17:34 bug #917 - deadlock on log recovery Kirill Malkin
@ 2012-03-30 16:06 ` Christoph Hellwig
  2012-03-30 16:44   ` Kirill Malkin
  0 siblings, 1 reply; 3+ messages in thread
From: Christoph Hellwig @ 2012-03-30 16:06 UTC
  To: Kirill Malkin; +Cc: xfs-masters, xfs

On Thu, Mar 22, 2012 at 01:34:00PM -0400, Kirill Malkin wrote:
> Hi,
> 
> I am wondering if someone had a chance to look at the bug #917. I
> filed it a couple of weeks ago, but haven't seen any action. We are
> running into it quite a lot, and the only way out of it is to reboot
> the OS and drop the log. Below is another stack trace that is slightly
> different from the one I filed, but apparently it is the same bug.
> 
> Please let me know if you need any other input.

Can you reproduce this with a recent kernel?  2.6.32 is fairly old
and a lot of things have changed in this area.  I quickly looked
over the trace and nothing obvious springs to mind.


* RE: bug #917 - deadlock on log recovery
  2012-03-30 16:06 ` Christoph Hellwig
@ 2012-03-30 16:44   ` Kirill Malkin
  0 siblings, 0 replies; 3+ messages in thread
From: Kirill Malkin @ 2012-03-30 16:44 UTC
  To: Christoph Hellwig; +Cc: xfs-masters, xfs

Christoph -

Thank you for getting back to me. The kernel I am using is not a vanilla
kernel.org 2.6.32, but is part of the RHEL/CentOS 6 distribution, which
has many bug fixes backported, at least up until 2.6.38 or so.
Technically, it's their latest kernel.

The bug is very difficult to reproduce even on this kernel. It occurs
while mounting a snapshot of a very large (40TB) filesystem that is in
very active, continuous use. Once the filesystem snapshot is in that
state, the hang is 100% reproducible (i.e. on every mount), but it's not
clear what pushes it there. Unfortunately, a kernel upgrade on that system
is currently not possible.

Note that the lockup occurs while trimming the free list in
xfs_alloc.c:xfs_alloc_fix_freelist when it's too long (look for the "Make
the freelist shorter if it's too long" comment inside this function). Then,
for some reason, the buffer gets double-locked inside xfs_btree_get_bufs,
and the mount hangs forever. I suspect that we are not seeing this more
frequently because free list trimming is not a typical occurrence
during recovery.
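
As a rough illustration of how one thread can hang with no concurrency at all: xfs_buf_lock in the trace ends in down(), and a kernel semaphore does not record which task holds it, so a second acquire by the same task simply sleeps. The sketch below is a userspace toy model under that assumption, not XFS code. struct toy_buf and its helpers are hypothetical names, and a non-blocking trylock stands in for down() so that the example terminates instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-in for a buffer guarded by a binary semaphore.
 * Hypothetical type for illustration; not the kernel's struct xfs_buf.
 */
struct toy_buf {
    int sema;   /* 1 = unlocked, 0 = locked; no owner is recorded */
};

/*
 * Non-blocking acquire. Returns true on success, false where a
 * blocking down() would have to go to sleep.
 */
static bool toy_buf_trylock(struct toy_buf *bp)
{
    if (bp->sema == 0)
        return false;
    bp->sema = 0;
    return true;
}

/* Release the buffer; it must currently be locked. */
static void toy_buf_unlock(struct toy_buf *bp)
{
    assert(bp->sema == 0);
    bp->sema = 1;
}

/*
 * Simulates a single thread locking the same buffer twice.
 * Returns true if the second attempt would sleep forever had it
 * been made with a blocking lock.
 */
static bool relock_would_hang(void)
{
    struct toy_buf buf = { .sema = 1 };
    bool first = toy_buf_trylock(&buf);   /* succeeds */
    bool second = toy_buf_trylock(&buf);  /* fails: already held by us */

    toy_buf_unlock(&buf);
    return first && !second;
}
```

In this model nothing else ever touches the buffer, yet the second acquire can never succeed, because the only task able to release the lock is the one waiting for it. Whether that is what actually happens in xfs_alloc_fix_freelist is still only my suspicion.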

I've looked through the patches to the XFS stack in the kernel.org git
tree and found virtually no changes to this particular area, nor any
references to something similar. I can probably do more research into it,
but would really appreciate some guidance. Would it help to obtain the
metadata backup from that system? What could possibly cause a deadlock
when the log recovery really has no concurrency? Would it help to debug
this by somehow forcing free list trimming during the recovery?

Thanks again for your help.

Kirill

