linux-btrfs.vger.kernel.org archive mirror
* 3.15.0-rc5: btrfs and sync deadlock: call_rwsem_down_read_failed
@ 2014-05-22  9:09 Marc MERLIN
  2014-05-22 13:15 ` 3.15.0-rc5: btrfs and sync deadlock: call_rwsem_down_read_failed / balance seems to create locks that block everything else Marc MERLIN
  0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2014-05-22  9:09 UTC (permalink / raw)
  To: linux-btrfs

I got my laptop to hang all IO to one of its devices again, this time
drive #2.
This is the 3rd time this has happened, and I've already lost data as a
result, since anything that hasn't hit disk by then never makes it.

I was doing balance and btrfs send/receive.
Then cron started a scrub in the background too.

IO to drive #1 was working fine, I didn't even notice that drive #2 IO
was hung.

And then I typed sync and it never returned.

legolas:~# ps -eo pid,user,args,wchan  | grep  sync
23605 root     sync                        call_rwsem_down_read_failed
31885 root     sync                        call_rwsem_down_read_failed

What does this mean when sync is stuck that way?
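
(A note for next time, since wchan only shows the function the task is
sleeping in: assuming this kernel has CONFIG_STACKTRACE so /proc/<pid>/stack
exists, the full kernel stack of the stuck sync should show which lock it
is actually waiting on, e.g.:

# cat /proc/23605/stack
)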

When I'm in that state, accessing btrfs on drive 1 still works (read and
write).
Any access to drive 2 through btrfs hangs.

Both block devices still work.
legolas:~# dd if=/dev/sda of=/dev/null bs=1M 
2593128448 bytes (2.6 GB) copied, 6.47656 s, 400 MB/s

legolas:~# dd if=/dev/sdb of=/dev/null bs=1M 
148897792 bytes (149 MB) copied, 7.99576 s, 18.6 MB/s

So at least it shows that I don't have a hardware problem, right?
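
(For completeness, drive health could also be double-checked with
smartmontools, assuming it's installed; just a sketch, not something I ran
in this state:

# smartctl -a /dev/sda
# smartctl -a /dev/sdb
)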

After reboot, most of the data written to drive 1 made it, so at least
sync worked there.

How can I confirm that it is btrfs deadlocking and not something else in
the kernel?
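
(What I'll try next time, sketched here rather than captured from this
hang: dump all blocked tasks with sysrq-w and see whether their stacks sit
inside btrfs functions or somewhere else:

# echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is allowed
# echo w > /proc/sysrq-trigger       # dump uninterruptible (D state) tasks
# dmesg | tail -n 300                # the traces end up in the kernel log
)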
 
The state of btrfs is:
legolas:~# ps -eo pid,user,args,wchan  | grep  btrfs
  527 root     [btrfs-worker]              rescuer_thread
  528 root     [btrfs-worker-hi]           rescuer_thread
  529 root     [btrfs-delalloc]            rescuer_thread
  530 root     [btrfs-flush_del]           rescuer_thread
  531 root     [btrfs-cache]               rescuer_thread
  532 root     [btrfs-submit]              rescuer_thread
  533 root     [btrfs-fixup]               rescuer_thread
  534 root     [btrfs-endio]               rescuer_thread
  535 root     [btrfs-endio-met]           rescuer_thread
  536 root     [btrfs-endio-met]           rescuer_thread
  537 root     [btrfs-endio-rai]           rescuer_thread
  538 root     [btrfs-rmw]                 rescuer_thread
  539 root     [btrfs-endio-wri]           rescuer_thread
  540 root     [btrfs-freespace]           rescuer_thread
  541 root     [btrfs-delayed-m]           rescuer_thread
  542 root     [btrfs-readahead]           rescuer_thread
  543 root     [btrfs-qgroup-re]           rescuer_thread
  544 root     [btrfs-cleaner]             cleaner_kthread
  545 root     [btrfs-transacti]           transaction_kthread
 2111 root     [btrfs-worker]              rescuer_thread
 2112 root     [btrfs-worker-hi]           rescuer_thread
 2113 root     [btrfs-delalloc]            rescuer_thread
 2114 root     [btrfs-flush_del]           rescuer_thread
 2115 root     [btrfs-cache]               rescuer_thread
 2116 root     [btrfs-submit]              rescuer_thread
 2117 root     [btrfs-fixup]               rescuer_thread
 2119 root     [btrfs-endio]               rescuer_thread
 2120 root     [btrfs-endio-met]           rescuer_thread
 2121 root     [btrfs-endio-met]           rescuer_thread
 2122 root     [btrfs-endio-rai]           rescuer_thread
 2123 root     [btrfs-rmw]                 rescuer_thread
 2124 root     [btrfs-endio-wri]           rescuer_thread
 2125 root     [btrfs-freespace]           rescuer_thread
 2126 root     [btrfs-delayed-m]           rescuer_thread
 2127 root     [btrfs-readahead]           rescuer_thread
 2128 root     [btrfs-qgroup-re]           rescuer_thread
 3205 root     [btrfs-cleaner]             cleaner_kthread
 3206 root     [btrfs-transacti]           transaction_kthread
19156 root     gvim /etc/cron.d/btrfs_back poll_schedule_timeout
19729 root     btrfs send var_ro.20140521_ pipe_wait
19730 root     btrfs receive /mnt/btrfs_po sleep_on_page
19824 root     btrfs balance start -dusage btrfs_wait_and_free_delalloc_work
24611 root     /bin/sh -c cd /mnt/btrfs_po wait
24619 root     btrfs subvolume snapshot /m btrfs_start_delalloc_inodes
32044 root     /sbin/btrfs scrub start -Bd futex_wait_queue_me

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: 3.15.0-rc5: btrfs and sync deadlock: call_rwsem_down_read_failed / balance seems to create locks that block everything else
  2014-05-22  9:09 3.15.0-rc5: btrfs and sync deadlock: call_rwsem_down_read_failed Marc MERLIN
@ 2014-05-22 13:15 ` Marc MERLIN
  2014-05-22 20:52   ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2014-05-22 13:15 UTC (permalink / raw)
  To: linux-btrfs

On Thu, May 22, 2014 at 02:09:21AM -0700, Marc MERLIN wrote:
> I got my laptop to hang all IO to one of its devices again, this time
> drive #2.
> This is the 3rd time this has happened, and I've already lost data as a
> result, since anything that hasn't hit disk by then never makes it.
> 
> I was doing balance and btrfs send/receive.
> Then cron started a scrub in the background too.
> 
> IO to drive #1 was working fine, I didn't even notice that drive #2 IO
> was hung.
> 
> And then I typed sync and it never returned.
> 
> legolas:~# ps -eo pid,user,args,wchan  | grep  sync
> 23605 root     sync                        call_rwsem_down_read_failed
> 31885 root     sync                        call_rwsem_down_read_failed
> 
> What does this mean when sync is stuck that way?
> 
> When I'm in that state, accessing btrfs on drive 1 still works (read and
> write).
> Any access to drive 2 through btrfs hangs.

After reboot, I got hangs on drive 2 quickly:
[ 1559.667362] INFO: task btrfs-balance:3280 blocked for more than 120 seconds.
[ 1559.667374]       Not tainted 3.15.0-rc5-amd64-i915-preempt-20140216s2 #1
[ 1559.667379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1559.667383] btrfs-balance   D 0000000000000001     0  3280      2 0x00000000
[ 1559.667395]  ffff880408531c20 0000000000000046 000000000003da54 ffff880408531fd8
[ 1559.667405]  ffff880408fe8110 00000000000141c0 ffff8800ca1cc5e0 ffff8800ca1cc5e4
[ 1559.667414]  ffff880408fe8110 ffff8800ca1cc5e8 00000000ffffffff ffff880408531c30
[ 1559.667423] Call Trace:
[ 1559.667442]  [<ffffffff8161c896>] schedule+0x73/0x75
[ 1559.667451]  [<ffffffff8161cb57>] schedule_preempt_disabled+0x18/0x24
[ 1559.667459]  [<ffffffff8161dc7a>] __mutex_lock_slowpath+0x160/0x1d7
[ 1559.667466]  [<ffffffff8161dd08>] mutex_lock+0x17/0x27
[ 1559.667475]  [<ffffffff8126adb7>] btrfs_relocate_block_group+0x153/0x26d
[ 1559.667486]  [<ffffffff81249838>] btrfs_relocate_chunk.isra.23+0x5c/0x5e8
[ 1559.667494]  [<ffffffff8161efbb>] ? _raw_spin_unlock+0x17/0x2a
[ 1559.667502]  [<ffffffff81245584>] ? free_extent_buffer+0x8a/0x8d
[ 1559.667510]  [<ffffffff8124c0be>] btrfs_balance+0x9b6/0xb74
[ 1559.667517]  [<ffffffff81615c3d>] ? printk+0x54/0x56
[ 1559.667526]  [<ffffffff8124c27c>] ? btrfs_balance+0xb74/0xb74
[ 1559.667534]  [<ffffffff8124c2d5>] balance_kthread+0x59/0x7b
[ 1559.667542]  [<ffffffff8106b467>] kthread+0xae/0xb6
[ 1559.667549]  [<ffffffff8106b3b9>] ? __kthread_parkme+0x61/0x61
[ 1559.667557]  [<ffffffff81625b3c>] ret_from_fork+0x7c/0xb0
[ 1559.667563]  [<ffffffff8106b3b9>] ? __kthread_parkme+0x61/0x61
[ 1679.595668] INFO: task btrfs-balance:3280 blocked for more than 120 seconds.
[ 1679.595680]       Not tainted 3.15.0-rc5-amd64-i915-preempt-20140216s2 #1
[ 1679.595685] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Balance cancel hangs too and so does sync again:
legolas:~# ps -eo pid,user,args,wchan  | grep  btrfs
  527 root     [btrfs-worker]              rescuer_thread
  528 root     [btrfs-worker-hi]           rescuer_thread
  529 root     [btrfs-delalloc]            rescuer_thread
  530 root     [btrfs-flush_del]           rescuer_thread
  531 root     [btrfs-cache]               rescuer_thread
  532 root     [btrfs-submit]              rescuer_thread
  533 root     [btrfs-fixup]               rescuer_thread
  534 root     [btrfs-endio]               rescuer_thread
  535 root     [btrfs-endio-met]           rescuer_thread
  536 root     [btrfs-endio-met]           rescuer_thread
  537 root     [btrfs-endio-rai]           rescuer_thread
  538 root     [btrfs-rmw]                 rescuer_thread
  539 root     [btrfs-endio-wri]           rescuer_thread
  540 root     [btrfs-freespace]           rescuer_thread
  541 root     [btrfs-delayed-m]           rescuer_thread
  542 root     [btrfs-readahead]           rescuer_thread
  543 root     [btrfs-qgroup-re]           rescuer_thread
  544 root     [btrfs-cleaner]             cleaner_kthread
  545 root     [btrfs-transacti]           transaction_kthread
 2267 root     [btrfs-worker]              rescuer_thread
 2268 root     [btrfs-worker-hi]           rescuer_thread
 2269 root     [btrfs-delalloc]            rescuer_thread
 2271 root     [btrfs-flush_del]           rescuer_thread
 2272 root     [btrfs-cache]               rescuer_thread
 2275 root     [btrfs-submit]              rescuer_thread
 2276 root     [btrfs-fixup]               rescuer_thread
 2277 root     [btrfs-endio]               rescuer_thread
 2278 root     [btrfs-endio-met]           rescuer_thread
 2279 root     [btrfs-endio-met]           rescuer_thread
 2281 root     [btrfs-endio-rai]           rescuer_thread
 2282 root     [btrfs-rmw]                 rescuer_thread
 2283 root     [btrfs-endio-wri]           rescuer_thread
 2284 root     [btrfs-freespace]           rescuer_thread
 2285 root     [btrfs-delayed-m]           rescuer_thread
 2286 root     [btrfs-readahead]           rescuer_thread
 2288 root     [btrfs-qgroup-re]           rescuer_thread
 3278 root     [btrfs-cleaner]             sleep_on_page
 3279 root     [btrfs-transacti]           sleep_on_page
 3280 root     [btrfs-balance]             btrfs_relocate_block_group
14727 root     [kworker/u16:47]            btrfs_tree_lock
14770 root     [kworker/u16:90]            btrfs_tree_lock
22551 root     btrfs send var_ro.20140522_ pipe_wait
22552 root     btrfs receive /mnt/btrfs_po balance_dirty_pages_ratelimited
22593 root     [kworker/u16:3]             btrfs_tree_lock
25054 root     btrfs balance cancel .      btrfs_cancel_balance

I was able to stop my btrfs send/receive; that in turn unblocked sync,
which then succeeded (2 minutes later).
btrfs balance cancel did not return, but maybe that's normal.
I see:
legolas:~# btrfs balance status /mnt/btrfs_pool2/
Balance on '/mnt/btrfs_pool2/' is running, cancel requested
383 out of about 388 chunks balanced (457 considered),   1% left

It's been running for at least 15 minutes in 'cancel mode'. Is that normal?

The system doesn't seem hung, but it seems that running anything else while
balance is running creates an avalanche of locks that kills everything.

Is that a known performance problem?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


* Re: 3.15.0-rc5: btrfs and sync deadlock: call_rwsem_down_read_failed / balance seems to create locks that block everything else
  2014-05-22 13:15 ` 3.15.0-rc5: btrfs and sync deadlock: call_rwsem_down_read_failed / balance seems to create locks that block everything else Marc MERLIN
@ 2014-05-22 20:52   ` Duncan
  2014-05-23  0:22     ` Marc MERLIN
  0 siblings, 1 reply; 8+ messages in thread
From: Duncan @ 2014-05-22 20:52 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN posted on Thu, 22 May 2014 06:15:29 -0700 as excerpted:

> Balance cancel hangs too and so does sync [...]

For balance, if it comes to having to stop it on new mount after a 
shutdown, there is of course the skip_balance mount option.
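
Roughly (the device and mountpoint here are just placeholders):

# mount -o skip_balance /dev/sdX /mnt/btrfs_pool2
# btrfs balance cancel /mnt/btrfs_pool2

i.e. mount with the interrupted balance left paused, then cancel (or
resume) it explicitly once the filesystem is up.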

> I was able to stop my btrfs send/receive; that in turn unblocked sync,
> which then succeeded (2 minutes later).
> btrfs balance cancel did not return, but maybe that's normal.
> I see:
> legolas:~# btrfs balance status /mnt/btrfs_pool2/
> Balance on '/mnt/btrfs_pool2/' is running, cancel requested
> 383 out of about 388 chunks balanced (457 considered),   1% left
> 
> It's been running for at least 15 minutes in 'cancel mode'. Is that normal?

I'd guess so.  It's probably in the middle of operations for a single 
chunk, and only checks for cancel between chunks.  Given the possible 
complexity of those operations with snapshotting and quotas factored in 
as well as COW fragmentation, 15 minutes on a single chunk isn't 
/entirely/ out there.
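
Progress while waiting can at least be watched with a plain loop; a
trivial sketch using the mountpoint from above:

# watch -n 60 btrfs balance status /mnt/btrfs_pool2/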

That's symptomatic of the whole performance problem they're battling
ATM.  They've turned off snapshot-aware defrag for the time being, and
there's the quota handling rework in the pipeline, but...

> The system doesn't seem hung, but it seems that running anything else
> while balance is running creates an avalanche of locks that kills
> everything.
> 
> Is that a known performance problem?

Yes, at least in the sense that there's currently a definite known problem
with balance, snapshotting, snapshot deletion, and send all going on at
the same time, which can easily happen if some of those are cron jobs that
the admin who initiated the others wasn't thinking about.

I've seen patches for at least one related race (where snapshot deletion
could collide with balance or send) go by, and I don't believe they're in
Linus' mainline yet, though I haven't closely tracked their status beyond
that.

Basically, at this point running only one such "major" btrfs operation at
a time should drastically reduce the possibility of problems, because
there /are/ known races.  Even after the known races are fixed, it's
probably a good idea anyway where possible: one such operation is complex
enough on its own, and running several at once only slows them all down
while demanding more CPU/IO/memory bandwidth.  There /is/ recognition of
the very real likelihood that people /will/ end up doing it anyway,
especially since one or more of the operations may be cron jobs that the
admin isn't thinking about, so the developers /are/ trying to make it
work.  But "just don't do that" does remain the best policy, where it's
possible.  And of course right now there are known collision issues, so
definitely avoid it ATM.
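
If the conflicting operations are all coming from cron anyway, one way to
keep them from overlapping (only a sketch; the lock file and script names
are made up) is to serialize them behind a single flock(1) lock:

# hourly snapshot + send job
flock /var/lock/btrfs-maint /usr/local/bin/hourly-btrfs-backup

# periodic scrub job; -B keeps scrub in the foreground so the lock is held
flock /var/lock/btrfs-maint /sbin/btrfs scrub start -Bd /dev/sdX

That way at most one of the big operations runs at any given time.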

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: 3.15.0-rc5: btrfs and sync deadlock: call_rwsem_down_read_failed / balance seems to create locks that block everything else
  2014-05-22 20:52   ` Duncan
@ 2014-05-23  0:22     ` Marc MERLIN
  2014-05-23 14:17       ` 3.15.0-rc5: now sync and mount are hung on call_rwsem_down_write_failed Marc MERLIN
  0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2014-05-23  0:22 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Thu, May 22, 2014 at 08:52:34PM +0000, Duncan wrote:
> > It's been running for at least 15 minutes in 'cancel mode'. Is that normal?
> 
> I'd guess so.  It's probably in the middle of operations for a single 
> chunk, and only checks for cancel between chunks.  Given the possible 
> complexity of those operations with snapshotting and quotas factored in 
> as well as COW fragmentation, 15 minutes on a single chunk isn't 
> /entirely/ out there.

That's probably what I saw indeed.
 
> That's symptomatic of the whole performance problem they're battling
> ATM.  They've turned off snapshot-aware defrag for the time being, and
> there's the quota handling rework in the pipeline, but...

Right. I'm just surprised that sync would hang too. That feels pretty
bad.

> I've seen patches for at least one related race (where snapshot deletion
> could collide with balance or send) go by, and I don't believe they're in
> Linus' mainline yet, though I haven't closely tracked their status beyond
> that.
 
That's indeed what I've been seeing, and since I run both snapshots and
btrfs send from cron, I'm hitting this too often :(
If, God forbid, scrub kicks in from cron too, then I'm toast.

> Basically, at this point running only one such "major" btrfs operation at
> a time should drastically reduce the possibility of problems, because
> there /are/ known races.  Even after the known races are fixed, it's
> probably a good idea anyway where possible: one such operation is complex
> enough on its own, and running several at once only slows them all down
> while demanding more CPU/IO/memory bandwidth.  There /is/ recognition of
> the very real likelihood that people /will/ end up doing it anyway,
> especially since one or more of the operations may be cron 

The thing is that scrub takes hours to run.
I run btrfs send and snapshots once an hour for backups.

I'm not too keen on stopping backups for hours while scrub runs.
I understand it's a workaround for now though.

I've just stopped scrub altogether now and will see if I still have
problems.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: 3.15.0-rc5: now sync and mount are hung on call_rwsem_down_write_failed
  2014-05-23  0:22     ` Marc MERLIN
@ 2014-05-23 14:17       ` Marc MERLIN
  2014-05-23 20:24         ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2014-05-23 14:17 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs, takeuchi_satoru

I had btrfs send/receive running.

Plugging the power in caused laptop-mode to remount my root partition.

That hung, and in turn all of btrfs hung too.

 7668 root     btrfs send home_ro.20140523 -
 7669 root     btrfs receive /mnt/btrfs_po sleep_on_page
12118 root     mount /dev/mapper/cryptroot call_rwsem_down_write_failed
10678 merlin   mencoder -passlogfile dsc04 sleep_on_page

Clearly, it takes very little for 3.15rc5 to deadlock :(

(I wasn't running balance or snapshot this time)

I was able to kill btrfs send and receive, but mencoder is very hung, and
sync does not finish either:
10654 merlin   sync                        sync_inodes_sb
17191 merlin   sync                        call_rwsem_down_read_failed

I'm not posting the sysrq-w every time, but I have it available if needed.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


* Re: 3.15.0-rc5: now sync and mount are hung on call_rwsem_down_write_failed
  2014-05-23 14:17       ` 3.15.0-rc5: now sync and mount are hung on call_rwsem_down_write_failed Marc MERLIN
@ 2014-05-23 20:24         ` Chris Mason
  2014-05-23 23:13           ` Marc MERLIN
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2014-05-23 20:24 UTC (permalink / raw)
  To: Marc MERLIN, Duncan; +Cc: linux-btrfs, takeuchi_satoru

On 05/23/2014 10:17 AM, Marc MERLIN wrote:
> I had btrfs send/receive running.
> 
> Plugging the power in caused laptop-mode to remount my root partition.
> 
> That hung, and in turn all of btrfs hung too.
> 
>  7668 root     btrfs send home_ro.20140523 -
>  7669 root     btrfs receive /mnt/btrfs_po sleep_on_page
> 12118 root     mount /dev/mapper/cryptroot call_rwsem_down_write_failed
> 10678 merlin   mencoder -passlogfile dsc04 sleep_on_page
> 
> Clearly, it takes very little for 3.15rc5 to deadlock :(
> 
> (I wasn't running balance or snapshot this time)
> 
> I was able to kill btrfs send and receive, but mencoder is very hung, and
> sync does not finish either:
> 10654 merlin   sync                        sync_inodes_sb
> 17191 merlin   sync                        call_rwsem_down_read_failed
> 
> I'm not posting the sysrq-w every time, but I have it available if needed.

Hi Marc,

Can I have the sysrq-w from this one if it's still available?

-chris



* Re: 3.15.0-rc5: now sync and mount are hung on call_rwsem_down_write_failed
  2014-05-23 20:24         ` Chris Mason
@ 2014-05-23 23:13           ` Marc MERLIN
  2014-05-27 19:27             ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2014-05-23 23:13 UTC (permalink / raw)
  To: Chris Mason; +Cc: Duncan, linux-btrfs, takeuchi_satoru

On Fri, May 23, 2014 at 04:24:49PM -0400, Chris Mason wrote:
> > I was able to kill btrfs send and receive, but mencoder is very hung, and
> > sync does not finish either:
> > 10654 merlin   sync                        sync_inodes_sb
> > 17191 merlin   sync                        call_rwsem_down_read_failed
> > 
> > I'm not posting the sysrq-w every time, but I have it available if needed.
> 
> Hi Marc,
> 
> Can I have the sysrq-w from this one if it's still available?

Argh, I just found out that the bug caused neither of the two copies to
ever be committed to disk (including the one on an ext4 partition), and
the remote syslog lost too much for it to be useful.

What's weirder is the previous one: I was able to copy the syslog data
that never got committed to disk (but was still in the page cache) to
another machine, and I just realized that that copy is missing the
beginning (it starts at cpu #4).

So it looks like the only complete one I have right now is
http://marc.merlins.org/tmp/btrfs-hang.txt

If you need more, please let me know, and I'll make sure that I save
that very carefully next time.
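
(One idea for next time, untested here: have the kernel push its messages
off the box as they are generated, so a local hang can't lose them. A
minimal netconsole sketch, with the interface and addresses as placeholders:

# on the hanging machine
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.1/

# on the machine collecting the logs (netcat syntax varies by flavor)
nc -u -l -p 6666 | tee btrfs-hang-sysrq.txt
)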

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: 3.15.0-rc5: now sync and mount are hung on call_rwsem_down_write_failed
  2014-05-23 23:13           ` Marc MERLIN
@ 2014-05-27 19:27             ` Chris Mason
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Mason @ 2014-05-27 19:27 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Duncan, linux-btrfs, takeuchi_satoru



On 05/23/2014 07:13 PM, Marc MERLIN wrote:
> On Fri, May 23, 2014 at 04:24:49PM -0400, Chris Mason wrote:
>>> I was able to kill btrfs send and receive, but mencoder is very hung, and
>>> sync does not finish either:
>>> 10654 merlin   sync                        sync_inodes_sb
>>> 17191 merlin   sync                        call_rwsem_down_read_failed
>>>
>>> I'm not posting the sysrq-w every time, but I have it available if needed.
>>
>> Hi Marc,
>>
>> Can I have the sysrq-w from this one if it's still available?
> 
> Argh, I just found out that the bug caused neither of the two copies to
> ever be committed to disk (including the one on an ext4 partition), and
> the remote syslog lost too much for it to be useful.
> 
> What's weirder is the previous one: I was able to copy the syslog data
> that never got committed to disk (but was still in the page cache) to
> another machine, and I just realized that that copy is missing the
> beginning (it starts at cpu #4).
> 
> So it looks like the only complete one I have right now is
> http://marc.merlins.org/tmp/btrfs-hang.txt
> 
> If you need more, please let me know, and I'll make sure that I save
> that very carefully next time.

It's not 100% clear what is going on here.  You have a number of procs
waiting for page locks, one of which is trying to read in your free
space cache.

Was this one of your machines with metadata corruption?  More traces
definitely help.
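
If the free space cache itself turns out to be part of the problem, one
thing worth trying (a sketch, not a diagnosis) is to rebuild it once with
clear_cache, or take it out of the picture with nospace_cache while
debugging; device and mountpoint are placeholders:

# mount -o clear_cache /dev/sdX /mnt/btrfs_pool2
# mount -o nospace_cache /dev/sdX /mnt/btrfs_pool2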

-chris


