All of lore.kernel.org
 help / color / mirror / Atom feed
* Delayed inode operations not doing the right thing with enospc
@ 2011-06-03 18:46 Josef Bacik
  2011-06-07  1:39 ` Miao Xie
  0 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2011-06-03 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: miaox

Hello,

I got a lot of these when running stress.sh on my test box

[ 9792.654889] ------------[ cut here ]------------
[ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
btrfs_alloc_free_block+0xca/0x27c [btrfs]()
[ 9792.654899] Hardware name: To Be Filled By O.E.M.
[ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
[ 9792.654919] Pid: 2762, comm: rm Tainted: G        W   2.6.39+ #1
[ 9792.654920] Call Trace:
[ 9792.654922]  [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b
[ 9792.654925]  [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c
[ 9792.654933]  [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c
[btrfs]
[ 9792.654945]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
[ 9792.654953]  [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs]
[ 9792.654963]  [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58
[btrfs]
[ 9792.654970]  [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368
[btrfs]
[ 9792.654978]  [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs]
[ 9792.654986]  [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs]
[ 9792.654997]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
[ 9792.655022]  [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
[ 9792.655025]  [<ffffffff8147afac>] ? _cond_resched+0xe/0x22
[ 9792.655027]  [<ffffffff8147b892>] ? mutex_lock+0x29/0x50
[ 9792.655039]  [<ffffffffa03d41b1>]
btrfs_update_delayed_inode+0x72/0x137 [btrfs]
[ 9792.655051]  [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb
[btrfs]
[ 9792.655062]  [<ffffffffa039a69b>]
btrfs_commit_transaction+0x228/0x654 [btrfs]
[ 9792.655064]  [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a
[ 9792.655075]  [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs]
[ 9792.655077]  [<ffffffff81132bd6>] evict+0x71/0x111
[ 9792.655079]  [<ffffffff81132de0>] iput+0x12a/0x132
[ 9792.655081]  [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155
[ 9792.655083]  [<ffffffff81127b83>] ? path_put+0x1f/0x23
[ 9792.655085]  [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171
[ 9792.655087]  [<ffffffff81128410>] ? putname+0x34/0x36
[ 9792.655090]  [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b
[ 9792.655092]  [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b
[ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---


This is because use_block_rsv() is having to do a
reserve_metadata_bytes(), which shouldn't happen as we should have
reserved enough space for those operations to complete.  This is
happening because use_block_rsv() will call get_block_rsv(), which if
root->ref_cows is set (which is the case on all fs roots) we will use
trans->block_rsv, which will only have what the current transaction
starter had reserved.

What needs to be done instead is we need to have a block reserve that
any reservation that is done at create time for these inodes is migrated
to this special reserve, and then when you run the delayed inode items
stuff you set trans->block_rsv to the special block reserve so the
accounting is all done properly.

This is just off the top of my head, there may be a better way to do it,
I've not actually looked that the delayed inode code at all.

I would do this myself but I have a ever increasing list of shit to do
so will somebody pick this up and fix it please?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-06-03 18:46 Delayed inode operations not doing the right thing with enospc Josef Bacik
@ 2011-06-07  1:39 ` Miao Xie
  2011-06-07 13:23   ` Josef Bacik
  2011-06-07 21:04   ` Josef Bacik
  0 siblings, 2 replies; 13+ messages in thread
From: Miao Xie @ 2011-06-07  1:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
> I got a lot of these when running stress.sh on my test box
> 
> [ 9792.654889] ------------[ cut here ]------------
> [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
> btrfs_alloc_free_block+0xca/0x27c [btrfs]()
> [ 9792.654899] Hardware name: To Be Filled By O.E.M.
> [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
> arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
> snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
> pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
> i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
> mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
> ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
> drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
> [ 9792.654919] Pid: 2762, comm: rm Tainted: G        W   2.6.39+ #1
> [ 9792.654920] Call Trace:
> [ 9792.654922]  [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b
> [ 9792.654925]  [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c
> [ 9792.654933]  [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c
> [btrfs]
> [ 9792.654945]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
> [ 9792.654953]  [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs]
> [ 9792.654963]  [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58
> [btrfs]
> [ 9792.654970]  [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368
> [btrfs]
> [ 9792.654978]  [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs]
> [ 9792.654986]  [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs]
> [ 9792.654997]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
> [ 9792.655022]  [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
> [ 9792.655025]  [<ffffffff8147afac>] ? _cond_resched+0xe/0x22
> [ 9792.655027]  [<ffffffff8147b892>] ? mutex_lock+0x29/0x50
> [ 9792.655039]  [<ffffffffa03d41b1>]
> btrfs_update_delayed_inode+0x72/0x137 [btrfs]
> [ 9792.655051]  [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb
> [btrfs]
> [ 9792.655062]  [<ffffffffa039a69b>]
> btrfs_commit_transaction+0x228/0x654 [btrfs]
> [ 9792.655064]  [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a
> [ 9792.655075]  [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs]
> [ 9792.655077]  [<ffffffff81132bd6>] evict+0x71/0x111
> [ 9792.655079]  [<ffffffff81132de0>] iput+0x12a/0x132
> [ 9792.655081]  [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155
> [ 9792.655083]  [<ffffffff81127b83>] ? path_put+0x1f/0x23
> [ 9792.655085]  [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171
> [ 9792.655087]  [<ffffffff81128410>] ? putname+0x34/0x36
> [ 9792.655090]  [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b
> [ 9792.655092]  [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b
> [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---
> 
> 
> This is because use_block_rsv() is having to do a
> reserve_metadata_bytes(), which shouldn't happen as we should have
> reserved enough space for those operations to complete.  This is
> happening because use_block_rsv() will call get_block_rsv(), which if
> root->ref_cows is set (which is the case on all fs roots) we will use
> trans->block_rsv, which will only have what the current transaction
> starter had reserved.
> 
> What needs to be done instead is we need to have a block reserve that
> any reservation that is done at create time for these inodes is migrated
> to this special reserve, and then when you run the delayed inode items
> stuff you set trans->block_rsv to the special block reserve so the
> accounting is all done properly.
> 
> This is just off the top of my head, there may be a better way to do it,
> I've not actually looked that the delayed inode code at all.
> 
> I would do this myself but I have a ever increasing list of shit to do
> so will somebody pick this up and fix it please?  Thanks,

Sorry, it's my miss.
I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
the space from trans_block_rsv to global_block_rsv.

I'll fix it soon.

Thanks for your help.
Miao

> 
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-06-07  1:39 ` Miao Xie
@ 2011-06-07 13:23   ` Josef Bacik
  2011-06-07 21:04   ` Josef Bacik
  1 sibling, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2011-06-07 13:23 UTC (permalink / raw)
  To: miaox; +Cc: linux-btrfs

On 06/06/2011 09:39 PM, Miao Xie wrote:
> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>> I got a lot of these when running stress.sh on my test box
>>
>> [ 9792.654889] ------------[ cut here ]------------
>> [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
>> btrfs_alloc_free_block+0xca/0x27c [btrfs]()
>> [ 9792.654899] Hardware name: To Be Filled By O.E.M.
>> [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
>> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
>> arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
>> snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
>> pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
>> i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
>> mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
>> ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
>> drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
>> [ 9792.654919] Pid: 2762, comm: rm Tainted: G        W   2.6.39+ #1
>> [ 9792.654920] Call Trace:
>> [ 9792.654922]  [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b
>> [ 9792.654925]  [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c
>> [ 9792.654933]  [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c
>> [btrfs]
>> [ 9792.654945]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
>> [ 9792.654953]  [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs]
>> [ 9792.654963]  [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58
>> [btrfs]
>> [ 9792.654970]  [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368
>> [btrfs]
>> [ 9792.654978]  [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs]
>> [ 9792.654986]  [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs]
>> [ 9792.654997]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
>> [ 9792.655022]  [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
>> [ 9792.655025]  [<ffffffff8147afac>] ? _cond_resched+0xe/0x22
>> [ 9792.655027]  [<ffffffff8147b892>] ? mutex_lock+0x29/0x50
>> [ 9792.655039]  [<ffffffffa03d41b1>]
>> btrfs_update_delayed_inode+0x72/0x137 [btrfs]
>> [ 9792.655051]  [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb
>> [btrfs]
>> [ 9792.655062]  [<ffffffffa039a69b>]
>> btrfs_commit_transaction+0x228/0x654 [btrfs]
>> [ 9792.655064]  [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a
>> [ 9792.655075]  [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs]
>> [ 9792.655077]  [<ffffffff81132bd6>] evict+0x71/0x111
>> [ 9792.655079]  [<ffffffff81132de0>] iput+0x12a/0x132
>> [ 9792.655081]  [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155
>> [ 9792.655083]  [<ffffffff81127b83>] ? path_put+0x1f/0x23
>> [ 9792.655085]  [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171
>> [ 9792.655087]  [<ffffffff81128410>] ? putname+0x34/0x36
>> [ 9792.655090]  [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b
>> [ 9792.655092]  [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b
>> [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---
>>
>>
>> This is because use_block_rsv() is having to do a
>> reserve_metadata_bytes(), which shouldn't happen as we should have
>> reserved enough space for those operations to complete.  This is
>> happening because use_block_rsv() will call get_block_rsv(), which if
>> root->ref_cows is set (which is the case on all fs roots) we will use
>> trans->block_rsv, which will only have what the current transaction
>> starter had reserved.
>>
>> What needs to be done instead is we need to have a block reserve that
>> any reservation that is done at create time for these inodes is migrated
>> to this special reserve, and then when you run the delayed inode items
>> stuff you set trans->block_rsv to the special block reserve so the
>> accounting is all done properly.
>>
>> This is just off the top of my head, there may be a better way to do it,
>> I've not actually looked that the delayed inode code at all.
>>
>> I would do this myself but I have a ever increasing list of shit to do
>> so will somebody pick this up and fix it please?  Thanks,
>
> Sorry, it's my miss.
> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
> the space from trans_block_rsv to global_block_rsv.
>
> I'll fix it soon.
>

Great thanks Miao,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-06-07  1:39 ` Miao Xie
  2011-06-07 13:23   ` Josef Bacik
@ 2011-06-07 21:04   ` Josef Bacik
  2011-07-12 15:20       ` Christian Brunner
  1 sibling, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2011-06-07 21:04 UTC (permalink / raw)
  To: miaox; +Cc: linux-btrfs

On 06/06/2011 09:39 PM, Miao Xie wrote:
> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>> I got a lot of these when running stress.sh on my test box
>>
>> [ 9792.654889] ------------[ cut here ]------------
>> [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
>> btrfs_alloc_free_block+0xca/0x27c [btrfs]()
>> [ 9792.654899] Hardware name: To Be Filled By O.E.M.
>> [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
>> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
>> arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
>> snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
>> pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
>> i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
>> mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
>> ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
>> drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
>> [ 9792.654919] Pid: 2762, comm: rm Tainted: G        W   2.6.39+ #1
>> [ 9792.654920] Call Trace:
>> [ 9792.654922]  [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b
>> [ 9792.654925]  [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c
>> [ 9792.654933]  [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c
>> [btrfs]
>> [ 9792.654945]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
>> [ 9792.654953]  [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs]
>> [ 9792.654963]  [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58
>> [btrfs]
>> [ 9792.654970]  [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368
>> [btrfs]
>> [ 9792.654978]  [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs]
>> [ 9792.654986]  [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs]
>> [ 9792.654997]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
>> [ 9792.655022]  [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
>> [ 9792.655025]  [<ffffffff8147afac>] ? _cond_resched+0xe/0x22
>> [ 9792.655027]  [<ffffffff8147b892>] ? mutex_lock+0x29/0x50
>> [ 9792.655039]  [<ffffffffa03d41b1>]
>> btrfs_update_delayed_inode+0x72/0x137 [btrfs]
>> [ 9792.655051]  [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb
>> [btrfs]
>> [ 9792.655062]  [<ffffffffa039a69b>]
>> btrfs_commit_transaction+0x228/0x654 [btrfs]
>> [ 9792.655064]  [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a
>> [ 9792.655075]  [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs]
>> [ 9792.655077]  [<ffffffff81132bd6>] evict+0x71/0x111
>> [ 9792.655079]  [<ffffffff81132de0>] iput+0x12a/0x132
>> [ 9792.655081]  [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155
>> [ 9792.655083]  [<ffffffff81127b83>] ? path_put+0x1f/0x23
>> [ 9792.655085]  [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171
>> [ 9792.655087]  [<ffffffff81128410>] ? putname+0x34/0x36
>> [ 9792.655090]  [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b
>> [ 9792.655092]  [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b
>> [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---
>>
>>
>> This is because use_block_rsv() is having to do a
>> reserve_metadata_bytes(), which shouldn't happen as we should have
>> reserved enough space for those operations to complete.  This is
>> happening because use_block_rsv() will call get_block_rsv(), which if
>> root->ref_cows is set (which is the case on all fs roots) we will use
>> trans->block_rsv, which will only have what the current transaction
>> starter had reserved.
>>
>> What needs to be done instead is we need to have a block reserve that
>> any reservation that is done at create time for these inodes is migrated
>> to this special reserve, and then when you run the delayed inode items
>> stuff you set trans->block_rsv to the special block reserve so the
>> accounting is all done properly.
>>
>> This is just off the top of my head, there may be a better way to do it,
>> I've not actually looked that the delayed inode code at all.
>>
>> I would do this myself but I have a ever increasing list of shit to do
>> so will somebody pick this up and fix it please?  Thanks,
> 
> Sorry, it's my miss.
> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
> the space from trans_block_rsv to global_block_rsv.
> 
> I'll fix it soon.
> 

There is another problem, we're failing xfstest 204.  I tried making
reserve_metadata_bytes commit the transaction regardless of whether or
not there were pinned bytes but the test just hung there.  Usually it
takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
204 just creates a crap ton of files, which is what is killing us.
There needs to be a way to start flushing delayed inode items so we can
reclaim the space they are holding onto so we don't get enospc, and it
needs to be better than just committing the transaction because that is
dog slow.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-06-07 21:04   ` Josef Bacik
@ 2011-07-12 15:20       ` Christian Brunner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brunner @ 2011-07-12 15:20 UTC (permalink / raw)
  To: Josef Bacik; +Cc: miaox, linux-btrfs, ceph-devel

2011/6/7 Josef Bacik <josef@redhat.com>:
> On 06/06/2011 09:39 PM, Miao Xie wrote:
>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>> I got a lot of these when running stress.sh on my test box
>>>
>>>
>>>
>>> This is because use_block_rsv() is having to do a
>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>> reserved enough space for those operations to complete. =A0This is
>>> happening because use_block_rsv() will call get_block_rsv(), which =
if
>>> root->ref_cows is set (which is the case on all fs roots) we will u=
se
>>> trans->block_rsv, which will only have what the current transaction
>>> starter had reserved.
>>>
>>> What needs to be done instead is we need to have a block reserve th=
at
>>> any reservation that is done at create time for these inodes is mig=
rated
>>> to this special reserve, and then when you run the delayed inode it=
ems
>>> stuff you set trans->block_rsv to the special block reserve so the
>>> accounting is all done properly.
>>>
>>> This is just off the top of my head, there may be a better way to d=
o it,
>>> I've not actually looked that the delayed inode code at all.
>>>
>>> I would do this myself but I have a ever increasing list of shit to=
 do
>>> so will somebody pick this up and fix it please? =A0Thanks,
>>
>> Sorry, it's my miss.
>> I forgot to set trans->block_rsv to global_block_rsv, since we have =
migrated
>> the space from trans_block_rsv to global_block_rsv.
>>
>> I'll fix it soon.
>>
>
> There is another problem, we're failing xfstest 204. =A0I tried makin=
g
> reserve_metadata_bytes commit the transaction regardless of whether o=
r
> not there were pinned bytes but the test just hung there. =A0Usually =
it
> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
> 204 just creates a crap ton of files, which is what is killing us.
> There needs to be a way to start flushing delayed inode items so we c=
an
> reclaim the space they are holding onto so we don't get enospc, and i=
t
> needs to be better than just committing the transaction because that =
is
> dog slow. =A0Thanks,
>
> Josef

Is there a solution for this?

I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
(except the pluging). When starting a ceph rebuild on the btrfs
volumes I get a lot of warnings from block_rsv_use_bytes in
use_block_rsv:

[ 2157.922054] ------------[ cut here ]------------
[ 2157.927270] WARNING: at fs/btrfs/extent-tree.c:5683
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[ 2157.937123] Hardware name: ProLiant DL180 G6
[ 2157.942132] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 pcspkr serio_raw iTCO_wdt iTCO_vendor_support ghes hed
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[ 2157.967386] Pid: 10280, comm: btrfs-freespace Tainted: P        W
2.6.38.8-1.fits.4.el6.x86_64 #1
[ 2157.977554] Call Trace:
[ 2157.980383]  [<ffffffff8106482f>] ? warn_slowpath_common+0x7f/0xc0
[ 2157.987382]  [<ffffffff8106488a>] ? warn_slowpath_null+0x1a/0x20
[ 2157.994192]  [<ffffffffa0240b88>] ?
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[ 2158.002354]  [<ffffffffa026eda8>] ? read_extent_buffer+0xd8/0x1d0 [b=
trfs]
[ 2158.010014]  [<ffffffffa0231132>] ? split_leaf+0x142/0x8c0 [btrfs]
[ 2158.016990]  [<ffffffffa022a29b>] ? generic_bin_search+0x19b/0x210 [=
btrfs]
[ 2158.024784]  [<ffffffffa022ca1a>] ? btrfs_leaf_free_space+0x8a/0xe0 =
[btrfs]
[ 2158.032627]  [<ffffffffa0231f83>] ? btrfs_search_slot+0x6d3/0x7a0 [b=
trfs]
[ 2158.040325]  [<ffffffffa0244942>] ?
btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[ 2158.048477]  [<ffffffffa026ffda>] ? clear_extent_bit+0x17a/0x440 [bt=
rfs]
[ 2158.056026]  [<ffffffffa024ffc5>] ? add_pending_csums+0x45/0x70 [btr=
fs]
[ 2158.063530]  [<ffffffffa0252dad>] ?
btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[ 2158.071755]  [<ffffffffa0252f2c>] ?
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[ 2158.080172]  [<ffffffffa027049b>] ?
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[ 2158.088505]  [<ffffffff815406fb>] ? schedule_timeout+0x17b/0x2e0
[ 2158.095258]  [<ffffffff8118964d>] ? bio_endio+0x1d/0x40
[ 2158.101171]  [<ffffffffa0247764>] ? end_workqueue_fn+0xf4/0x130 [btr=
fs]
[ 2158.108621]  [<ffffffffa027b30e>] ? worker_loop+0x13e/0x540 [btrfs]
[ 2158.115703]  [<ffffffffa027b1d0>] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.122563]  [<ffffffffa027b1d0>] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.129413]  [<ffffffff81086356>] ? kthread+0x96/0xa0
[ 2158.135093]  [<ffffffff8100ce44>] ? kernel_thread_helper+0x4/0x10
[ 2158.141913]  [<ffffffff810862c0>] ? kthread+0x0/0xa0
[ 2158.147467]  [<ffffffff8100ce40>] ? kernel_thread_helper+0x0/0x10
[ 2158.154287] ---[ end trace 55e53c726a04ecd7 ]---

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
@ 2011-07-12 15:20       ` Christian Brunner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brunner @ 2011-07-12 15:20 UTC (permalink / raw)
  To: Josef Bacik; +Cc: miaox, linux-btrfs, ceph-devel

2011/6/7 Josef Bacik <josef@redhat.com>:
> On 06/06/2011 09:39 PM, Miao Xie wrote:
>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>> I got a lot of these when running stress.sh on my test box
>>>
>>>
>>>
>>> This is because use_block_rsv() is having to do a
>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>> reserved enough space for those operations to complete.  This is
>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>> trans->block_rsv, which will only have what the current transaction
>>> starter had reserved.
>>>
>>> What needs to be done instead is we need to have a block reserve that
>>> any reservation that is done at create time for these inodes is migrated
>>> to this special reserve, and then when you run the delayed inode items
>>> stuff you set trans->block_rsv to the special block reserve so the
>>> accounting is all done properly.
>>>
>>> This is just off the top of my head, there may be a better way to do it,
>>> I've not actually looked that the delayed inode code at all.
>>>
>>> I would do this myself but I have a ever increasing list of shit to do
>>> so will somebody pick this up and fix it please?  Thanks,
>>
>> Sorry, it's my miss.
>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>> the space from trans_block_rsv to global_block_rsv.
>>
>> I'll fix it soon.
>>
>
> There is another problem, we're failing xfstest 204.  I tried making
> reserve_metadata_bytes commit the transaction regardless of whether or
> not there were pinned bytes but the test just hung there.  Usually it
> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
> 204 just creates a crap ton of files, which is what is killing us.
> There needs to be a way to start flushing delayed inode items so we can
> reclaim the space they are holding onto so we don't get enospc, and it
> needs to be better than just committing the transaction because that is
> dog slow.  Thanks,
>
> Josef

Is there a solution for this?

I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
(except the pluging). When starting a ceph rebuild on the btrfs
volumes I get a lot of warnings from block_rsv_use_bytes in
use_block_rsv:

[ 2157.922054] ------------[ cut here ]------------
[ 2157.927270] WARNING: at fs/btrfs/extent-tree.c:5683
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[ 2157.937123] Hardware name: ProLiant DL180 G6
[ 2157.942132] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 pcspkr serio_raw iTCO_wdt iTCO_vendor_support ghes hed
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[ 2157.967386] Pid: 10280, comm: btrfs-freespace Tainted: P        W
2.6.38.8-1.fits.4.el6.x86_64 #1
[ 2157.977554] Call Trace:
[ 2157.980383]  [<ffffffff8106482f>] ? warn_slowpath_common+0x7f/0xc0
[ 2157.987382]  [<ffffffff8106488a>] ? warn_slowpath_null+0x1a/0x20
[ 2157.994192]  [<ffffffffa0240b88>] ?
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[ 2158.002354]  [<ffffffffa026eda8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[ 2158.010014]  [<ffffffffa0231132>] ? split_leaf+0x142/0x8c0 [btrfs]
[ 2158.016990]  [<ffffffffa022a29b>] ? generic_bin_search+0x19b/0x210 [btrfs]
[ 2158.024784]  [<ffffffffa022ca1a>] ? btrfs_leaf_free_space+0x8a/0xe0 [btrfs]
[ 2158.032627]  [<ffffffffa0231f83>] ? btrfs_search_slot+0x6d3/0x7a0 [btrfs]
[ 2158.040325]  [<ffffffffa0244942>] ?
btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[ 2158.048477]  [<ffffffffa026ffda>] ? clear_extent_bit+0x17a/0x440 [btrfs]
[ 2158.056026]  [<ffffffffa024ffc5>] ? add_pending_csums+0x45/0x70 [btrfs]
[ 2158.063530]  [<ffffffffa0252dad>] ?
btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[ 2158.071755]  [<ffffffffa0252f2c>] ?
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[ 2158.080172]  [<ffffffffa027049b>] ?
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[ 2158.088505]  [<ffffffff815406fb>] ? schedule_timeout+0x17b/0x2e0
[ 2158.095258]  [<ffffffff8118964d>] ? bio_endio+0x1d/0x40
[ 2158.101171]  [<ffffffffa0247764>] ? end_workqueue_fn+0xf4/0x130 [btrfs]
[ 2158.108621]  [<ffffffffa027b30e>] ? worker_loop+0x13e/0x540 [btrfs]
[ 2158.115703]  [<ffffffffa027b1d0>] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.122563]  [<ffffffffa027b1d0>] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.129413]  [<ffffffff81086356>] ? kthread+0x96/0xa0
[ 2158.135093]  [<ffffffff8100ce44>] ? kernel_thread_helper+0x4/0x10
[ 2158.141913]  [<ffffffff810862c0>] ? kthread+0x0/0xa0
[ 2158.147467]  [<ffffffff8100ce40>] ? kernel_thread_helper+0x0/0x10
[ 2158.154287] ---[ end trace 55e53c726a04ecd7 ]---

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-07-12 15:20       ` Christian Brunner
  (?)
@ 2011-07-12 15:25       ` Josef Bacik
  -1 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2011-07-12 15:25 UTC (permalink / raw)
  To: chb; +Cc: miaox, linux-btrfs, ceph-devel

On 07/12/2011 11:20 AM, Christian Brunner wrote:
> 2011/6/7 Josef Bacik <josef@redhat.com>:
>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>> I got a lot of these when running stress.sh on my test box
>>>>
>>>>
>>>>
>>>> This is because use_block_rsv() is having to do a
>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>> reserved enough space for those operations to complete.  This is
>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>> trans->block_rsv, which will only have what the current transaction
>>>> starter had reserved.
>>>>
>>>> What needs to be done instead is we need to have a block reserve that
>>>> any reservation that is done at create time for these inodes is migrated
>>>> to this special reserve, and then when you run the delayed inode items
>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>> accounting is all done properly.
>>>>
>>>> This is just off the top of my head, there may be a better way to do it,
>>>> I've not actually looked that the delayed inode code at all.
>>>>
>>>> I would do this myself but I have a ever increasing list of shit to do
>>>> so will somebody pick this up and fix it please?  Thanks,
>>>
>>> Sorry, it's my miss.
>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>> the space from trans_block_rsv to global_block_rsv.
>>>
>>> I'll fix it soon.
>>>
>>
>> There is another problem, we're failing xfstest 204.  I tried making
>> reserve_metadata_bytes commit the transaction regardless of whether or
>> not there were pinned bytes but the test just hung there.  Usually it
>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>> 204 just creates a crap ton of files, which is what is killing us.
>> There needs to be a way to start flushing delayed inode items so we can
>> reclaim the space they are holding onto so we don't get enospc, and it
>> needs to be better than just committing the transaction because that is
>> dog slow.  Thanks,
>>
>> Josef
> 
> Is there a solution for this?
> 
> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
> (except the pluging). When starting a ceph rebuild on the btrfs
> volumes I get a lot of warnings from block_rsv_use_bytes in
> use_block_rsv:
> 

Yeah there is something wonky going on here, I meant to take a look this
week but I will go ahead and look into it now.  I have a way to
reproduce it thankfully, but I may have you run my patches when I get
somewhere.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-07-12 15:20       ` Christian Brunner
  (?)
  (?)
@ 2011-07-13 14:56       ` Josef Bacik
  2011-07-14  7:27           ` Christian Brunner
  -1 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2011-07-13 14:56 UTC (permalink / raw)
  To: chb; +Cc: miaox, linux-btrfs, ceph-devel

On 07/12/2011 11:20 AM, Christian Brunner wrote:
> 2011/6/7 Josef Bacik <josef@redhat.com>:
>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>> I got a lot of these when running stress.sh on my test box
>>>>
>>>>
>>>>
>>>> This is because use_block_rsv() is having to do a
>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>> reserved enough space for those operations to complete.  This is
>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>> trans->block_rsv, which will only have what the current transaction
>>>> starter had reserved.
>>>>
>>>> What needs to be done instead is we need to have a block reserve that
>>>> any reservation that is done at create time for these inodes is migrated
>>>> to this special reserve, and then when you run the delayed inode items
>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>> accounting is all done properly.
>>>>
>>>> This is just off the top of my head, there may be a better way to do it,
>>>> I've not actually looked that the delayed inode code at all.
>>>>
>>>> I would do this myself but I have a ever increasing list of shit to do
>>>> so will somebody pick this up and fix it please?  Thanks,
>>>
>>> Sorry, it's my miss.
>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>> the space from trans_block_rsv to global_block_rsv.
>>>
>>> I'll fix it soon.
>>>
>>
>> There is another problem, we're failing xfstest 204.  I tried making
>> reserve_metadata_bytes commit the transaction regardless of whether or
>> not there were pinned bytes but the test just hung there.  Usually it
>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>> 204 just creates a crap ton of files, which is what is killing us.
>> There needs to be a way to start flushing delayed inode items so we can
>> reclaim the space they are holding onto so we don't get enospc, and it
>> needs to be better than just committing the transaction because that is
>> dog slow.  Thanks,
>>
>> Josef
> 
> Is there a solution for this?
> 
> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
> (except the pluging). When starting a ceph rebuild on the btrfs
> volumes I get a lot of warnings from block_rsv_use_bytes in
> use_block_rsv:
> 

Ok I think I've got this nailed down.  Will you run with this patch and make sure the warnings go away?  Thanks,

Josef

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 52d7eca..2263d29 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -112,9 +112,6 @@ struct btrfs_inode {
 	 */
 	u64 disk_i_size;
 
-	/* flags field from the on disk inode */
-	u32 flags;
-
 	/*
 	 * if this is a directory then index_cnt is the counter for the index
 	 * number for new files that are created
@@ -128,14 +125,8 @@ struct btrfs_inode {
 	 */
 	u64 last_unlink_trans;
 
-	/*
-	 * Counters to keep track of the number of extent item's we may use due
-	 * to delalloc and such.  outstanding_extents is the number of extent
-	 * items we think we'll end up using, and reserved_extents is the number
-	 * of extent items we've reserved metadata for.
-	 */
-	atomic_t outstanding_extents;
-	atomic_t reserved_extents;
+	/* flags field from the on disk inode */
+	u32 flags;
 
 	/*
 	 * ordered_data_close is set by truncate when a file that used
@@ -151,12 +142,21 @@ struct btrfs_inode {
 	unsigned orphan_meta_reserved:1;
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
-
 	/*
 	 * always compress this one file
 	 */
 	unsigned force_compress:4;
 
+	/*
+	 * Counters to keep track of the number of extent item's we may use due
+	 * to delalloc and such.  outstanding_extents is the number of extent
+	 * items we think we'll end up using, and reserved_extents is the number
+	 * of extent items we've reserved metadata for.
+	 */
+	spinlock_t extents_count_lock;
+	unsigned outstanding_extents;
+	unsigned reserved_extents;
+
 	struct btrfs_delayed_node *delayed_node;
 
 	struct inode vfs_inode;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be02cae..3ba4d5f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2133,7 +2133,7 @@ static inline bool btrfs_mixed_space_info(struct btrfs_space_info *space_info)
 
 /* extent-tree.c */
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
-						 int num_items)
+						 unsigned num_items)
 {
 	return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
 		3 * num_items;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3e52b85..65a721c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3952,13 +3952,35 @@ static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes)
 	return num_bytes >>= 3;
 }
 
+static unsigned drop_outstanding_extent(struct inode *inode)
+{
+	unsigned dropped_extents = 0;
+
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents--;
+
+	/*
+	 * If we have more or the same amount of outsanding extents than we have
+	 * reserved then we need to leave the reserved extents count alone.
+	 */
+	if (BTRFS_I(inode)->outstanding_extents >=
+	    BTRFS_I(inode)->reserved_extents)
+		goto out;
+
+	dropped_extents = BTRFS_I(inode)->reserved_extents -
+		BTRFS_I(inode)->outstanding_extents;
+	BTRFS_I(inode)->reserved_extents -= dropped_extents;
+out:
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+	return dropped_extents;
+}
+
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_block_rsv *block_rsv = &root->fs_info->delalloc_block_rsv;
 	u64 to_reserve;
-	int nr_extents;
-	int reserved_extents;
+	unsigned nr_extents;
 	int ret;
 
 	if (btrfs_transaction_in_commit(root->fs_info))
@@ -3966,24 +3988,31 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 
 	num_bytes = ALIGN(num_bytes, root->sectorsize);
 
-	nr_extents = atomic_read(&BTRFS_I(inode)->outstanding_extents) + 1;
-	reserved_extents = atomic_read(&BTRFS_I(inode)->reserved_extents);
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents++;
 
-	if (nr_extents > reserved_extents) {
-		nr_extents -= reserved_extents;
+	if (BTRFS_I(inode)->outstanding_extents >
+	    BTRFS_I(inode)->reserved_extents) {
+		nr_extents = BTRFS_I(inode)->outstanding_extents -
+			BTRFS_I(inode)->reserved_extents;
 		to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
 	} else {
 		nr_extents = 0;
 		to_reserve = 0;
 	}
+	BTRFS_I(inode)->reserved_extents += nr_extents;
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);
 
 	to_reserve += calc_csum_metadata_size(inode, num_bytes);
 	ret = reserve_metadata_bytes(NULL, root, block_rsv, to_reserve, 1);
-	if (ret)
+	if (ret) {
+		/*
+		 * We don't need the return value since our reservation failed,
+		 * we just need to clean up our counter.
+		 */
+		drop_outstanding_extent(inode);
 		return ret;
-
-	atomic_add(nr_extents, &BTRFS_I(inode)->reserved_extents);
-	atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+	}
 
 	block_rsv_add_bytes(block_rsv, to_reserve, 1);
 
@@ -3997,35 +4026,14 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 to_free;
-	int nr_extents;
-	int reserved_extents;
+	unsigned dropped;
 
 	num_bytes = ALIGN(num_bytes, root->sectorsize);
-	atomic_dec(&BTRFS_I(inode)->outstanding_extents);
-	WARN_ON(atomic_read(&BTRFS_I(inode)->outstanding_extents) < 0);
-
-	reserved_extents = atomic_read(&BTRFS_I(inode)->reserved_extents);
-	do {
-		int old, new;
-
-		nr_extents = atomic_read(&BTRFS_I(inode)->outstanding_extents);
-		if (nr_extents >= reserved_extents) {
-			nr_extents = 0;
-			break;
-		}
-		old = reserved_extents;
-		nr_extents = reserved_extents - nr_extents;
-		new = reserved_extents - nr_extents;
-		old = atomic_cmpxchg(&BTRFS_I(inode)->reserved_extents,
-				     reserved_extents, new);
-		if (likely(old == reserved_extents))
-			break;
-		reserved_extents = old;
-	} while (1);
+	dropped = drop_outstanding_extent(inode);
 
 	to_free = calc_csum_metadata_size(inode, num_bytes);
-	if (nr_extents > 0)
-		to_free += btrfs_calc_trans_metadata_size(root, nr_extents);
+	if (dropped > 0)
+		to_free += btrfs_calc_trans_metadata_size(root, dropped);
 
 	btrfs_block_rsv_release(root, &root->fs_info->delalloc_block_rsv,
 				to_free);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3c8c435..0997526 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1239,9 +1239,11 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		 * managed to copy.
 		 */
 		if (num_pages > dirty_pages) {
-			if (copied > 0)
-				atomic_inc(
-					&BTRFS_I(inode)->outstanding_extents);
+			if (copied > 0) {
+				spin_lock(&BTRFS_I(inode)->extents_count_lock);
+				BTRFS_I(inode)->outstanding_extents++;
+				spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+			}
 			btrfs_delalloc_release_space(inode,
 					(num_pages - dirty_pages) <<
 					PAGE_CACHE_SHIFT);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index faf516e..3a2be0c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1298,7 +1298,9 @@ static int btrfs_split_extent_hook(struct inode *inode,
 	if (!(orig->state & EXTENT_DELALLOC))
 		return 0;
 
-	atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents++;
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);
 	return 0;
 }
 
@@ -1316,7 +1318,9 @@ static int btrfs_merge_extent_hook(struct inode *inode,
 	if (!(other->state & EXTENT_DELALLOC))
 		return 0;
 
-	atomic_dec(&BTRFS_I(inode)->outstanding_extents);
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents--;
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);
 	return 0;
 }
 
@@ -1339,10 +1343,13 @@ static int btrfs_set_bit_hook(struct inode *inode,
 		u64 len = state->end + 1 - state->start;
 		bool do_list = !is_free_space_inode(root, inode);
 
-		if (*bits & EXTENT_FIRST_DELALLOC)
+		if (*bits & EXTENT_FIRST_DELALLOC) {
 			*bits &= ~EXTENT_FIRST_DELALLOC;
-		else
-			atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+		} else {
+			spin_lock(&BTRFS_I(inode)->extents_count_lock);
+			BTRFS_I(inode)->outstanding_extents++;
+			spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+		}
 
 		spin_lock(&root->fs_info->delalloc_lock);
 		BTRFS_I(inode)->delalloc_bytes += len;
@@ -1372,10 +1379,13 @@ static int btrfs_clear_bit_hook(struct inode *inode,
 		u64 len = state->end + 1 - state->start;
 		bool do_list = !is_free_space_inode(root, inode);
 
-		if (*bits & EXTENT_FIRST_DELALLOC)
+		if (*bits & EXTENT_FIRST_DELALLOC) {
 			*bits &= ~EXTENT_FIRST_DELALLOC;
-		else if (!(*bits & EXTENT_DO_ACCOUNTING))
-			atomic_dec(&BTRFS_I(inode)->outstanding_extents);
+		} else if (!(*bits & EXTENT_DO_ACCOUNTING)) {
+			spin_lock(&BTRFS_I(inode)->extents_count_lock);
+			BTRFS_I(inode)->outstanding_extents--;
+			spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+		}
 
 		if (*bits & EXTENT_DO_ACCOUNTING)
 			btrfs_delalloc_release_metadata(inode, len);
@@ -6777,8 +6787,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->index_cnt = (u64)-1;
 	ei->last_unlink_trans = 0;
 
-	atomic_set(&ei->outstanding_extents, 0);
-	atomic_set(&ei->reserved_extents, 0);
+	spin_lock_init(&ei->extents_count_lock);
+	ei->outstanding_extents = 0;
+	ei->reserved_extents = 0;
 
 	ei->ordered_data_close = 0;
 	ei->orphan_meta_reserved = 0;
@@ -6816,8 +6827,8 @@ void btrfs_destroy_inode(struct inode *inode)
 
 	WARN_ON(!list_empty(&inode->i_dentry));
 	WARN_ON(inode->i_data.nrpages);
-	WARN_ON(atomic_read(&BTRFS_I(inode)->outstanding_extents));
-	WARN_ON(atomic_read(&BTRFS_I(inode)->reserved_extents));
+	WARN_ON(BTRFS_I(inode)->outstanding_extents);
+	WARN_ON(BTRFS_I(inode)->reserved_extents);
 
 	/*
 	 * This can happen where we create an inode, but somebody else also
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 09c9a8d..42f4487 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -938,7 +938,9 @@ again:
 			  GFP_NOFS);
 
 	if (i_done != num_pages) {
-		atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+		spin_lock(&BTRFS_I(inode)->extents_count_lock);
+		BTRFS_I(inode)->outstanding_extents++;
+		spin_unlock(&BTRFS_I(inode)->extents_count_lock);
 		btrfs_delalloc_release_space(inode,
 				     (num_pages - i_done) << PAGE_CACHE_SHIFT);
 	}

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-07-13 14:56       ` Josef Bacik
@ 2011-07-14  7:27           ` Christian Brunner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brunner @ 2011-07-14  7:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: miaox, linux-btrfs, ceph-devel

2011/7/13 Josef Bacik <josef@redhat.com>:
> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>> 2011/6/7 Josef Bacik <josef@redhat.com>:
>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>> I got a lot of these when running stress.sh on my test box
>>>>>
>>>>>
>>>>>
>>>>> This is because use_block_rsv() is having to do a
>>>>> reserve_metadata_bytes(), which shouldn't happen as we should hav=
e
>>>>> reserved enough space for those operations to complete. =A0This i=
s
>>>>> happening because use_block_rsv() will call get_block_rsv(), whic=
h if
>>>>> root->ref_cows is set (which is the case on all fs roots) we will=
 use
>>>>> trans->block_rsv, which will only have what the current transacti=
on
>>>>> starter had reserved.
>>>>>
>>>>> What needs to be done instead is we need to have a block reserve =
that
>>>>> any reservation that is done at create time for these inodes is m=
igrated
>>>>> to this special reserve, and then when you run the delayed inode =
items
>>>>> stuff you set trans->block_rsv to the special block reserve so th=
e
>>>>> accounting is all done properly.
>>>>>
>>>>> This is just off the top of my head, there may be a better way to=
 do it,
>>>>> I've not actually looked that the delayed inode code at all.
>>>>>
>>>>> I would do this myself but I have a ever increasing list of shit =
to do
>>>>> so will somebody pick this up and fix it please? =A0Thanks,
>>>>
>>>> Sorry, it's my miss.
>>>> I forgot to set trans->block_rsv to global_block_rsv, since we hav=
e migrated
>>>> the space from trans_block_rsv to global_block_rsv.
>>>>
>>>> I'll fix it soon.
>>>>
>>>
>>> There is another problem, we're failing xfstest 204. =A0I tried mak=
ing
>>> reserve_metadata_bytes commit the transaction regardless of whether=
 or
>>> not there were pinned bytes but the test just hung there. =A0Usuall=
y it
>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes=
=2E
>>> 204 just creates a crap ton of files, which is what is killing us.
>>> There needs to be a way to start flushing delayed inode items so we=
 can
>>> reclaim the space they are holding onto so we don't get enospc, and=
 it
>>> needs to be better than just committing the transaction because tha=
t is
>>> dog slow. =A0Thanks,
>>>
>>> Josef
>>
>> Is there a solution for this?
>>
>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>> (except the pluging). When starting a ceph rebuild on the btrfs
>> volumes I get a lot of warnings from block_rsv_use_bytes in
>> use_block_rsv:
>>
>
> Ok I think I've got this nailed down. =A0Will you run with this patch=
 and make sure the warnings go away? =A0Thanks,

I'm sorry, I'm still getting a lot of warnings like the one below.

I've also noticed, that I'm not getting these messages when the
free_space_cache is disabled.

Christian

[  697.398097] ------------[ cut here ]------------
[  697.398109] WARNING: at fs/btrfs/extent-tree.c:5693
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[  697.398111] Hardware name: ProLiant DL180 G6
[  697.398112] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[  697.398122] Pid: 6591, comm: btrfs-freespace Tainted: P        W
3.0.0-1.fits.1.el6.x86_64 #1
[  697.398124] Call Trace:
[  697.398128]  [<ffffffff810630af>] warn_slowpath_common+0x7f/0xc0
[  697.398131]  [<ffffffff8106310a>] warn_slowpath_null+0x1a/0x20
[  697.398142]  [<ffffffffa022cb88>] btrfs_alloc_free_block+0x1f8/0x360=
 [btrfs]
[  697.398156]  [<ffffffffa025ae08>] ? read_extent_buffer+0xd8/0x1d0 [b=
trfs]
[  697.398316]  [<ffffffffa021d112>] split_leaf+0x142/0x8c0 [btrfs]
[  697.398325]  [<ffffffffa021629b>] ? generic_bin_search+0x19b/0x210 [=
btrfs]
[  697.398334]  [<ffffffffa0218a1a>] ? btrfs_leaf_free_space+0x8a/0xe0 =
[btrfs]
[  697.398344]  [<ffffffffa021df63>] btrfs_search_slot+0x6d3/0x7a0 [btr=
fs]
[  697.398355]  [<ffffffffa0230942>] btrfs_csum_file_blocks+0x632/0x830=
 [btrfs]
[  697.398369]  [<ffffffffa025c03a>] ? clear_extent_bit+0x17a/0x440 [bt=
rfs]
[  697.398382]  [<ffffffffa023c009>] add_pending_csums+0x49/0x70 [btrfs=
]
[  697.398395]  [<ffffffffa023ef5d>] btrfs_finish_ordered_io+0x22d/0x36=
0 [btrfs]
[  697.398408]  [<ffffffffa023f0dc>]
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[  697.398422]  [<ffffffffa025c4fb>]
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[  697.398425]  [<ffffffff81558b5b>] ? schedule_timeout+0x17b/0x2e0
[  697.398436]  [<ffffffffa02336d9>] ? end_workqueue_fn+0xe9/0x130 [btr=
fs]
[  697.398439]  [<ffffffff8118f24d>] bio_endio+0x1d/0x40
[  697.398451]  [<ffffffffa02336e4>] end_workqueue_fn+0xf4/0x130 [btrfs=
]
[  697.398464]  [<ffffffffa02671de>] worker_loop+0x13e/0x540 [btrfs]
[  697.398477]  [<ffffffffa02670a0>] ? btrfs_queue_worker+0x2d0/0x2d0 [=
btrfs]
[  697.398490]  [<ffffffffa02670a0>] ? btrfs_queue_worker+0x2d0/0x2d0 [=
btrfs]
[  697.398493]  [<ffffffff81085896>] kthread+0x96/0xa0
[  697.398496]  [<ffffffff81563844>] kernel_thread_helper+0x4/0x10
[  697.398499]  [<ffffffff81085800>] ? kthread_worker_fn+0x1a0/0x1a0
[  697.398502]  [<ffffffff81563840>] ? gs_change+0x13/0x13
[  697.398503] ---[ end trace 8c77269b0de3f0fb ]---
[  697.432225] ------------[ cut here ]------------
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
@ 2011-07-14  7:27           ` Christian Brunner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brunner @ 2011-07-14  7:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: miaox, linux-btrfs, ceph-devel

2011/7/13 Josef Bacik <josef@redhat.com>:
> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>> 2011/6/7 Josef Bacik <josef@redhat.com>:
>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>> I got a lot of these when running stress.sh on my test box
>>>>>
>>>>>
>>>>>
>>>>> This is because use_block_rsv() is having to do a
>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>> reserved enough space for those operations to complete.  This is
>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>> trans->block_rsv, which will only have what the current transaction
>>>>> starter had reserved.
>>>>>
>>>>> What needs to be done instead is we need to have a block reserve that
>>>>> any reservation that is done at create time for these inodes is migrated
>>>>> to this special reserve, and then when you run the delayed inode items
>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>> accounting is all done properly.
>>>>>
>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>> I've not actually looked that the delayed inode code at all.
>>>>>
>>>>> I would do this myself but I have a ever increasing list of shit to do
>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>
>>>> Sorry, it's my miss.
>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>>> the space from trans_block_rsv to global_block_rsv.
>>>>
>>>> I'll fix it soon.
>>>>
>>>
>>> There is another problem, we're failing xfstest 204.  I tried making
>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>> not there were pinned bytes but the test just hung there.  Usually it
>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>> 204 just creates a crap ton of files, which is what is killing us.
>>> There needs to be a way to start flushing delayed inode items so we can
>>> reclaim the space they are holding onto so we don't get enospc, and it
>>> needs to be better than just committing the transaction because that is
>>> dog slow.  Thanks,
>>>
>>> Josef
>>
>> Is there a solution for this?
>>
>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>> (except the pluging). When starting a ceph rebuild on the btrfs
>> volumes I get a lot of warnings from block_rsv_use_bytes in
>> use_block_rsv:
>>
>
> Ok I think I've got this nailed down.  Will you run with this patch and make sure the warnings go away?  Thanks,

I'm sorry, I'm still getting a lot of warnings like the one below.

I've also noticed, that I'm not getting these messages when the
free_space_cache is disabled.

Christian

[  697.398097] ------------[ cut here ]------------
[  697.398109] WARNING: at fs/btrfs/extent-tree.c:5693
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[  697.398111] Hardware name: ProLiant DL180 G6
[  697.398112] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[  697.398122] Pid: 6591, comm: btrfs-freespace Tainted: P        W
3.0.0-1.fits.1.el6.x86_64 #1
[  697.398124] Call Trace:
[  697.398128]  [<ffffffff810630af>] warn_slowpath_common+0x7f/0xc0
[  697.398131]  [<ffffffff8106310a>] warn_slowpath_null+0x1a/0x20
[  697.398142]  [<ffffffffa022cb88>] btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[  697.398156]  [<ffffffffa025ae08>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[  697.398316]  [<ffffffffa021d112>] split_leaf+0x142/0x8c0 [btrfs]
[  697.398325]  [<ffffffffa021629b>] ? generic_bin_search+0x19b/0x210 [btrfs]
[  697.398334]  [<ffffffffa0218a1a>] ? btrfs_leaf_free_space+0x8a/0xe0 [btrfs]
[  697.398344]  [<ffffffffa021df63>] btrfs_search_slot+0x6d3/0x7a0 [btrfs]
[  697.398355]  [<ffffffffa0230942>] btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[  697.398369]  [<ffffffffa025c03a>] ? clear_extent_bit+0x17a/0x440 [btrfs]
[  697.398382]  [<ffffffffa023c009>] add_pending_csums+0x49/0x70 [btrfs]
[  697.398395]  [<ffffffffa023ef5d>] btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[  697.398408]  [<ffffffffa023f0dc>]
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[  697.398422]  [<ffffffffa025c4fb>]
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[  697.398425]  [<ffffffff81558b5b>] ? schedule_timeout+0x17b/0x2e0
[  697.398436]  [<ffffffffa02336d9>] ? end_workqueue_fn+0xe9/0x130 [btrfs]
[  697.398439]  [<ffffffff8118f24d>] bio_endio+0x1d/0x40
[  697.398451]  [<ffffffffa02336e4>] end_workqueue_fn+0xf4/0x130 [btrfs]
[  697.398464]  [<ffffffffa02671de>] worker_loop+0x13e/0x540 [btrfs]
[  697.398477]  [<ffffffffa02670a0>] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[  697.398490]  [<ffffffffa02670a0>] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[  697.398493]  [<ffffffff81085896>] kthread+0x96/0xa0
[  697.398496]  [<ffffffff81563844>] kernel_thread_helper+0x4/0x10
[  697.398499]  [<ffffffff81085800>] ? kthread_worker_fn+0x1a0/0x1a0
[  697.398502]  [<ffffffff81563840>] ? gs_change+0x13/0x13
[  697.398503] ---[ end trace 8c77269b0de3f0fb ]---
[  697.432225] ------------[ cut here ]------------
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-07-14  7:27           ` Christian Brunner
  (?)
@ 2011-07-14 15:53           ` Josef Bacik
  -1 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2011-07-14 15:53 UTC (permalink / raw)
  To: chb; +Cc: miaox, linux-btrfs, ceph-devel

On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik <josef@redhat.com>:
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik <josef@redhat.com>:
>>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>>> I got a lot of these when running stress.sh on my test box
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is because use_block_rsv() is having to do a
>>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>>> reserved enough space for those operations to complete.  This is
>>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>>> trans->block_rsv, which will only have what the current transaction
>>>>>> starter had reserved.
>>>>>>
>>>>>> What needs to be done instead is we need to have a block reserve that
>>>>>> any reservation that is done at create time for these inodes is migrated
>>>>>> to this special reserve, and then when you run the delayed inode items
>>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>>> accounting is all done properly.
>>>>>>
>>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>>> I've not actually looked that the delayed inode code at all.
>>>>>>
>>>>>> I would do this myself but I have a ever increasing list of shit to do
>>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>>
>>>>> Sorry, it's my miss.
>>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>>>> the space from trans_block_rsv to global_block_rsv.
>>>>>
>>>>> I'll fix it soon.
>>>>>
>>>>
>>>> There is another problem, we're failing xfstest 204.  I tried making
>>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>>> not there were pinned bytes but the test just hung there.  Usually it
>>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>>> 204 just creates a crap ton of files, which is what is killing us.
>>>> There needs to be a way to start flushing delayed inode items so we can
>>>> reclaim the space they are holding onto so we don't get enospc, and it
>>>> needs to be better than just committing the transaction because that is
>>>> dog slow.  Thanks,
>>>>
>>>> Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed, that I'm not getting these messages when the
> free_space_cache is disabled.
> 
>

Ok I see what's wrong, our checksum calculation is completely bogus.
I'm in the middle of something big so I can't give you a nice clean
patch, so if you can just go into extent-tree.c and replace
calc_csum_metadata_size with this you should be good to go

static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes)
{
        struct btrfs_root *root = BTRFS_I(inode)->root;
        int num_leaves;
        int num_csums;
        u16 csum_size =
                btrfs_super_csum_size(&root->fs_info->super_copy);

        num_csums = (int)div64_u64(num_bytes, root->sectorsize);
        num_leaves = (int)((num_csums * csum_size) / root->leafsize);

        return btrfs_calc_trans_metadata_size(root, num_leaves);
}


Thanks,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-07-14  7:27           ` Christian Brunner
  (?)
  (?)
@ 2011-07-14 17:57           ` Josef Bacik
  -1 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2011-07-14 17:57 UTC (permalink / raw)
  To: chb; +Cc: miaox, linux-btrfs, ceph-devel

On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik <josef@redhat.com>:
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik <josef@redhat.com>:
>>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>>> I got a lot of these when running stress.sh on my test box
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is because use_block_rsv() is having to do a
>>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>>> reserved enough space for those operations to complete.  This is
>>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>>> trans->block_rsv, which will only have what the current transaction
>>>>>> starter had reserved.
>>>>>>
>>>>>> What needs to be done instead is we need to have a block reserve that
>>>>>> any reservation that is done at create time for these inodes is migrated
>>>>>> to this special reserve, and then when you run the delayed inode items
>>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>>> accounting is all done properly.
>>>>>>
>>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>>> I've not actually looked that the delayed inode code at all.
>>>>>>
>>>>>> I would do this myself but I have a ever increasing list of shit to do
>>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>>
>>>>> Sorry, it's my miss.
>>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>>>> the space from trans_block_rsv to global_block_rsv.
>>>>>
>>>>> I'll fix it soon.
>>>>>
>>>>
>>>> There is another problem, we're failing xfstest 204.  I tried making
>>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>>> not there were pinned bytes but the test just hung there.  Usually it
>>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>>> 204 just creates a crap ton of files, which is what is killing us.
>>>> There needs to be a way to start flushing delayed inode items so we can
>>>> reclaim the space they are holding onto so we don't get enospc, and it
>>>> needs to be better than just committing the transaction because that is
>>>> dog slow.  Thanks,
>>>>
>>>> Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed, that I'm not getting these messages when the
> free_space_cache is disabled.
> 
>

Actually scratch that last note, it's wrong.  I'll send you an updated
patch when I've got this mess all sorted out.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Delayed inode operations not doing the right thing with enospc
  2011-07-14  7:27           ` Christian Brunner
                             ` (2 preceding siblings ...)
  (?)
@ 2011-07-14 21:12           ` Josef Bacik
  -1 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2011-07-14 21:12 UTC (permalink / raw)
  To: chb; +Cc: miaox, linux-btrfs, ceph-devel

On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik <josef@redhat.com>:
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik <josef@redhat.com>:
>>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>>> I got a lot of these when running stress.sh on my test box
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is because use_block_rsv() is having to do a
>>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>>> reserved enough space for those operations to complete.  This is
>>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>>> trans->block_rsv, which will only have what the current transaction
>>>>>> starter had reserved.
>>>>>>
>>>>>> What needs to be done instead is we need to have a block reserve that
>>>>>> any reservation that is done at create time for these inodes is migrated
>>>>>> to this special reserve, and then when you run the delayed inode items
>>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>>> accounting is all done properly.
>>>>>>
>>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>>> I've not actually looked that the delayed inode code at all.
>>>>>>
>>>>>> I would do this myself but I have a ever increasing list of shit to do
>>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>>
>>>>> Sorry, it's my miss.
>>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>>>> the space from trans_block_rsv to global_block_rsv.
>>>>>
>>>>> I'll fix it soon.
>>>>>
>>>>
>>>> There is another problem, we're failing xfstest 204.  I tried making
>>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>>> not there were pinned bytes but the test just hung there.  Usually it
>>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>>> 204 just creates a crap ton of files, which is what is killing us.
>>>> There needs to be a way to start flushing delayed inode items so we can
>>>> reclaim the space they are holding onto so we don't get enospc, and it
>>>> needs to be better than just committing the transaction because that is
>>>> dog slow.  Thanks,
>>>>
>>>> Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed, that I'm not getting these messages when the
> free_space_cache is disabled.
> 

Ok ditch that previous patch and try this one, it should work.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 52d7eca..2263d29 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -112,9 +112,6 @@ struct btrfs_inode {
 	 */
 	u64 disk_i_size;

-	/* flags field from the on disk inode */
-	u32 flags;
-
 	/*
 	 * if this is a directory then index_cnt is the counter for the index
 	 * number for new files that are created
@@ -128,14 +125,8 @@ struct btrfs_inode {
 	 */
 	u64 last_unlink_trans;

-	/*
-	 * Counters to keep track of the number of extent item's we may use due
-	 * to delalloc and such.  outstanding_extents is the number of extent
-	 * items we think we'll end up using, and reserved_extents is the number
-	 * of extent items we've reserved metadata for.
-	 */
-	atomic_t outstanding_extents;
-	atomic_t reserved_extents;
+	/* flags field from the on disk inode */
+	u32 flags;

 	/*
 	 * ordered_data_close is set by truncate when a file that used
@@ -151,12 +142,21 @@ struct btrfs_inode {
 	unsigned orphan_meta_reserved:1;
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
-
 	/*
 	 * always compress this one file
 	 */
 	unsigned force_compress:4;

+	/*
+	 * Counters to keep track of the number of extent item's we may use due
+	 * to delalloc and such.  outstanding_extents is the number of extent
+	 * items we think we'll end up using, and reserved_extents is the number
+	 * of extent items we've reserved metadata for.
+	 */
+	spinlock_t extents_count_lock;
+	unsigned outstanding_extents;
+	unsigned reserved_extents;
+
 	struct btrfs_delayed_node *delayed_node;

 	struct inode vfs_inode;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be02cae..3ba4d5f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2133,7 +2133,7 @@ static inline bool btrfs_mixed_space_info(struct
btrfs_space_info *space_info)

 /* extent-tree.c */
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
-						 int num_items)
+						 unsigned num_items)
 {
 	return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
 		3 * num_items;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3e52b85..65a721c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3952,13 +3952,35 @@ static u64 calc_csum_metadata_size(struct inode
*inode, u64 num_bytes)
 	return num_bytes >>= 3;
 }

+static unsigned drop_outstanding_extent(struct inode *inode)
+{
+	unsigned dropped_extents = 0;
+
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents--;
+
+	/*
+	 * If we have more or the same amount of outsanding extents than we have
+	 * reserved then we need to leave the reserved extents count alone.
+	 */
+	if (BTRFS_I(inode)->outstanding_extents >=
+	    BTRFS_I(inode)->reserved_extents)
+		goto out;
+
+	dropped_extents = BTRFS_I(inode)->reserved_extents -
+		BTRFS_I(inode)->outstanding_extents;
+	BTRFS_I(inode)->reserved_extents -= dropped_extents;
+out:
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+	return dropped_extents;
+}
+
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_block_rsv *block_rsv = &root->fs_info->delalloc_block_rsv;
 	u64 to_reserve;
-	int nr_extents;
-	int reserved_extents;
+	unsigned nr_extents;
 	int ret;

 	if (btrfs_transaction_in_commit(root->fs_info))
@@ -3966,24 +3988,31 @@ int btrfs_delalloc_reserve_metadata(struct inode
*inode, u64 num_bytes)

 	num_bytes = ALIGN(num_bytes, root->sectorsize);

-	nr_extents = atomic_read(&BTRFS_I(inode)->outstanding_extents) + 1;
-	reserved_extents = atomic_read(&BTRFS_I(inode)->reserved_extents);
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents++;

-	if (nr_extents > reserved_extents) {
-		nr_extents -= reserved_extents;
+	if (BTRFS_I(inode)->outstanding_extents >
+	    BTRFS_I(inode)->reserved_extents) {
+		nr_extents = BTRFS_I(inode)->outstanding_extents -
+			BTRFS_I(inode)->reserved_extents;
 		to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
 	} else {
 		nr_extents = 0;
 		to_reserve = 0;
 	}
+	BTRFS_I(inode)->reserved_extents += nr_extents;
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);

 	to_reserve += calc_csum_metadata_size(inode, num_bytes);
 	ret = reserve_metadata_bytes(NULL, root, block_rsv, to_reserve, 1);
-	if (ret)
+	if (ret) {
+		/*
+		 * We don't need the return value since our reservation failed,
+		 * we just need to clean up our counter.
+		 */
+		drop_outstanding_extent(inode);
 		return ret;
-
-	atomic_add(nr_extents, &BTRFS_I(inode)->reserved_extents);
-	atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+	}

 	block_rsv_add_bytes(block_rsv, to_reserve, 1);

@@ -3997,35 +4026,14 @@ void btrfs_delalloc_release_metadata(struct
inode *inode, u64 num_bytes)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 to_free;
-	int nr_extents;
-	int reserved_extents;
+	unsigned dropped;

 	num_bytes = ALIGN(num_bytes, root->sectorsize);
-	atomic_dec(&BTRFS_I(inode)->outstanding_extents);
-	WARN_ON(atomic_read(&BTRFS_I(inode)->outstanding_extents) < 0);
-
-	reserved_extents = atomic_read(&BTRFS_I(inode)->reserved_extents);
-	do {
-		int old, new;
-
-		nr_extents = atomic_read(&BTRFS_I(inode)->outstanding_extents);
-		if (nr_extents >= reserved_extents) {
-			nr_extents = 0;
-			break;
-		}
-		old = reserved_extents;
-		nr_extents = reserved_extents - nr_extents;
-		new = reserved_extents - nr_extents;
-		old = atomic_cmpxchg(&BTRFS_I(inode)->reserved_extents,
-				     reserved_extents, new);
-		if (likely(old == reserved_extents))
-			break;
-		reserved_extents = old;
-	} while (1);
+	dropped = drop_outstanding_extent(inode);

 	to_free = calc_csum_metadata_size(inode, num_bytes);
-	if (nr_extents > 0)
-		to_free += btrfs_calc_trans_metadata_size(root, nr_extents);
+	if (dropped > 0)
+		to_free += btrfs_calc_trans_metadata_size(root, dropped);

 	btrfs_block_rsv_release(root, &root->fs_info->delalloc_block_rsv,
 				to_free);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3c8c435..0997526 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1239,9 +1239,11 @@ static noinline ssize_t
__btrfs_buffered_write(struct file *file,
 		 * managed to copy.
 		 */
 		if (num_pages > dirty_pages) {
-			if (copied > 0)
-				atomic_inc(
-					&BTRFS_I(inode)->outstanding_extents);
+			if (copied > 0) {
+				spin_lock(&BTRFS_I(inode)->extents_count_lock);
+				BTRFS_I(inode)->outstanding_extents++;
+				spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+			}
 			btrfs_delalloc_release_space(inode,
 					(num_pages - dirty_pages) <<
 					PAGE_CACHE_SHIFT);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index faf516e..3a2be0c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1298,7 +1298,9 @@ static int btrfs_split_extent_hook(struct inode
*inode,
 	if (!(orig->state & EXTENT_DELALLOC))
 		return 0;

-	atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents++;
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);
 	return 0;
 }

@@ -1316,7 +1318,9 @@ static int btrfs_merge_extent_hook(struct inode
*inode,
 	if (!(other->state & EXTENT_DELALLOC))
 		return 0;

-	atomic_dec(&BTRFS_I(inode)->outstanding_extents);
+	spin_lock(&BTRFS_I(inode)->extents_count_lock);
+	BTRFS_I(inode)->outstanding_extents--;
+	spin_unlock(&BTRFS_I(inode)->extents_count_lock);
 	return 0;
 }

@@ -1339,10 +1343,13 @@ static int btrfs_set_bit_hook(struct inode *inode,
 		u64 len = state->end + 1 - state->start;
 		bool do_list = !is_free_space_inode(root, inode);

-		if (*bits & EXTENT_FIRST_DELALLOC)
+		if (*bits & EXTENT_FIRST_DELALLOC) {
 			*bits &= ~EXTENT_FIRST_DELALLOC;
-		else
-			atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+		} else {
+			spin_lock(&BTRFS_I(inode)->extents_count_lock);
+			BTRFS_I(inode)->outstanding_extents++;
+			spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+		}

 		spin_lock(&root->fs_info->delalloc_lock);
 		BTRFS_I(inode)->delalloc_bytes += len;
@@ -1372,10 +1379,13 @@ static int btrfs_clear_bit_hook(struct inode *inode,
 		u64 len = state->end + 1 - state->start;
 		bool do_list = !is_free_space_inode(root, inode);

-		if (*bits & EXTENT_FIRST_DELALLOC)
+		if (*bits & EXTENT_FIRST_DELALLOC) {
 			*bits &= ~EXTENT_FIRST_DELALLOC;
-		else if (!(*bits & EXTENT_DO_ACCOUNTING))
-			atomic_dec(&BTRFS_I(inode)->outstanding_extents);
+		} else if (!(*bits & EXTENT_DO_ACCOUNTING)) {
+			spin_lock(&BTRFS_I(inode)->extents_count_lock);
+			BTRFS_I(inode)->outstanding_extents--;
+			spin_unlock(&BTRFS_I(inode)->extents_count_lock);
+		}

 		if (*bits & EXTENT_DO_ACCOUNTING)
 			btrfs_delalloc_release_metadata(inode, len);
@@ -6777,8 +6787,9 @@ struct inode *btrfs_alloc_inode(struct super_block
*sb)
 	ei->index_cnt = (u64)-1;
 	ei->last_unlink_trans = 0;

-	atomic_set(&ei->outstanding_extents, 0);
-	atomic_set(&ei->reserved_extents, 0);
+	spin_lock_init(&ei->extents_count_lock);
+	ei->outstanding_extents = 0;
+	ei->reserved_extents = 0;

 	ei->ordered_data_close = 0;
 	ei->orphan_meta_reserved = 0;
@@ -6816,8 +6827,8 @@ void btrfs_destroy_inode(struct inode *inode)

 	WARN_ON(!list_empty(&inode->i_dentry));
 	WARN_ON(inode->i_data.nrpages);
-	WARN_ON(atomic_read(&BTRFS_I(inode)->outstanding_extents));
-	WARN_ON(atomic_read(&BTRFS_I(inode)->reserved_extents));
+	WARN_ON(BTRFS_I(inode)->outstanding_extents);
+	WARN_ON(BTRFS_I(inode)->reserved_extents);

 	/*
 	 * This can happen where we create an inode, but somebody else also
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 09c9a8d..42f4487 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -938,7 +938,9 @@ again:
 			  GFP_NOFS);

 	if (i_done != num_pages) {
-		atomic_inc(&BTRFS_I(inode)->outstanding_extents);
+		spin_lock(&BTRFS_I(inode)->extents_count_lock);
+		BTRFS_I(inode)->outstanding_extents++;
+		spin_unlock(&BTRFS_I(inode)->extents_count_lock);
 		btrfs_delalloc_release_space(inode,
 				     (num_pages - i_done) << PAGE_CACHE_SHIFT);
 	}

^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-07-14 21:12 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-03 18:46 Delayed inode operations not doing the right thing with enospc Josef Bacik
2011-06-07  1:39 ` Miao Xie
2011-06-07 13:23   ` Josef Bacik
2011-06-07 21:04   ` Josef Bacik
2011-07-12 15:20     ` Christian Brunner
2011-07-12 15:20       ` Christian Brunner
2011-07-12 15:25       ` Josef Bacik
2011-07-13 14:56       ` Josef Bacik
2011-07-14  7:27         ` Christian Brunner
2011-07-14  7:27           ` Christian Brunner
2011-07-14 15:53           ` Josef Bacik
2011-07-14 17:57           ` Josef Bacik
2011-07-14 21:12           ` Josef Bacik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.