From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from cn.fujitsu.com ([59.151.112.132]:10812 "EHLO heian.cn.fujitsu.com"
	rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1034243AbcIZCeG
	(ORCPT ); Sun, 25 Sep 2016 22:34:06 -0400
Subject: Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero
To: Goldwyn Rodrigues , Goldwyn Rodrigues ,
References: <1474570050-4715-1-git-send-email-rgoldwyn@suse.de>
 <628e1dcd-ddc3-3f37-8a47-c01c912db970@suse.com>
From: Qu Wenruo
Message-ID: <65e6515c-a9f3-d5f9-db6b-3ffd8d97f90e@cn.fujitsu.com>
Date: Mon, 26 Sep 2016 10:33:55 +0800
MIME-Version: 1.0
In-Reply-To: <628e1dcd-ddc3-3f37-8a47-c01c912db970@suse.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:
>
>
> On 09/22/2016 08:06 PM, Qu Wenruo wrote:
>>
>>
>> At 09/23/2016 02:47 AM, Goldwyn Rodrigues wrote:
>>> From: Goldwyn Rodrigues
>>>
>>> While freeing qgroup->reserved resources, we must check
>>> whether the page is already committed to disk or still in memory.
>>> If we don't, the reserved space is freed twice: once while
>>> invalidating the page, and again while freeing the delalloc
>>> range. This results in qgroup->reserved (a u64) going subzero
>>> and wrapping to a very large value, so no further I/O can be
>>> performed.
>>>
>>> This is also expressed in the comments, but not performed.
>>>
>>> Testcase:
>>> SCRATCH_DEV=/dev/vdb
>>> SCRATCH_MNT=/mnt
>>> mkfs.btrfs -f $SCRATCH_DEV
>>> mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
>>> cd $SCRATCH_MNT
>>> btrfs quota enable $SCRATCH_MNT
>>> btrfs subvolume create a
>>> btrfs qgroup limit 50m a $SCRATCH_MNT
>>> sync
>>> for c in {1..15}; do
>>> 	dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file;
>>> done
>>>
>>> sleep 10
>>> sync
>>> sleep 5
>>>
>>> touch $SCRATCH_MNT/a/newfile
>>>
>>> echo "Removing file"
>>> rm $SCRATCH_MNT/a/file
>>>
>>> Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
>>> Signed-off-by: Goldwyn Rodrigues
>>> ---
>>>  fs/btrfs/inode.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index e6811c4..2e2a026 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -8917,7 +8917,8 @@ again:
>>>  	 * 2) Not written to disk
>>>  	 * This means the reserved space should be freed here.
>>>  	 */
>>> -	btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
>>> +	if (PageDirty(page))
>>> +		btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
>>>  	if (!inode_evicting) {
>>>  		clear_extent_bit(tree, page_start, page_end,
>>>  				 EXTENT_LOCKED | EXTENT_DIRTY |
>>>
>> Thanks for the test case.
>>
>> However, I'm afraid the fix may not address the root cause.
>>
>> Here, if the pages are dirty, then the corresponding range is marked
>> EXTENT_QGROUP_RESERVED.
>> Then btrfs_qgroup_free_data() will clear that bit and reduce the number.
>>
>> If the pages are already committed, then the corresponding range won't
>> be marked EXTENT_QGROUP_RESERVED.
>> Later btrfs_qgroup_free_data() won't reduce any bytes, since it only
>> reduces the count when it actually cleared the EXTENT_QGROUP_RESERVED
>> bit.
>>
>> If everything goes well, there is no need to check PageDirty() here, as
>> we have the EXTENT_QGROUP_RESERVED bit for that accounting.
>>
>> So something else is causing the EXTENT_QGROUP_RESERVED bit to go out
>> of sync with the dirty pages.
>> Considering you needed 15 iterations to reproduce the problem, maybe
>> there is some race causing it?
>>
>
> You can have pages marked as not dirty with EXTENT_QGROUP_RESERVED set
> for a truncate operation. Performing dd on the same file truncates the
> file before overwriting, while the pages of the previous writes are
> still in memory and not committed to disk.
>
> truncate_inode_page() -> truncate_complete_page() clears the dirty flag.
> So you can have a case where the EXTENT_QGROUP_RESERVED bit is set
> while the page is not listed as dirty, because the truncate "cleared"
> all the dirty pages.
>

Sorry, I still don't get the point.

Would you please give a call flow showing the timing of dirtying the page
relative to calling btrfs_qgroup_reserve/free/release_data()?

Like:

__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data()    <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()               <- Mark page dirty

[[Timing of btrfs_invalidatepage()]]
About your commit message: "once while invalidating the page, and the
next time while free'ing delalloc."

By "freeing delalloc", did you mean btrfs_qgroup_free_delayed_ref()?

If so, it means the extent goes through a full writeback, and long before
calling btrfs_qgroup_free_delayed_ref(), it will call
btrfs_qgroup_release_data() to clear QGROUP_RESERVED.

So the call flow will be:

__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data()    <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()               <- Mark page dirty

run_delalloc_range()
|- cow_file_range()
   |- extent_clear_unlock_delalloc() <- Clear page dirty

btrfs_finish_ordered_io()
|- insert_reserved_file_extent()
   |- btrfs_qgroup_release_data()    <- Clear QGROUP_RESERVED bit but not
                                        decrease reserved space

run_one_delayed_ref()
|- btrfs_qgroup_free_delayed_ref()   <- Directly decrease reserved space

So the problem seems to be that btrfs_invalidatepage() is called after
run_delalloc_range() but before btrfs_finish_ordered_io().

Did you mean that?
[[About the test case]]
As for the test case, I can't reproduce the problem whether or not I
apply the fix.

Either way, dd just fails after 3 loops, and all later dd runs fail as
well. But I can still remove the file and write new data into the fs.

[[Extra protection for qgroup->reserved]]
As for the underflowed qgroup reserved space, would you mind adding a
warning for that case? Just like what we do for the qgroup excl/rfer
values, so at least an underflow won't block any further writes.

Thanks,
Qu