* [PATCH] qgroup: Prevent qgroup->reserved from going subzero @ 2016-09-22 18:47 Goldwyn Rodrigues 2016-09-23 1:06 ` Qu Wenruo 0 siblings, 1 reply; 15+ messages in thread From: Goldwyn Rodrigues @ 2016-09-22 18:47 UTC (permalink / raw) To: linux-btrfs; +Cc: Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> While free'ing qgroup->reserved resources, we must check if the page is already commmitted to disk or still in memory. If not, the reserve free is doubly accounted, once while invalidating the page, and the next time while free'ing delalloc. This results is qgroup->reserved(u64) going subzero, thus very large value. So, no further I/O can be performed. This is also expressed in the comments, but not performed. Testcase: SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/mnt mkfs.btrfs -f $SCRATCH_DEV mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT cd $SCRATCH_MNT btrfs quota enable $SCRATCH_MNT btrfs subvolume create a btrfs qgroup limit 50m a $SCRATCH_MNT sync for c in {1..15}; do dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; done sleep 10 sync sleep 5 touch $SCRATCH_MNT/a/newfile echo "Removing file" rm $SCRATCH_MNT/a/file Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e6811c4..2e2a026 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8917,7 +8917,8 @@ again: * 2) Not written to disk * This means the reserved space should be freed here. */ - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); + if (PageDirty(page)) + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); if (!inode_evicting) { clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED | EXTENT_DIRTY | -- 2.6.6 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-22 18:47 [PATCH] qgroup: Prevent qgroup->reserved from going subzero Goldwyn Rodrigues @ 2016-09-23 1:06 ` Qu Wenruo 2016-09-23 13:43 ` Goldwyn Rodrigues 0 siblings, 1 reply; 15+ messages in thread From: Qu Wenruo @ 2016-09-23 1:06 UTC (permalink / raw) To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues At 09/23/2016 02:47 AM, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > While free'ing qgroup->reserved resources, we must check > if the page is already commmitted to disk or still in memory. > If not, the reserve free is doubly accounted, once while > invalidating the page, and the next time while free'ing > delalloc. This results is qgroup->reserved(u64) going subzero, > thus very large value. So, no further I/O can be performed. > > This is also expressed in the comments, but not performed. > > Testcase: > SCRATCH_DEV=/dev/vdb > SCRATCH_MNT=/mnt > mkfs.btrfs -f $SCRATCH_DEV > mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT > cd $SCRATCH_MNT > btrfs quota enable $SCRATCH_MNT > btrfs subvolume create a > btrfs qgroup limit 50m a $SCRATCH_MNT > sync > for c in {1..15}; do > dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; > done > > sleep 10 > sync > sleep 5 > > touch $SCRATCH_MNT/a/newfile > > echo "Removing file" > rm $SCRATCH_MNT/a/file > > Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> > --- > fs/btrfs/inode.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index e6811c4..2e2a026 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -8917,7 +8917,8 @@ again: > * 2) Not written to disk > * This means the reserved space should be freed here. > */ > - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); > + if (PageDirty(page)) > + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); > if (!inode_evicting) { > clear_extent_bit(tree, page_start, page_end, > EXTENT_LOCKED | EXTENT_DIRTY | > Thanks for the test case. However for the fix, I'm afraid it may not be the root cause. Here, if the pages are dirty, then corresponding range is marked EXTENT_QGROUP_RESERVED. Then btrfs_qgroup_free_data() will clear that bit and reduce the number. If the pages are already committed, then corresponding range won't be marked EXTENT_QGROUP_RESERVED. Later btrfs_qgroup_free_data() won't reduce any bytes, since it will only reduce the bytes if it cleared EXTENT_QGROUP_RESERVED bit. If everything goes well there is no need to check PageDirty() here, as we have EXTENT_QGROUP_RESERVED bit for that accounting. So there is some other thing causing EXTENT_QGROUP_RESERVED bit out of sync with dirty pages. Considering you did it 15 times to reproduce the problem, maybe there is some race causing the problem? Thanks, Qu ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-23 1:06 ` Qu Wenruo @ 2016-09-23 13:43 ` Goldwyn Rodrigues 2016-09-26 2:33 ` Qu Wenruo 0 siblings, 1 reply; 15+ messages in thread From: Goldwyn Rodrigues @ 2016-09-23 13:43 UTC (permalink / raw) To: Qu Wenruo, Goldwyn Rodrigues, linux-btrfs On 09/22/2016 08:06 PM, Qu Wenruo wrote: > > > At 09/23/2016 02:47 AM, Goldwyn Rodrigues wrote: >> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >> >> While free'ing qgroup->reserved resources, we must check >> if the page is already commmitted to disk or still in memory. >> If not, the reserve free is doubly accounted, once while >> invalidating the page, and the next time while free'ing >> delalloc. This results is qgroup->reserved(u64) going subzero, >> thus very large value. So, no further I/O can be performed. >> >> This is also expressed in the comments, but not performed. >> >> Testcase: >> SCRATCH_DEV=/dev/vdb >> SCRATCH_MNT=/mnt >> mkfs.btrfs -f $SCRATCH_DEV >> mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT >> cd $SCRATCH_MNT >> btrfs quota enable $SCRATCH_MNT >> btrfs subvolume create a >> btrfs qgroup limit 50m a $SCRATCH_MNT >> sync >> for c in {1..15}; do >> dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; >> done >> >> sleep 10 >> sync >> sleep 5 >> >> touch $SCRATCH_MNT/a/newfile >> >> echo "Removing file" >> rm $SCRATCH_MNT/a/file >> >> Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") >> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> >> --- >> fs/btrfs/inode.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index e6811c4..2e2a026 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -8917,7 +8917,8 @@ again: >> * 2) Not written to disk >> * This means the reserved space should be freed here. >> */ >> - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); >> + if (PageDirty(page)) >> + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); >> if (!inode_evicting) { >> clear_extent_bit(tree, page_start, page_end, >> EXTENT_LOCKED | EXTENT_DIRTY | >> > Thanks for the test case. > > However for the fix, I'm afraid it may not be the root cause. > > Here, if the pages are dirty, then corresponding range is marked > EXTENT_QGROUP_RESERVED. > Then btrfs_qgroup_free_data() will clear that bit and reduce the number. > > If the pages are already committed, then corresponding range won't be > marked EXTENT_QGROUP_RESERVED. > Later btrfs_qgroup_free_data() won't reduce any bytes, since it will > only reduce the bytes if it cleared EXTENT_QGROUP_RESERVED bit. > > If everything goes well there is no need to check PageDirty() here, as > we have EXTENT_QGROUP_RESERVED bit for that accounting. > > So there is some other thing causing EXTENT_QGROUP_RESERVED bit out of > sync with dirty pages. > Considering you did it 15 times to reproduce the problem, maybe there is > some race causing the problem? > You can have pages marked as not dirty with EXTENT_QGROUP_RESERVED set for a truncate operation. Performing dd on the same file, truncates the file before overwriting, while the pages of the previous writes are still in memory and not committed to disk. truncate_inode_page() -> truncate_complete_page() clears the dirty flag. So, you can have a case where the EXTENT_QGROUP_RESERVED bit is set while the page is not listed as dirty because the truncate "cleared" all the dirty pages. -- Goldwyn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-23 13:43 ` Goldwyn Rodrigues @ 2016-09-26 2:33 ` Qu Wenruo 2016-09-26 14:31 ` Goldwyn Rodrigues 0 siblings, 1 reply; 15+ messages in thread From: Qu Wenruo @ 2016-09-26 2:33 UTC (permalink / raw) To: Goldwyn Rodrigues, Goldwyn Rodrigues, linux-btrfs At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote: > > > On 09/22/2016 08:06 PM, Qu Wenruo wrote: >> >> >> At 09/23/2016 02:47 AM, Goldwyn Rodrigues wrote: >>> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >>> >>> While free'ing qgroup->reserved resources, we must check >>> if the page is already commmitted to disk or still in memory. >>> If not, the reserve free is doubly accounted, once while >>> invalidating the page, and the next time while free'ing >>> delalloc. This results is qgroup->reserved(u64) going subzero, >>> thus very large value. So, no further I/O can be performed. >>> >>> This is also expressed in the comments, but not performed. >>> >>> Testcase: >>> SCRATCH_DEV=/dev/vdb >>> SCRATCH_MNT=/mnt >>> mkfs.btrfs -f $SCRATCH_DEV >>> mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT >>> cd $SCRATCH_MNT >>> btrfs quota enable $SCRATCH_MNT >>> btrfs subvolume create a >>> btrfs qgroup limit 50m a $SCRATCH_MNT >>> sync >>> for c in {1..15}; do >>> dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; >>> done >>> >>> sleep 10 >>> sync >>> sleep 5 >>> >>> touch $SCRATCH_MNT/a/newfile >>> >>> echo "Removing file" >>> rm $SCRATCH_MNT/a/file >>> >>> Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") >>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> >>> --- >>> fs/btrfs/inode.c | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index e6811c4..2e2a026 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -8917,7 +8917,8 @@ again: >>> * 2) Not written to disk >>> * This means the reserved space should be freed here. >>> */ >>> - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); >>> + if (PageDirty(page)) >>> + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); >>> if (!inode_evicting) { >>> clear_extent_bit(tree, page_start, page_end, >>> EXTENT_LOCKED | EXTENT_DIRTY | >>> >> Thanks for the test case. >> >> However for the fix, I'm afraid it may not be the root cause. >> >> Here, if the pages are dirty, then corresponding range is marked >> EXTENT_QGROUP_RESERVED. >> Then btrfs_qgroup_free_data() will clear that bit and reduce the number. >> >> If the pages are already committed, then corresponding range won't be >> marked EXTENT_QGROUP_RESERVED. >> Later btrfs_qgroup_free_data() won't reduce any bytes, since it will >> only reduce the bytes if it cleared EXTENT_QGROUP_RESERVED bit. >> >> If everything goes well there is no need to check PageDirty() here, as >> we have EXTENT_QGROUP_RESERVED bit for that accounting. >> >> So there is some other thing causing EXTENT_QGROUP_RESERVED bit out of >> sync with dirty pages. >> Considering you did it 15 times to reproduce the problem, maybe there is >> some race causing the problem? >> > > You can have pages marked as not dirty with EXTENT_QGROUP_RESERVED set > for a truncate operation. Performing dd on the same file, truncates the > file before overwriting, while the pages of the previous writes are > still in memory and not committed to disk. > > truncate_inode_page() -> truncate_complete_page() clears the dirty flag. > So, you can have a case where the EXTENT_QGROUP_RESERVED bit is set > while the page is not listed as dirty because the truncate "cleared" all > the dirty pages. > Sorry I still don't get the point. Would you please give a call flow of the timing dirtying page and calling btrfs_qgroup_reserve/free/release_data()? Like: __btrfs_buffered_write() |- btrfs_check_data_free_space() | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit |- btrfs_dirty_pages() <- Mark page dirty [[Timing of btrfs_invalidatepage()]] About your commit message "once while invalidating the page, and the next time while free'ing delalloc." "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref(). If so, it means one extent goes through full write back, and long before calling btrfs_qgroup_free_delayed_ref(), it will call btrfs_qgroup_release_data() to clear QGROUP_RESERVED. So the call will be: __btrfs_buffered_write() |- btrfs_check_data_free_space() | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit |- btrfs_dirty_pages() <- Mark page dirty <data write back happens> run_delalloc_range() |- cow_file_range() |- extent_clear_unlock_delalloc() <- Clear page dirty <modifying metadata> btrfs_finish_ordered_io() |- insert_reserved_file_extent() |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit but not decrease reserved space <run delayed refs, normally happens in commit_trans> run_one_delyaed_refs() |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space So the problem seems to be, btrfs_invalidatepage() is called after run_delalloc_range() but before btrfs_finish_ordered_io(). Did you mean that? [[About test case]] And for the test case, I can't reproduce the problem no matter if I apply the fix or not. Either way it just fails after 3 loops of dd, and later dd will all fail. But I can still remove the file and write new data into the fs. [[Extra protect about qgroup->reserved]] And for the underflowed qgroup reserve space, would you mind to add warning for that case? Just like what we did in qgroup excl/rfer values, so at least it won't make qgroup blocking any write. Thanks, Qu ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-26 2:33 ` Qu Wenruo @ 2016-09-26 14:31 ` Goldwyn Rodrigues 2016-09-27 3:10 ` Qu Wenruo 0 siblings, 1 reply; 15+ messages in thread From: Goldwyn Rodrigues @ 2016-09-26 14:31 UTC (permalink / raw) To: Qu Wenruo, Goldwyn Rodrigues, linux-btrfs On 09/25/2016 09:33 PM, Qu Wenruo wrote: > > > At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote: >> >> >> On 09/22/2016 08:06 PM, Qu Wenruo wrote: >>> >>> >>> At 09/23/2016 02:47 AM, Goldwyn Rodrigues wrote: >>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >>>> >>>> While free'ing qgroup->reserved resources, we must check >>>> if the page is already commmitted to disk or still in memory. >>>> If not, the reserve free is doubly accounted, once while >>>> invalidating the page, and the next time while free'ing >>>> delalloc. This results is qgroup->reserved(u64) going subzero, >>>> thus very large value. So, no further I/O can be performed. >>>> >>>> This is also expressed in the comments, but not performed. This statement crept in error. >>>> >>>> Testcase: >>>> SCRATCH_DEV=/dev/vdb >>>> SCRATCH_MNT=/mnt >>>> mkfs.btrfs -f $SCRATCH_DEV >>>> mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT >>>> cd $SCRATCH_MNT >>>> btrfs quota enable $SCRATCH_MNT >>>> btrfs subvolume create a >>>> btrfs qgroup limit 50m a $SCRATCH_MNT >>>> sync >>>> for c in {1..15}; do >>>> dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; >>>> done >>>> >>>> sleep 10 >>>> sync >>>> sleep 5 >>>> >>>> touch $SCRATCH_MNT/a/newfile >>>> >>>> echo "Removing file" >>>> rm $SCRATCH_MNT/a/file >>>> >>>> Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") >>>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> >>>> --- >>>> fs/btrfs/inode.c | 3 ++- >>>> 1 file changed, 2 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>> index e6811c4..2e2a026 100644 >>>> --- a/fs/btrfs/inode.c >>>> +++ b/fs/btrfs/inode.c >>>> @@ -8917,7 +8917,8 @@ again: >>>> * 2) Not written to disk >>>> * This means the reserved space should be freed here. >>>> */ >>>> - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); >>>> + if (PageDirty(page)) >>>> + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); >>>> if (!inode_evicting) { >>>> clear_extent_bit(tree, page_start, page_end, >>>> EXTENT_LOCKED | EXTENT_DIRTY | >>>> >>> Thanks for the test case. >>> >>> However for the fix, I'm afraid it may not be the root cause. >>> >>> Here, if the pages are dirty, then corresponding range is marked >>> EXTENT_QGROUP_RESERVED. >>> Then btrfs_qgroup_free_data() will clear that bit and reduce the number. >>> >>> If the pages are already committed, then corresponding range won't be >>> marked EXTENT_QGROUP_RESERVED. >>> Later btrfs_qgroup_free_data() won't reduce any bytes, since it will >>> only reduce the bytes if it cleared EXTENT_QGROUP_RESERVED bit. >>> >>> If everything goes well there is no need to check PageDirty() here, as >>> we have EXTENT_QGROUP_RESERVED bit for that accounting. >>> >>> So there is some other thing causing EXTENT_QGROUP_RESERVED bit out of >>> sync with dirty pages. >>> Considering you did it 15 times to reproduce the problem, maybe there is >>> some race causing the problem? >>> >> >> You can have pages marked as not dirty with EXTENT_QGROUP_RESERVED set >> for a truncate operation. Performing dd on the same file, truncates the >> file before overwriting, while the pages of the previous writes are >> still in memory and not committed to disk. >> >> truncate_inode_page() -> truncate_complete_page() clears the dirty flag. >> So, you can have a case where the EXTENT_QGROUP_RESERVED bit is set >> while the page is not listed as dirty because the truncate "cleared" all >> the dirty pages. >> > > Sorry I still don't get the point. > Would you please give a call flow of the timing dirtying page and > calling btrfs_qgroup_reserve/free/release_data()? > > Like: > __btrfs_buffered_write() > |- btrfs_check_data_free_space() > | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit > |- btrfs_dirty_pages() <- Mark page dirty > > > [[Timing of btrfs_invalidatepage()]] > About your commit message "once while invalidating the page, and the > next time while free'ing delalloc." > "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref(). > > If so, it means one extent goes through full write back, and long before > calling btrfs_qgroup_free_delayed_ref(), it will call > btrfs_qgroup_release_data() to clear QGROUP_RESERVED. > > So the call will be: > __btrfs_buffered_write() > |- btrfs_check_data_free_space() > | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit > |- btrfs_dirty_pages() <- Mark page dirty > > <data write back happens> > run_delalloc_range() > |- cow_file_range() > |- extent_clear_unlock_delalloc() <- Clear page dirty > > <modifying metadata> > > btrfs_finish_ordered_io() > |- insert_reserved_file_extent() > |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit > but not decrease reserved space > > <run delayed refs, normally happens in commit_trans> > run_one_delyaed_refs() > |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space > > > So the problem seems to be, btrfs_invalidatepage() is called after > run_delalloc_range() but before btrfs_finish_ordered_io(). > > Did you mean that? This happens event before a writeback happens. So, here is what is happening. This is in reference, and specific to the test case in the description. Process: dd - first time __btrfs_buffered_write() |- btrfs_check_data_free_space() | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit |- btrfs_dirty_pages() <- Mark page dirty Please note data writeback does _not_ happen/complete. Process: dd - next time, new process sys_open O_TRUNC . |-btrfs_setattr() |-truncate_pagecache() |-truncate_inode_pages_range() |-truncate_inode_page() - Page is cleared of Dirty flag. |-btrfs_invalidatepage(page) |-__btrfs_qgroup_release_data() |-btrfs_qgroup_free_refroot() - qgroup->reserved is reduced by PAGESIZE. Process: sync btrfs_sync_fs() |-btrfs_commit_transaction() |-btrfs_run_delayed_refs() |- qgroup_free_refroot() - Reduces reserved by the size of the extent (in this case, the filesize of dd (first time)), even though some of the PAGESIZEs have been reduced while performing the truncate on the file. I hope that makes it clear. > > [[About test case]] > And for the test case, I can't reproduce the problem no matter if I > apply the fix or not. > > Either way it just fails after 3 loops of dd, and later dd will all fail. > But I can still remove the file and write new data into the fs. > Strange? I can reproduce at every instance of running it. Even on 4.8.0-rc7 > > [[Extra protect about qgroup->reserved]] > And for the underflowed qgroup reserve space, would you mind to add > warning for that case? > Just like what we did in qgroup excl/rfer values, so at least it won't > make qgroup blocking any write. > Oh yes, I wonder why that is not placed there when it is placed in all other location where qgroup->reserved is reduced. Also, this has nothing to do with the comment of two ways of free'ing the qgroups, as suggested in the commit statement. Thats my bad. -- Goldwyn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-26 14:31 ` Goldwyn Rodrigues @ 2016-09-27 3:10 ` Qu Wenruo 2016-09-27 14:04 ` Goldwyn Rodrigues 0 siblings, 1 reply; 15+ messages in thread From: Qu Wenruo @ 2016-09-27 3:10 UTC (permalink / raw) To: Goldwyn Rodrigues, Goldwyn Rodrigues, linux-btrfs At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote: > > > On 09/25/2016 09:33 PM, Qu Wenruo wrote: >> >> >> At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote: >>> >>> [snipped] >> >> Sorry I still don't get the point. >> Would you please give a call flow of the timing dirtying page and >> calling btrfs_qgroup_reserve/free/release_data()? >> >> Like: >> __btrfs_buffered_write() >> |- btrfs_check_data_free_space() >> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >> |- btrfs_dirty_pages() <- Mark page dirty >> >> >> [[Timing of btrfs_invalidatepage()]] >> About your commit message "once while invalidating the page, and the >> next time while free'ing delalloc." >> "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref(). >> >> If so, it means one extent goes through full write back, and long before >> calling btrfs_qgroup_free_delayed_ref(), it will call >> btrfs_qgroup_release_data() to clear QGROUP_RESERVED. >> >> So the call will be: >> __btrfs_buffered_write() >> |- btrfs_check_data_free_space() >> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >> |- btrfs_dirty_pages() <- Mark page dirty >> >> <data write back happens> >> run_delalloc_range() >> |- cow_file_range() >> |- extent_clear_unlock_delalloc() <- Clear page dirty >> >> <modifying metadata> >> >> btrfs_finish_ordered_io() >> |- insert_reserved_file_extent() >> |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit >> but not decrease reserved space >> >> <run delayed refs, normally happens in commit_trans> >> run_one_delyaed_refs() >> |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space >> >> >> So the problem seems to be, btrfs_invalidatepage() is called after >> run_delalloc_range() but before btrfs_finish_ordered_io(). >> >> Did you mean that? > > This happens event before a writeback happens. So, here is what is > happening. This is in reference, and specific to the test case in the > description. The flowchart helps a lot, thanks. But still some flow is still strange. > > Process: dd - first time > __btrfs_buffered_write() > |- btrfs_check_data_free_space() > | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit > |- btrfs_dirty_pages() <- Mark page dirty > > Please note data writeback does _not_ happen/complete. This part is completely fine. > > Process: dd - next time, new process > sys_open O_TRUNC > . > |-btrfs_setattr() > |-truncate_pagecache() > |-truncate_inode_pages_range() > |-truncate_inode_page() - Page is cleared of Dirty flag. > |-btrfs_invalidatepage(page) > |-__btrfs_qgroup_release_data() > |-btrfs_qgroup_free_refroot() - qgroup->reserved is > reduced by PAGESIZE. That's OK too. No problem. Although the __btrfs_qgroup_release_data() is called by btrfs_qgroup_free_data(). Which cleared QGROUP_RESERVED flag. So the call flow should be |-btrfs_setattr() |-truncate_pagecache() |-truncate_inode_pages_range() |-truncate_inode_page() <- Clear page dirty |-btrfs_invalidatepage(page) |-btrfs_qgroup_free_data() <- reduce qgroup->reserved and clear QGROUP_RESERVED After btrfs_setattr(), the new dd write data. So __btrfs_buffered_write() happens again, dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved. So here I don't see any problem buffered_write: Mark dirty Mark QGROUP_RESERVED Increase qgroup->reserved btrfs_setattr: Clear dirty Clear QGROUP_RESERVED Decrease qgroup->reserved All pairs with each other, no leak no underflow. Until the last buffered_write(), which doesn't got truncated. > > > Process: sync > btrfs_sync_fs() > |-btrfs_commit_transaction() > |-btrfs_run_delayed_refs() > |- qgroup_free_refroot() - Reduces reserved by the size of the > extent (in this case, the filesize of dd (first time)), even though some > of the PAGESIZEs have been reduced while performing the truncate on the > file. The delayed_ref belongs to the last extent, which doesn't got truncated. And in that case, that last extent got into disk. run_dealloc_range() <- Write data into disk finish_ordered_io() <- Update metadata |- insert_reserved_file_extents() |- btrfs_alloc_reserved_file_extent() | |- btrfs_add_delayed_data_ref | <This will cause a new delayed_data_ref> |- btrfs_qgroup_release_data() <This will clear QGROURP_RESERVED> Then goes to your btrfs_sync_fs() => qgroup_free_refroot() [[I think I find the problem now]] Between btrfs_alloc_reserved_file_extent() and btrfs_qgroup_release_data(), there is a window that delayed_refs can be executed. Since that d*mn delayed_refs can be run at any time asynchronously. In that case, it will call qgroup_free_refroot() first at another process context, and later btrfs_qgroup_release_data() get schedualed, which will clear QGROUP_RESERVED again, causing the underflow. Although it's not the original cause you reported, would you mind to try the following quick-dirty fix without your fix, to see if it fixes the problem? ------- diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e6811c4..70efa14 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2168,15 +2168,15 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, ins.objectid = disk_bytenr; ins.offset = disk_num_bytes; ins.type = BTRFS_EXTENT_ITEM_KEY; - ret = btrfs_alloc_reserved_file_extent(trans, root, - root->root_key.objectid, - btrfs_ino(inode), file_pos, - ram_bytes, &ins); /* * Release the reserved range from inode dirty range map, as it is * already moved into delayed_ref_head */ btrfs_qgroup_release_data(inode, file_pos, ram_bytes); + ret = btrfs_alloc_reserved_file_extent(trans, root, + root->root_key.objectid, + btrfs_ino(inode), file_pos, + ram_bytes, &ins); out: btrfs_free_path(path); ------- > > I hope that makes it clear. > >> >> [[About test case]] >> And for the test case, I can't reproduce the problem no matter if I >> apply the fix or not. >> >> Either way it just fails after 3 loops of dd, and later dd will all fail. >> But I can still remove the file and write new data into the fs. >> > > Strange? I can reproduce at every instance of running it. Even on 4.8.0-rc7 Maybe related to the memory size. Since you're also using VM, maybe your VM RAM is too small, so dirty pages pressure triggers a commit halfway and cause the race? Thanks, Qu > >> >> [[Extra protect about qgroup->reserved]] >> And for the underflowed qgroup reserve space, would you mind to add >> warning for that case? >> Just like what we did in qgroup excl/rfer values, so at least it won't >> make qgroup blocking any write. >> > > Oh yes, I wonder why that is not placed there when it is placed in all > other location where qgroup->reserved is reduced. > > Also, this has nothing to do with the comment of two ways of free'ing > the qgroups, as suggested in the commit statement. Thats my bad. > ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-27 3:10 ` Qu Wenruo @ 2016-09-27 14:04 ` Goldwyn Rodrigues 2016-09-28 1:44 ` Qu Wenruo 0 siblings, 1 reply; 15+ messages in thread From: Goldwyn Rodrigues @ 2016-09-27 14:04 UTC (permalink / raw) To: Qu Wenruo, Goldwyn Rodrigues, linux-btrfs On 09/26/2016 10:10 PM, Qu Wenruo wrote: > > > At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote: >> >> >> On 09/25/2016 09:33 PM, Qu Wenruo wrote: >>> >>> >>> At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote: >>>> >>>> > [snipped] >>> >>> Sorry I still don't get the point. >>> Would you please give a call flow of the timing dirtying page and >>> calling btrfs_qgroup_reserve/free/release_data()? >>> >>> Like: >>> __btrfs_buffered_write() >>> |- btrfs_check_data_free_space() >>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >>> |- btrfs_dirty_pages() <- Mark page dirty >>> >>> >>> [[Timing of btrfs_invalidatepage()]] >>> About your commit message "once while invalidating the page, and the >>> next time while free'ing delalloc." >>> "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref(). >>> >>> If so, it means one extent goes through full write back, and long before >>> calling btrfs_qgroup_free_delayed_ref(), it will call >>> btrfs_qgroup_release_data() to clear QGROUP_RESERVED. >>> >>> So the call will be: >>> __btrfs_buffered_write() >>> |- btrfs_check_data_free_space() >>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >>> |- btrfs_dirty_pages() <- Mark page dirty >>> >>> <data write back happens> >>> run_delalloc_range() >>> |- cow_file_range() >>> |- extent_clear_unlock_delalloc() <- Clear page dirty >>> >>> <modifying metadata> >>> >>> btrfs_finish_ordered_io() >>> |- insert_reserved_file_extent() >>> |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit >>> but not decrease reserved space >>> >>> <run delayed refs, normally happens in commit_trans> >>> run_one_delyaed_refs() >>> |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space >>> >>> >>> So the problem seems to be, btrfs_invalidatepage() is called after >>> run_delalloc_range() but before btrfs_finish_ordered_io(). >>> >>> Did you mean that? >> >> This happens event before a writeback happens. So, here is what is >> happening. This is in reference, and specific to the test case in the >> description. > > The flowchart helps a lot, thanks. > > But still some flow is still strange. >> >> Process: dd - first time >> __btrfs_buffered_write() >> |- btrfs_check_data_free_space() >> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >> |- btrfs_dirty_pages() <- Mark page dirty >> >> Please note data writeback does _not_ happen/complete. > > This part is completely fine. > >> >> Process: dd - next time, new process >> sys_open O_TRUNC >> . >> |-btrfs_setattr() >> |-truncate_pagecache() >> |-truncate_inode_pages_range() >> |-truncate_inode_page() - Page is cleared of Dirty flag. >> |-btrfs_invalidatepage(page) >> |-__btrfs_qgroup_release_data() >> |-btrfs_qgroup_free_refroot() - qgroup->reserved is >> reduced by PAGESIZE. > > That's OK too. No problem. > Although the __btrfs_qgroup_release_data() is called by > btrfs_qgroup_free_data(). > Which cleared QGROUP_RESERVED flag. > > > So the call flow should be > > |-btrfs_setattr() > |-truncate_pagecache() > |-truncate_inode_pages_range() > |-truncate_inode_page() <- Clear page dirty > |-btrfs_invalidatepage(page) > |-btrfs_qgroup_free_data() <- reduce qgroup->reserved > and clear QGROUP_RESERVED > > After btrfs_setattr(), the new dd write data. > So __btrfs_buffered_write() happens again, > dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved. > > So here I don't see any problem > buffered_write: > Mark dirty > Mark QGROUP_RESERVED > Increase qgroup->reserved > > btrfs_setattr: > Clear dirty > Clear QGROUP_RESERVED > Decrease qgroup->reserved > > All pairs with each other, no leak no underflow. Yes, but the problem is run_one_delayed_ref() |-btrfs_qgroup_free_delayed_ref uses delayed_ref_head->qgroup_reserved to reduce the count. Which function is responsible for decreasing delayed_ref_head->qgroup_reserved when btrfs_invalidatepage() is reducing the count? > > Until the last buffered_write(), which doesn't got truncated. > >> >> >> Process: sync >> btrfs_sync_fs() >> |-btrfs_commit_transaction() >> |-btrfs_run_delayed_refs() >> |- qgroup_free_refroot() - Reduces reserved by the size of the >> extent (in this case, the filesize of dd (first time)), even though some >> of the PAGESIZEs have been reduced while performing the truncate on the >> file. > > The delayed_ref belongs to the last extent, which doesn't got truncated. > > And in that case, that last extent got into disk. > run_dealloc_range() <- Write data into disk > finish_ordered_io() <- Update metadata > |- insert_reserved_file_extents() > |- btrfs_alloc_reserved_file_extent() > | |- btrfs_add_delayed_data_ref > | <This will cause a new delayed_data_ref> > |- btrfs_qgroup_release_data() > <This will clear QGROURP_RESERVED> > > Then goes to your btrfs_sync_fs() => qgroup_free_refroot() > > > > [[I think I find the problem now]] > > Between btrfs_alloc_reserved_file_extent() and > btrfs_qgroup_release_data(), there is a window that delayed_refs can be > executed. > Since that d*mn delayed_refs can be run at any time asynchronously. > > In that case, it will call qgroup_free_refroot() first at another > process context, and later btrfs_qgroup_release_data() get schedualed, > which will clear QGROUP_RESERVED again, causing the underflow. > > Although it's not the original cause you reported, would you mind to try > the following quick-dirty fix without your fix, to see if it fixes the > problem? > > ------- > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index e6811c4..70efa14 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2168,15 +2168,15 @@ static int insert_reserved_file_extent(struct > btrfs_trans_handle *trans, > ins.objectid = disk_bytenr; > ins.offset = disk_num_bytes; > ins.type = BTRFS_EXTENT_ITEM_KEY; > - ret = btrfs_alloc_reserved_file_extent(trans, root, > - root->root_key.objectid, > - btrfs_ino(inode), file_pos, > - ram_bytes, &ins); > /* > * Release the reserved range from inode dirty range map, as it is > * already moved into delayed_ref_head > */ > btrfs_qgroup_release_data(inode, file_pos, ram_bytes); > + ret = btrfs_alloc_reserved_file_extent(trans, root, > + root->root_key.objectid, > + btrfs_ino(inode), file_pos, > + ram_bytes, &ins); > out: > btrfs_free_path(path); > > ------- No, it does not work. > >> >> I hope that makes it clear. >> >>> >>> [[About test case]] >>> And for the test case, I can't reproduce the problem no matter if I >>> apply the fix or not. >>> >>> Either way it just fails after 3 loops of dd, and later dd will all >>> fail. >>> But I can still remove the file and write new data into the fs. >>> >> >> Strange? I can reproduce at every instance of running it. Even on >> 4.8.0-rc7 > > Maybe related to the memory size. > > Since you're also using VM, maybe your VM RAM is too small, so dirty > pages pressure triggers a commit halfway and cause the race? Not quite, this problem happens on a physical machine as well. Besides, this VM is not in memory pressure. I tried it again on 4.8.0-rc8. -- Goldwyn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-27 14:04 ` Goldwyn Rodrigues @ 2016-09-28 1:44 ` Qu Wenruo 2016-09-28 2:19 ` Goldwyn Rodrigues 0 siblings, 1 reply; 15+ messages in thread From: Qu Wenruo @ 2016-09-28 1:44 UTC (permalink / raw) To: Goldwyn Rodrigues, Goldwyn Rodrigues, linux-btrfs At 09/27/2016 10:04 PM, Goldwyn Rodrigues wrote: > > > On 09/26/2016 10:10 PM, Qu Wenruo wrote: >> >> >> At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote: >>> >>> >>> On 09/25/2016 09:33 PM, Qu Wenruo wrote: >>>> >>>> >>>> At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote: >>>>> >>>>> >> [snipped] >>>> >>>> Sorry I still don't get the point. >>>> Would you please give a call flow of the timing dirtying page and >>>> calling btrfs_qgroup_reserve/free/release_data()? >>>> >>>> Like: >>>> __btrfs_buffered_write() >>>> |- btrfs_check_data_free_space() >>>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >>>> |- btrfs_dirty_pages() <- Mark page dirty >>>> >>>> >>>> [[Timing of btrfs_invalidatepage()]] >>>> About your commit message "once while invalidating the page, and the >>>> next time while free'ing delalloc." >>>> "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref(). >>>> >>>> If so, it means one extent goes through full write back, and long before >>>> calling btrfs_qgroup_free_delayed_ref(), it will call >>>> btrfs_qgroup_release_data() to clear QGROUP_RESERVED. >>>> >>>> So the call will be: >>>> __btrfs_buffered_write() >>>> |- btrfs_check_data_free_space() >>>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >>>> |- btrfs_dirty_pages() <- Mark page dirty >>>> >>>> <data write back happens> >>>> run_delalloc_range() >>>> |- cow_file_range() >>>> |- extent_clear_unlock_delalloc() <- Clear page dirty >>>> >>>> <modifying metadata> >>>> >>>> btrfs_finish_ordered_io() >>>> |- insert_reserved_file_extent() >>>> |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit >>>> but not decrease reserved space >>>> >>>> <run delayed refs, normally happens in commit_trans> >>>> run_one_delyaed_refs() >>>> |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space >>>> >>>> >>>> So the problem seems to be, btrfs_invalidatepage() is called after >>>> run_delalloc_range() but before btrfs_finish_ordered_io(). >>>> >>>> Did you mean that? >>> >>> This happens event before a writeback happens. So, here is what is >>> happening. This is in reference, and specific to the test case in the >>> description. >> >> The flowchart helps a lot, thanks. >> >> But still some flow is still strange. >>> >>> Process: dd - first time >>> __btrfs_buffered_write() >>> |- btrfs_check_data_free_space() >>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit >>> |- btrfs_dirty_pages() <- Mark page dirty >>> >>> Please note data writeback does _not_ happen/complete. >> >> This part is completely fine. >> >>> >>> Process: dd - next time, new process >>> sys_open O_TRUNC >>> . >>> |-btrfs_setattr() >>> |-truncate_pagecache() >>> |-truncate_inode_pages_range() >>> |-truncate_inode_page() - Page is cleared of Dirty flag. >>> |-btrfs_invalidatepage(page) >>> |-__btrfs_qgroup_release_data() >>> |-btrfs_qgroup_free_refroot() - qgroup->reserved is >>> reduced by PAGESIZE. >> >> That's OK too. No problem. >> Although the __btrfs_qgroup_release_data() is called by >> btrfs_qgroup_free_data(). >> Which cleared QGROUP_RESERVED flag. >> >> >> So the call flow should be >> >> |-btrfs_setattr() >> |-truncate_pagecache() >> |-truncate_inode_pages_range() >> |-truncate_inode_page() <- Clear page dirty >> |-btrfs_invalidatepage(page) >> |-btrfs_qgroup_free_data() <- reduce qgroup->reserved >> and clear QGROUP_RESERVED >> >> After btrfs_setattr(), the new dd write data. >> So __btrfs_buffered_write() happens again, >> dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved. >> >> So here I don't see any problem >> buffered_write: >> Mark dirty >> Mark QGROUP_RESERVED >> Increase qgroup->reserved >> >> btrfs_setattr: >> Clear dirty >> Clear QGROUP_RESERVED >> Decrease qgroup->reserved >> >> All pairs with each other, no leak no underflow. > > Yes, but the problem is > run_one_delayed_ref() > |-btrfs_qgroup_free_delayed_ref > uses delayed_ref_head->qgroup_reserved to reduce the count. Which > function is responsible for decreasing delayed_ref_head->qgroup_reserved > when btrfs_invalidatepage() is reducing the count? Decreasing qgroup_reserved can happen in two ways: 1) Through qgroup_release_data() + qgroup_free_delayed_ref() combination That's used for case like writing pages into disk. The work flow is: __btrfs_buffered_write() |- btrfs_check_free_space() |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED flag Increase qgroup->reserved. run_dealloc_range() <- Write data into disk finish_ordered_io() <- Update metadata |- add_delayed_data_ref() <- Create a new data_ref, record | how much qgroup->reserved needs to | free |- btrfs_qgroup_release_data() <- Only clearing QGROUP_RESERVED flag But qgroup->reserved is not decreased commit_trans() |- run_delayed_refs() |- run_one_delayed_ref() |- btrfs_qgroup_free_delayed_ref() <- Decrease qgroup->reserved So if we hit btrfs_qgroup_free_delayed_ref(), it means one extent hits disk. The complexity is because qgroup is done at commit_trans() time, so qgroup excl/rfer is not updated at finish_ordered_io() time. If we decrease qgroup->reserved here, then we can easily exceed qgroup limit. Maybe the problem is about the delayed_ref. IIRC some case we can create delayed_ref whose refs is still 0. Which means the delayed ref although hit disk, but got freed in the same trans. If that's the case, we shouldn't call qgroup_free_delayed_ref() for such delayed_ref. 2) Through qgroup_free_data() That's for case where extent doesn't reach disk. The flow is much more straightforward: __btrfs_buffered_write() |- btrfs_check_free_space() |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED flag Increase qgroup->reserved. btrfs_invalidatepage() |- btrfs_qgroup_free_data() <- Clear QGROUP_RESERVED flag Decrease qgroup->reserved. > > >> >> Until the last buffered_write(), which doesn't got truncated. >> >>> >>> >>> Process: sync >>> btrfs_sync_fs() >>> |-btrfs_commit_transaction() >>> |-btrfs_run_delayed_refs() >>> |- qgroup_free_refroot() - Reduces reserved by the size of the >>> extent (in this case, the filesize of dd (first time)), even though some >>> of the PAGESIZEs have been reduced while performing the truncate on the >>> file. >> >> The delayed_ref belongs to the last extent, which doesn't got truncated. >> >> And in that case, that last extent got into disk. >> run_dealloc_range() <- Write data into disk >> finish_ordered_io() <- Update metadata >> |- insert_reserved_file_extents() >> |- btrfs_alloc_reserved_file_extent() >> | |- btrfs_add_delayed_data_ref >> | <This will cause a new delayed_data_ref> >> |- btrfs_qgroup_release_data() >> <This will clear QGROURP_RESERVED> >> >> Then goes to your btrfs_sync_fs() => qgroup_free_refroot() >> >> >> >> [[I think I find the problem now]] >> >> Between btrfs_alloc_reserved_file_extent() and >> btrfs_qgroup_release_data(), there is a window that delayed_refs can be >> executed. >> Since that d*mn delayed_refs can be run at any time asynchronously. >> >> In that case, it will call qgroup_free_refroot() first at another >> process context, and later btrfs_qgroup_release_data() get schedualed, >> which will clear QGROUP_RESERVED again, causing the underflow. >> >> Although it's not the original cause you reported, would you mind to try >> the following quick-dirty fix without your fix, to see if it fixes the >> problem? >> >> ------- >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index e6811c4..70efa14 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -2168,15 +2168,15 @@ static int insert_reserved_file_extent(struct >> btrfs_trans_handle *trans, >> ins.objectid = disk_bytenr; >> ins.offset = disk_num_bytes; >> ins.type = BTRFS_EXTENT_ITEM_KEY; >> - ret = btrfs_alloc_reserved_file_extent(trans, root, >> - root->root_key.objectid, >> - btrfs_ino(inode), file_pos, >> - ram_bytes, &ins); >> /* >> * Release the reserved range from inode dirty range map, as it is >> * already moved into delayed_ref_head >> */ >> btrfs_qgroup_release_data(inode, file_pos, ram_bytes); >> + ret = btrfs_alloc_reserved_file_extent(trans, root, >> + root->root_key.objectid, >> + btrfs_ino(inode), file_pos, >> + ram_bytes, &ins); >> out: >> btrfs_free_path(path); >> >> ------- > > No, it does not work. > >> >>> >>> I hope that makes it clear. >>> >>>> >>>> [[About test case]] >>>> And for the test case, I can't reproduce the problem no matter if I >>>> apply the fix or not. >>>> >>>> Either way it just fails after 3 loops of dd, and later dd will all >>>> fail. >>>> But I can still remove the file and write new data into the fs. >>>> >>> >>> Strange? I can reproduce at every instance of running it. Even on >>> 4.8.0-rc7 >> >> Maybe related to the memory size. >> >> Since you're also using VM, maybe your VM RAM is too small, so dirty >> pages pressure triggers a commit halfway and cause the race? > > Not quite, this problem happens on a physical machine as well. Besides, > this VM is not in memory pressure. I tried it again on 4.8.0-rc8. > Finally reproduced it. Although in a low chance, about 1/10. Under most case, the final remove gets executed without error. ------- sudo ./qgroup_underflow.sh btrfs-progs v4.7.3 See http://btrfs.wiki.kernel.org for more information. Label: (null) UUID: Node size: 16384 Sector size: 4096 Filesystem size: 10.00GiB Block group profiles: Data: single 8.00MiB Metadata: DUP 1.00GiB System: DUP 8.00MiB SSD detected: no Incompat features: extref, skinny-metadata Number of devices: 1 Devices: ID SIZE PATH 1 10.00GiB /dev/vdb5 Create subvolume './a' 40+0 records in 40+0 records out 41943040 bytes (42 MB, 40 MiB) copied, 0.0450651 s, 931 MB/s 40+0 records in 40+0 records out 41943040 bytes (42 MB, 40 MiB) copied, 0.0771046 s, 544 MB/s dd: error writing '/mnt/scratch/a/file': Disk quota exceeded 10+0 records in 9+0 records out 9961472 bytes (10 MB, 9.5 MiB) copied, 0.0179116 s, 556 MB/s dd: error writing '/mnt/scratch/a/file': Disk quota exceeded 1+0 records in 0+0 records out 0 bytes copied, 0.00101051 s, 0.0 kB/s dd: error writing '/mnt/scratch/a/file': Disk quota exceeded 1+0 records in 0+0 records out 0 bytes copied, 0.000875674 s, 0.0 kB/s dd: error writing '/mnt/scratch/a/file': Disk quota exceeded 1+0 records in 0+0 records out 0 bytes copied, 0.000889771 s, 0.0 kB/s dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded dd: failed to open '/mnt/scratch/a/file': Disk quota exceeded Removing file ------- I'd try to tweak the script to improve the possibility and then use ftrace to track qgroup reserve and delayed_ref related event. Thanks, Qu ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-28 1:44 ` Qu Wenruo @ 2016-09-28 2:19 ` Goldwyn Rodrigues 2016-09-29 8:57 ` Qu Wenruo 0 siblings, 1 reply; 15+ messages in thread From: Goldwyn Rodrigues @ 2016-09-28 2:19 UTC (permalink / raw) To: Qu Wenruo, Goldwyn Rodrigues, linux-btrfs On 09/27/2016 08:44 PM, Qu Wenruo wrote: <snipped> > Finally reproduced it. > > Although in a low chance, about 1/10. > > Under most case, the final remove gets executed without error. Change 50m to 500m while setting the qgroup limit, the probability will increase. -- Goldwyn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-28 2:19 ` Goldwyn Rodrigues @ 2016-09-29 8:57 ` Qu Wenruo 2016-09-29 11:13 ` Goldwyn Rodrigues 0 siblings, 1 reply; 15+ messages in thread From: Qu Wenruo @ 2016-09-29 8:57 UTC (permalink / raw) To: Goldwyn Rodrigues, Goldwyn Rodrigues, linux-btrfs Thanks for your test script. I succeeded in pinning down the problem. The problem is, btrfs is invalidate pages that belongs to ordered extent(goes through writeback) The ftrace(with more tracepoints to trace qgroup->reserved change) shows: ------ btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096, reserved=4096, op=free qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096 btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096, reserved=4096, op=free qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096 btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096, reserved=4096, op=free qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096 btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096, reserved=4096, op=free qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096 btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096, reserved=4096, op=free qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096 ^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K. btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880, reserved=5222400, op=release ^^^^^ Here ordered_io finishes, write the whole 5M data, while QGROUP_RESERVED only returned 5M - 20K. ------ The problem is at btrfs_add_delayed_data_ref(), we use the assumption that, qgroup reserved space is the same as extent size(5M) But in fact, btrfs_qgroup_release_data() only release (5M - 20K). Leaking the 20K. Calltrace: btrfs_finish_ordered_io() |- inserted_reserved_file_extent() |- btrfs_alloc_reserved_file_extent() | ^^ Here we tell delayed_ref to free 5M later |- btrfs_qgroup_release_data() ^^ We only released (5M - 20K), as the first 5 pages are freed by btrfs_invalidatepage() If btrfs is designed to freeing pages which belongs to ordered_extent, then it totally breaks the qgroup assumption. Then we have several ways to fix it: 1) Fix race between ordered extent(writeback) with invalidatepage() Make btrfs_invalidatepage() skips pages that are already in ordered_extent, then we're OK. This is much like your original fix, but I'm not sure if DIRTY page flag is bond to ordered_extent. And more comment will help. 2) Make btrfs_qgroup_release_data() return accurate num bytes And then pass it to delayed_ref head. This is quite anti-intuition. If we're adding a new extent, how could it happen that some pages are already invalidated? Anyway, awesome work, exposed a quite strange race and makes us re-think the base of qgroup reserve space framework. Thanks, Qu At 09/28/2016 10:19 AM, Goldwyn Rodrigues wrote: > > > On 09/27/2016 08:44 PM, Qu Wenruo wrote: > > <snipped> > >> Finally reproduced it. >> >> Although in a low chance, about 1/10. >> >> Under most case, the final remove gets executed without error. > > Change 50m to 500m while setting the qgroup limit, the probability will > increase. > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-29 8:57 ` Qu Wenruo @ 2016-09-29 11:13 ` Goldwyn Rodrigues 2016-09-30 1:21 ` Qu Wenruo 2016-09-30 2:18 ` Qu Wenruo 0 siblings, 2 replies; 15+ messages in thread From: Goldwyn Rodrigues @ 2016-09-29 11:13 UTC (permalink / raw) To: Qu Wenruo, Goldwyn Rodrigues, linux-btrfs On 09/29/2016 03:57 AM, Qu Wenruo wrote: > Thanks for your test script. > > I succeeded in pinning down the problem. > > The problem is, btrfs is invalidate pages that belongs to ordered > extent(goes through writeback) No, I don't think invalidated pages are going through writeback. The problem is that the space for extent allocation is done before the writeback and if pages are invalidated before writeback, it is not accounted for. > > The ftrace(with more tracepoints to trace qgroup->reserved change) shows: > ------ > btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096, > reserved=4096, op=free > qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096 > btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096, > reserved=4096, op=free > qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096 > btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096, > reserved=4096, op=free > qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096 > btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096, > reserved=4096, op=free > qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096 > btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096, > reserved=4096, op=free > qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096 This is a little confusing. There is no qgroup_update_reserve() function in the source. Where did you add this? > > > ^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K. > > btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880, > reserved=5222400, op=release > > ^^^^^ Here ordered_io finishes, write the whole 5M data, while > QGROUP_RESERVED only returned 5M - 20K. > ------ > > The problem is at btrfs_add_delayed_data_ref(), we use the assumption > that, qgroup reserved space is the same as extent size(5M) > But in fact, btrfs_qgroup_release_data() only release (5M - 20K). > Leaking the 20K. Yes, this is more like it. However, I would put it the other way. It releases 5M when it should have released 5M-20K, because 20k was released when invalidating page. > > Calltrace: > btrfs_finish_ordered_io() > |- inserted_reserved_file_extent() > |- btrfs_alloc_reserved_file_extent() > | ^^ Here we tell delayed_ref to free 5M later > |- btrfs_qgroup_release_data() > ^^ We only released (5M - 20K), as the first 5 pages > are freed by btrfs_invalidatepage() > > If btrfs is designed to freeing pages which belongs to ordered_extent, > then it totally breaks the qgroup assumption. > > Then we have several ways to fix it: > 1) Fix race between ordered extent(writeback) with invalidatepage() > Make btrfs_invalidatepage() skips pages that are already in > ordered_extent, then we're OK. > > This is much like your original fix, but I'm not sure if DIRTY page > flag is bond to ordered_extent. > And more comment will help. So, what you mean is that fix is correct, I just need to reword the patch header. > > 2) Make btrfs_qgroup_release_data() return accurate num bytes > And then pass it to delayed_ref head. > > This is quite anti-intuition. If we're adding a new extent, how > could it happen that some pages are already invalidated? > I agree. It is counter-intutive and will increase complexity. > Anyway, awesome work, exposed a quite strange race and makes us re-think > the base of qgroup reserve space framework. Thanks. -- Goldwyn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-29 11:13 ` Goldwyn Rodrigues @ 2016-09-30 1:21 ` Qu Wenruo 2016-09-30 2:18 ` Qu Wenruo 1 sibling, 0 replies; 15+ messages in thread From: Qu Wenruo @ 2016-09-30 1:21 UTC (permalink / raw) To: Goldwyn Rodrigues, Goldwyn Rodrigues, linux-btrfs At 09/29/2016 07:13 PM, Goldwyn Rodrigues wrote: > > > On 09/29/2016 03:57 AM, Qu Wenruo wrote: >> Thanks for your test script. >> >> I succeeded in pinning down the problem. >> >> The problem is, btrfs is invalidate pages that belongs to ordered >> extent(goes through writeback) > > > No, I don't think invalidated pages are going through writeback. The > problem is that the space for extent allocation is done before the > writeback and if pages are invalidated before writeback, it is not > accounted for. > >> >> The ftrace(with more tracepoints to trace qgroup->reserved change) shows: >> ------ >> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096 > > This is a little confusing. There is no qgroup_update_reserve() function > in the source. Where did you add this? Sorry, I just submitted them few minutes ago. Have CCed you. > >> >> >> ^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K. >> >> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880, >> reserved=5222400, op=release >> >> ^^^^^ Here ordered_io finishes, write the whole 5M data, while >> QGROUP_RESERVED only returned 5M - 20K. >> ------ >> >> The problem is at btrfs_add_delayed_data_ref(), we use the assumption >> that, qgroup reserved space is the same as extent size(5M) >> But in fact, btrfs_qgroup_release_data() only release (5M - 20K). >> Leaking the 20K. > > Yes, this is more like it. However, I would put it the other way. It > releases 5M when it should have released 5M-20K, because 20k was > released when invalidating page. > >> >> Calltrace: >> btrfs_finish_ordered_io() >> |- inserted_reserved_file_extent() >> |- btrfs_alloc_reserved_file_extent() >> | ^^ Here we tell delayed_ref to free 5M later >> |- btrfs_qgroup_release_data() >> ^^ We only released (5M - 20K), as the first 5 pages >> are freed by btrfs_invalidatepage() >> >> If btrfs is designed to freeing pages which belongs to ordered_extent, >> then it totally breaks the qgroup assumption. >> >> Then we have several ways to fix it: >> 1) Fix race between ordered extent(writeback) with invalidatepage() >> Make btrfs_invalidatepage() skips pages that are already in >> ordered_extent, then we're OK. >> >> This is much like your original fix, but I'm not sure if DIRTY page >> flag is bond to ordered_extent. >> And more comment will help. > > > So, what you mean is that fix is correct, I just need to reword the > patch header. Not patch header, but comment before PageDirty() check. I'm still digging the code to see if page DIRTY check is safe and if there is more elegant fix. Thanks, Qu > >> >> 2) Make btrfs_qgroup_release_data() return accurate num bytes >> And then pass it to delayed_ref head. >> >> This is quite anti-intuition. If we're adding a new extent, how >> could it happen that some pages are already invalidated? >> > > I agree. It is counter-intutive and will increase complexity. > >> Anyway, awesome work, exposed a quite strange race and makes us re-think >> the base of qgroup reserve space framework. > > Thanks. > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-29 11:13 ` Goldwyn Rodrigues 2016-09-30 1:21 ` Qu Wenruo @ 2016-09-30 2:18 ` Qu Wenruo 2016-09-30 2:24 ` Qu Wenruo 1 sibling, 1 reply; 15+ messages in thread From: Qu Wenruo @ 2016-09-30 2:18 UTC (permalink / raw) To: Goldwyn Rodrigues, Goldwyn Rodrigues, linux-btrfs Your original fix is good. Feel free to my tag after enriching the comment in btrfs_invalidatepage(). Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Thanks again for your founding and sorry for the extra time reviewing. Thanks Qu At 09/29/2016 07:13 PM, Goldwyn Rodrigues wrote: > > > On 09/29/2016 03:57 AM, Qu Wenruo wrote: >> Thanks for your test script. >> >> I succeeded in pinning down the problem. >> >> The problem is, btrfs is invalidate pages that belongs to ordered >> extent(goes through writeback) > > > No, I don't think invalidated pages are going through writeback. The > problem is that the space for extent allocation is done before the > writeback and if pages are invalidated before writeback, it is not > accounted for. > >> >> The ftrace(with more tracepoints to trace qgroup->reserved change) shows: >> ------ >> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096 >> btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096, >> reserved=4096, op=free >> qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096 > > This is a little confusing. There is no qgroup_update_reserve() function > in the source. Where did you add this? > >> >> >> ^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K. >> >> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880, >> reserved=5222400, op=release >> >> ^^^^^ Here ordered_io finishes, write the whole 5M data, while >> QGROUP_RESERVED only returned 5M - 20K. >> ------ >> >> The problem is at btrfs_add_delayed_data_ref(), we use the assumption >> that, qgroup reserved space is the same as extent size(5M) >> But in fact, btrfs_qgroup_release_data() only release (5M - 20K). >> Leaking the 20K. > > Yes, this is more like it. However, I would put it the other way. It > releases 5M when it should have released 5M-20K, because 20k was > released when invalidating page. > >> >> Calltrace: >> btrfs_finish_ordered_io() >> |- inserted_reserved_file_extent() >> |- btrfs_alloc_reserved_file_extent() >> | ^^ Here we tell delayed_ref to free 5M later >> |- btrfs_qgroup_release_data() >> ^^ We only released (5M - 20K), as the first 5 pages >> are freed by btrfs_invalidatepage() >> >> If btrfs is designed to freeing pages which belongs to ordered_extent, >> then it totally breaks the qgroup assumption. >> >> Then we have several ways to fix it: >> 1) Fix race between ordered extent(writeback) with invalidatepage() >> Make btrfs_invalidatepage() skips pages that are already in >> ordered_extent, then we're OK. >> >> This is much like your original fix, but I'm not sure if DIRTY page >> flag is bond to ordered_extent. >> And more comment will help. > > > So, what you mean is that fix is correct, I just need to reword the > patch header. > >> >> 2) Make btrfs_qgroup_release_data() return accurate num bytes >> And then pass it to delayed_ref head. >> >> This is quite anti-intuition. If we're adding a new extent, how >> could it happen that some pages are already invalidated? >> > > I agree. It is counter-intutive and will increase complexity. > >> Anyway, awesome work, exposed a quite strange race and makes us re-think >> the base of qgroup reserve space framework. > > Thanks. > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero 2016-09-30 2:18 ` Qu Wenruo @ 2016-09-30 2:24 ` Qu Wenruo 0 siblings, 0 replies; 15+ messages in thread From: Qu Wenruo @ 2016-09-30 2:24 UTC (permalink / raw) To: Goldwyn Rodrigues, Goldwyn Rodrigues, linux-btrfs At 09/30/2016 10:18 AM, Qu Wenruo wrote: > Your original fix is good. > > Feel free to my tag after enriching the comment in btrfs_invalidatepage(). Feel free to "add" my tag... > > Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > > Thanks again for your founding and sorry for the extra time reviewing. for your "finding"... My English just sucks.... > > Thanks > Qu > > > At 09/29/2016 07:13 PM, Goldwyn Rodrigues wrote: >> >> >> On 09/29/2016 03:57 AM, Qu Wenruo wrote: >>> Thanks for your test script. >>> >>> I succeeded in pinning down the problem. >>> >>> The problem is, btrfs is invalidate pages that belongs to ordered >>> extent(goes through writeback) >> >> >> No, I don't think invalidated pages are going through writeback. The >> problem is that the space for extent allocation is done before the >> writeback and if pages are invalidated before writeback, it is not >> accounted for. >> >>> >>> The ftrace(with more tracepoints to trace qgroup->reserved change) >>> shows: >>> ------ >>> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096, >>> reserved=4096, op=free >>> qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096 >>> btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096, >>> reserved=4096, op=free >>> qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096 >>> btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096, >>> reserved=4096, op=free >>> qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096 >>> btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096, >>> reserved=4096, op=free >>> qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096 >>> btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096, >>> reserved=4096, op=free >>> qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096 >> >> This is a little confusing. There is no qgroup_update_reserve() function >> in the source. Where did you add this? >> >>> >>> >>> ^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K. >>> >>> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880, >>> reserved=5222400, op=release >>> >>> ^^^^^ Here ordered_io finishes, write the whole 5M data, while >>> QGROUP_RESERVED only returned 5M - 20K. >>> ------ >>> >>> The problem is at btrfs_add_delayed_data_ref(), we use the assumption >>> that, qgroup reserved space is the same as extent size(5M) >>> But in fact, btrfs_qgroup_release_data() only release (5M - 20K). >>> Leaking the 20K. >> >> Yes, this is more like it. However, I would put it the other way. It >> releases 5M when it should have released 5M-20K, because 20k was >> released when invalidating page. >> >>> >>> Calltrace: >>> btrfs_finish_ordered_io() >>> |- inserted_reserved_file_extent() >>> |- btrfs_alloc_reserved_file_extent() >>> | ^^ Here we tell delayed_ref to free 5M later >>> |- btrfs_qgroup_release_data() >>> ^^ We only released (5M - 20K), as the first 5 pages >>> are freed by btrfs_invalidatepage() >>> >>> If btrfs is designed to freeing pages which belongs to ordered_extent, >>> then it totally breaks the qgroup assumption. >>> >>> Then we have several ways to fix it: >>> 1) Fix race between ordered extent(writeback) with invalidatepage() >>> Make btrfs_invalidatepage() skips pages that are already in >>> ordered_extent, then we're OK. >>> >>> This is much like your original fix, but I'm not sure if DIRTY page >>> flag is bond to ordered_extent. >>> And more comment will help. >> >> >> So, what you mean is that fix is correct, I just need to reword the >> patch header. >> >>> >>> 2) Make btrfs_qgroup_release_data() return accurate num bytes >>> And then pass it to delayed_ref head. >>> >>> This is quite anti-intuition. If we're adding a new extent, how >>> could it happen that some pages are already invalidated? >>> >> >> I agree. It is counter-intutive and will increase complexity. >> >>> Anyway, awesome work, exposed a quite strange race and makes us re-think >>> the base of qgroup reserve space framework. >> >> Thanks. >> > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] qgroup: Prevent qgroup->reserved from going subzero @ 2016-09-30 15:40 Goldwyn Rodrigues 0 siblings, 0 replies; 15+ messages in thread From: Goldwyn Rodrigues @ 2016-09-30 15:40 UTC (permalink / raw) To: linux-btrfs; +Cc: quwenruo, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> While free'ing qgroup->reserved resources, we much check if the page has not been invalidated by a truncate operation by checking if the page is still dirty before reducing the qgroup resources. Resources in such a case are free'd when the entire extent is released by delayed_ref. This fixes a double accounting while releasing resources in case of truncating a file, reproduced by the following testcase. SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/mnt mkfs.btrfs -f $SCRATCH_DEV mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT cd $SCRATCH_MNT btrfs quota enable $SCRATCH_MNT btrfs subvolume create a btrfs qgroup limit 500m a $SCRATCH_MNT sync for c in {1..15}; do dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; done sleep 10 sync sleep 5 touch $SCRATCH_MNT/a/newfile echo "Removing file" rm $SCRATCH_MNT/a/file Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> --- fs/btrfs/inode.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e6811c4..bc1a004 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8915,9 +8915,14 @@ again: * So even we call qgroup_free_data(), it won't decrease reserved * space. * 2) Not written to disk - * This means the reserved space should be freed here. + * This means the reserved space should be freed here. However, + * if a truncate invalidates the page (by clearing PageDirty) + * and the page is accounted for while allocating extent + * in btrfs_check_data_free_space() we let delayed_ref to + * free the entire extent. */ - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); + if (PageDirty(page)) + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); if (!inode_evicting) { clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED | EXTENT_DIRTY | -- 2.10.0 ^ permalink raw reply related [flat|nested] 15+ messages in thread
end of thread, other threads:[~2016-09-30 15:41 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-09-22 18:47 [PATCH] qgroup: Prevent qgroup->reserved from going subzero Goldwyn Rodrigues 2016-09-23 1:06 ` Qu Wenruo 2016-09-23 13:43 ` Goldwyn Rodrigues 2016-09-26 2:33 ` Qu Wenruo 2016-09-26 14:31 ` Goldwyn Rodrigues 2016-09-27 3:10 ` Qu Wenruo 2016-09-27 14:04 ` Goldwyn Rodrigues 2016-09-28 1:44 ` Qu Wenruo 2016-09-28 2:19 ` Goldwyn Rodrigues 2016-09-29 8:57 ` Qu Wenruo 2016-09-29 11:13 ` Goldwyn Rodrigues 2016-09-30 1:21 ` Qu Wenruo 2016-09-30 2:18 ` Qu Wenruo 2016-09-30 2:24 ` Qu Wenruo 2016-09-30 15:40 Goldwyn Rodrigues
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.