All of lore.kernel.org
 help / color / mirror / Atom feed
* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
@ 2017-11-20 18:54 John Lightsey
  2017-11-21  0:58 ` Changwei Ge
  0 siblings, 1 reply; 12+ messages in thread
From: John Lightsey @ 2017-11-20 18:54 UTC (permalink / raw)
  To: ocfs2-devel

In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel
list:

https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html

cPanel encountered this bug after upgrading our cluster to the 4.9
Debian stable kernel. In our environment, the bug would trigger every
few hours.

The core problem seems to be that the size of dw_zero_list is not
tracked correctly. This causes the ocfs2_lock_allocators() call in
ocfs2_dio_end_io_write() to underestimate the number of extents needed.
As a result, meta_ac is null when it's needed in ocfs2_grow_tree().

The attached patch is a forward-ported version of the fix we applied to
Debian's 4.9 kernel to correct the issue.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Fix-OCFS2-extent-split-estimation-for-dio-allocators.patch
Type: text/x-patch
Size: 1712 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20171120/be39dbe1/attachment.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20171120/be39dbe1/attachment-0001.bin 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-20 18:54 [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514! John Lightsey
@ 2017-11-21  0:58 ` Changwei Ge
  2017-11-21  2:45   ` John Lightsey
  2017-11-21  3:04   ` piaojun
  0 siblings, 2 replies; 12+ messages in thread
From: Changwei Ge @ 2017-11-21  0:58 UTC (permalink / raw)
  To: ocfs2-devel

Hi John,
It's better to paste your patch directly into message body. It's easy 
for reviewing.

So I copied your patch below:

> The dw_zero_count tracking was assuming that w_unwritten_list would
> always contain one element. The actual count is now tracked whenever
> the list is extended.
> ---
>  fs/ocfs2/aops.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 88a31e9340a0..eb0a81368dbb 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt {
>  	struct ocfs2_cached_dealloc_ctxt w_dealloc;
>  
>  	struct list_head		w_unwritten_list;
> +
> +	unsigned int			w_unwritten_count;
>  };
>  
>  void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages)
> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
>  
>  	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
>  	INIT_LIST_HEAD(&wc->w_unwritten_list);
> +	wc->w_unwritten_count = 0;

I think you don't have to evaluate ::w_unwritten_count to zero since 
kzalloc already did that.

>  
>  	*wcp = wc;
>  
> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode,
>  	desc->c_clear_unwritten = 0;
>  	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
>  	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
> +	wc->w_unwritten_count++;

You increase ::w_unwritten_coun once a new _ue_ is attached to 
::w_unwritten_list. So if no _ue_ ever is attached, ::w_unwritten_list 
is still empty. I think your change has the same effect with origin.

Moreover I don't see the relation between the reported crash issue and 
your patch change. Can you elaborate further?

Thanks,
Changwei

>  	new = NULL;
>  unlock:
>  	spin_unlock(&oi->ip_lock);
> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock,
>  		ue->ue_phys = desc->c_phys;
>  
>  		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
> -		dwc->dw_zero_count++;
> +		dwc->dw_zero_count += wc->w_unwritten_count;
>  	}
>  
>  	ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc);
> -- 
> 2.11.0



On 2017/11/21 2:56, John Lightsey wrote:
> In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel
> list:
> 
> https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html
> 
> cPanel encountered this bug after upgrading our cluster to the 4.9
> Debian stable kernel. In our environment, the bug would trigger every
> few hours.
> 
> The core problem seems to be that the size of dw_zero_list is not
> tracked correctly. This causes the ocfs2_lock_allocators() call in
> ocfs2_dio_end_io_write() to underestimate the number of extents needed.
> As a result, meta_ac is null when it's needed in ocfs2_grow_tree().
> 
> The attached patch is a forward-ported version of the fix we applied to
> Debian's 4.9 kernel to correct the issue.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-21  0:58 ` Changwei Ge
@ 2017-11-21  2:45   ` John Lightsey
  2017-11-21  5:58     ` Changwei Ge
  2017-11-21  3:04   ` piaojun
  1 sibling, 1 reply; 12+ messages in thread
From: John Lightsey @ 2017-11-21  2:45 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, 2017-11-21 at 00:58 +0000, Changwei Ge wrote:
> > @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct
> > ocfs2_write_ctxt **wcp,
> > ?
> > ?	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
> > ?	INIT_LIST_HEAD(&wc->w_unwritten_list);
> > +	wc->w_unwritten_count = 0;
> 
> I think you don't have to evaluate ::w_unwritten_count to zero since?
> kzalloc already did that.

Very true. I was following the example of how dwc was handling the
dw_zero_count. You'll have to forgive me a bit. I'm very unfamiliar
with the linux kernel codebase.

> 
> > ?
> > ?	*wcp = wc;
> > ?
> > @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode
> > *inode,
> > ?	desc->c_clear_unwritten = 0;
> > ?	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
> > ?	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
> > +	wc->w_unwritten_count++;
> 
> You increase ::w_unwritten_coun once a new _ue_ is attached to?
> ::w_unwritten_list. So if no _ue_ ever is attached,
> ::w_unwritten_list?
> is still empty. I think your change has the same effect with origin.
> 
> Moreover I don't see the relation between the reported crash issue
> and?your patch change. Can you elaborate further?

The important part is in the next segment in the patch. This block is
just using w_unwritten_count to track the size of w_unwritten_list.

> > @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode
> > *inode, sector_t iblock,
> > ?		ue->ue_phys = desc->c_phys;
> > ?
> > ?		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
> > -		dwc->dw_zero_count++;
> > +		dwc->dw_zero_count += wc->w_unwritten_count;
> > ?	}
> > ?
> > 

dw_zero_count is tracking the number of elements in dw_zero_list.

The old version assumed that after dw_zero_list and w_unwritten_list
were spliced together, that the new length was dw_zero_count + 1. This
assumption is not correct if w_unwritten_list contained more than one
element.

The length of dw_zero_list is used by ocfs2_dio_end_io_write() to
determine whether or not meta_ac will be needed to complete the write:

????ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2,
????????????????????&data_ac, &meta_ac);

This will return with success and a null meta_ac if there are at least
dw_zero_count * 2 extents available for the write.

Since dw_zero_count was not being calculated correctly, this will
occasionally result in the write getting into ocfs2_grow_tree() with a
null meta_ac following this chain:

ocfs2_dio_end_io_write()
ocfs2_mark_extent_written()
ocfs2_change_extent_flag()
ocfs2_split_extent()
ocfs2_split_and_insert()
ocfs2_grow_tree()

That's my understanding of what's causing the bug.

Our OCFS2 cluster was crashing every two to three hours after we
upgraded to a 4.x kernel. We've gone about 18 hours with this patch
applied and no crashes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20171120/b68bb5fa/attachment.bin 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-21  0:58 ` Changwei Ge
  2017-11-21  2:45   ` John Lightsey
@ 2017-11-21  3:04   ` piaojun
  2017-11-21  4:24     ` John Lightsey
  1 sibling, 1 reply; 12+ messages in thread
From: piaojun @ 2017-11-21  3:04 UTC (permalink / raw)
  To: ocfs2-devel

Hi John,

On 2017/11/21 8:58, Changwei Ge wrote:
> Hi John,
> It's better to paste your patch directly into message body. It's easy 
> for reviewing.
> 
> So I copied your patch below:
> 
>> The dw_zero_count tracking was assuming that w_unwritten_list would
>> always contain one element. The actual count is now tracked whenever
>> the list is extended.
>> ---
>>  fs/ocfs2/aops.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index 88a31e9340a0..eb0a81368dbb 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
>> @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt {
>>  	struct ocfs2_cached_dealloc_ctxt w_dealloc;
>>  
>>  	struct list_head		w_unwritten_list;
>> +
>> +	unsigned int			w_unwritten_count;
>>  };
>>  
>>  void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages)
>> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
>>  
>>  	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
>>  	INIT_LIST_HEAD(&wc->w_unwritten_list);
>> +	wc->w_unwritten_count = 0;
> 
> I think you don't have to evaluate ::w_unwritten_count to zero since 
> kzalloc already did that.
> 
>>  
>>  	*wcp = wc;
>>  
>> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode,
>>  	desc->c_clear_unwritten = 0;
>>  	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
>>  	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
>> +	wc->w_unwritten_count++;
> 
> You increase ::w_unwritten_coun once a new _ue_ is attached to 
> ::w_unwritten_list. So if no _ue_ ever is attached, ::w_unwritten_list 
> is still empty. I think your change has the same effect with origin.
> 
> Moreover I don't see the relation between the reported crash issue and 
> your patch change. Can you elaborate further?
> 
> Thanks,
> Changwei
> 
>>  	new = NULL;
>>  unlock:
>>  	spin_unlock(&oi->ip_lock);
>> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock,
>>  		ue->ue_phys = desc->c_phys;
>>  
>>  		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
>> -		dwc->dw_zero_count++;
>> +		dwc->dw_zero_count += wc->w_unwritten_count;

I prefer using a loop to calculate 'dwc->dw_zero_count' rather than
introducing a new variable as below:

list_for_each(iter, &wc->w_unwritten_list)
	dwc->dw_zero_count++;

thanks,
Jun

>>  	}
>>  
>>  	ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc);
>> -- 
>> 2.11.0
> 
> 
> 
> On 2017/11/21 2:56, John Lightsey wrote:
>> In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel
>> list:
>>
>> https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html
>>
>> cPanel encountered this bug after upgrading our cluster to the 4.9
>> Debian stable kernel. In our environment, the bug would trigger every
>> few hours.
>>
>> The core problem seems to be that the size of dw_zero_list is not
>> tracked correctly. This causes the ocfs2_lock_allocators() call in
>> ocfs2_dio_end_io_write() to underestimate the number of extents needed.
>> As a result, meta_ac is null when it's needed in ocfs2_grow_tree().
>>
>> The attached patch is a forward-ported version of the fix we applied to
>> Debian's 4.9 kernel to correct the issue.
>>
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-21  3:04   ` piaojun
@ 2017-11-21  4:24     ` John Lightsey
  0 siblings, 0 replies; 12+ messages in thread
From: John Lightsey @ 2017-11-21  4:24 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, 2017-11-21 at 11:04 +0800, piaojun wrote:
> > > ?	new = NULL;
> > > ?unlock:
> > > ?	spin_unlock(&oi->ip_lock);
> > > @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode
> > > *inode, sector_t iblock,
> > > ?		ue->ue_phys = desc->c_phys;
> > > ?
> > > ?		list_splice_tail_init(&wc->w_unwritten_list,
> > > &dwc->dw_zero_list);
> > > -		dwc->dw_zero_count++;
> > > +		dwc->dw_zero_count += wc->w_unwritten_count;
> 
> I prefer using a loop to calculate 'dwc->dw_zero_count' rather than
> introducing a new variable as below:
> 
> list_for_each(iter, &wc->w_unwritten_list)
> 	dwc->dw_zero_count++;

If you want to iterate through the list to calculate the length rather
than tracking the length as the list grows, it would make more sense to
remove dw_zero_count.

If you'd prefer walking all of the elements to calculate the length,
this can just be done in ocfs2_dio_end_io_write() where the length of
the list is actually used.

I assumed that the intent of dw_zero_count was to avoid an unnecessary
iteration through dw_zero_list to determine the length.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20171120/85820571/attachment-0001.bin 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-21  2:45   ` John Lightsey
@ 2017-11-21  5:58     ` Changwei Ge
  2017-11-21 21:05       ` John Lightsey
  0 siblings, 1 reply; 12+ messages in thread
From: Changwei Ge @ 2017-11-21  5:58 UTC (permalink / raw)
  To: ocfs2-devel

On 2017/11/21 10:45, John Lightsey wrote:
> On Tue, 2017-11-21 at 00:58 +0000, Changwei Ge wrote:
>>> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct
>>> ocfs2_write_ctxt **wcp,
>>>   
>>>  ?	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
>>>  ?	INIT_LIST_HEAD(&wc->w_unwritten_list);
>>> +	wc->w_unwritten_count = 0;
>>
>> I think you don't have to evaluate ::w_unwritten_count to zero since
>> kzalloc already did that.
> 
> Very true. I was following the example of how dwc was handling the
> dw_zero_count. You'll have to forgive me a bit. I'm very unfamiliar
> with the linux kernel codebase.
> 
>>
>>>   
>>>  ?	*wcp = wc;
>>>   
>>> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode
>>> *inode,
>>>  ?	desc->c_clear_unwritten = 0;
>>>  ?	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
>>>  ?	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
>>> +	wc->w_unwritten_count++;
>>
>> You increase ::w_unwritten_coun once a new _ue_ is attached to
>> ::w_unwritten_list. So if no _ue_ ever is attached,
>> ::w_unwritten_list
>> is still empty. I think your change has the same effect with origin.
>>
>> Moreover I don't see the relation between the reported crash issue
>> and?your patch change. Can you elaborate further?
> 
> The important part is in the next segment in the patch. This block is
> just using w_unwritten_count to track the size of w_unwritten_list.
> 
>>> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode
>>> *inode, sector_t iblock,
>>>  ?		ue->ue_phys = desc->c_phys;
>>>   
>>>  ?		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
>>> -		dwc->dw_zero_count++;
>>> +		dwc->dw_zero_count += wc->w_unwritten_count;
>>>  ?	}
>>>   
>>>
> 
> dw_zero_count is tracking the number of elements in dw_zero_list.
> 
> The old version assumed that after dw_zero_list and w_unwritten_list
> were spliced together, that the new length was dw_zero_count + 1. This
> assumption is not correct if w_unwritten_list contained more than one
> element.
> 
> The length of dw_zero_list is used by ocfs2_dio_end_io_write() to
> determine whether or not meta_ac will be needed to complete the write:
> 
>  ????ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2,
>  ????????????????????&data_ac, &meta_ac);
Hi John,

Thanks for reporting.
I probably get your point.

Can your tell me how did you format your volume?
What's your _cluster size_ and _block size_?
Your can obtain such information via debugfs.ocfs2 <your volume> -R 
'stats' | grep 'Cluster Size'

It's better for you provide a way to reproduce this issue so that we can 
perform some test.

Thanks,
Changwei

> 
> This will return with success and a null meta_ac if there are at least
> dw_zero_count * 2 extents available for the write.
> 
> Since dw_zero_count was not being calculated correctly, this will
> occasionally result in the write getting into ocfs2_grow_tree() with a
> null meta_ac following this chain:
> 
> ocfs2_dio_end_io_write()
> ocfs2_mark_extent_written()
> ocfs2_change_extent_flag()
> ocfs2_split_extent()
> ocfs2_split_and_insert()
> ocfs2_grow_tree()
> 
> That's my understanding of what's causing the bug.
> 
> Our OCFS2 cluster was crashing every two to three hours after we
> upgraded to a 4.x kernel. We've gone about 18 hours with this patch
> applied and no crashes.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-21  5:58     ` Changwei Ge
@ 2017-11-21 21:05       ` John Lightsey
  2017-11-24  5:46         ` alex chen
  0 siblings, 1 reply; 12+ messages in thread
From: John Lightsey @ 2017-11-21 21:05 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, 2017-11-21 at 05:58 +0000, Changwei Ge wrote:

> Can your tell me how did you format your volume?
> What's your _cluster size_ and _block size_?
> Your can obtain such information via debugfs.ocfs2 <your volume> -R?
> 'stats' | grep 'Cluster Size'
> 
> It's better for you provide a way to reproduce this issue so that we
> can?
> perform some test.
> 

The issue recurred in our cluster today, so at best my patch is just
decreasing the frequency of the crashes.

Our setup has 10 machines sharing two OCFS2 mountpoints over fibre
channel.

Both OCFS2 partitions have block size bits set to 12 and cluster size
bits set to 20.

The two partitions contain around 310 files total with 200 of those
being qcow2 files. The only inodes getting any read and write activity
are the qcow2 files.

The qcow2 files were created as sparse files (preallocation=metadata)
and some are reflinked copies.

It's not clear to me exactly why it's passing through
ocfs2_lock_allocators() without allocating meta_ac. These qcow files
wouldn't be written concurrently by different systems in the OCFS2
cluster.

Is it possible the 2 x multiplier in the ocfs2_lock_allocators call is
not large enough?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20171121/29b65745/attachment.bin 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-21 21:05       ` John Lightsey
@ 2017-11-24  5:46         ` alex chen
  2017-11-24  7:03           ` Changwei Ge
  2017-11-28 14:34           ` John Lightsey
  0 siblings, 2 replies; 12+ messages in thread
From: alex chen @ 2017-11-24  5:46 UTC (permalink / raw)
  To: ocfs2-devel

Hi John,

I think a better method to solve this problem.

On 2017/11/22 5:05, John Lightsey wrote:
> On Tue, 2017-11-21 at 05:58 +0000, Changwei Ge wrote:
> 
>> Can your tell me how did you format your volume?
>> What's your _cluster size_ and _block size_?
>> Your can obtain such information via debugfs.ocfs2 <your volume> -R 
>> 'stats' | grep 'Cluster Size'
>>
>> It's better for you provide a way to reproduce this issue so that we
>> can 
>> perform some test.
>>
> 
> The issue recurred in our cluster today, so at best my patch is just
> decreasing the frequency of the crashes.

We need to check the free number of the records in each loop to mark extent written,
because the last extent block may be changed through many times marking extent written
and the num_free_extents also be changed. In the worst case, the num_free_extents may
become less than at the beginning of the loop. So we should not estimate the free
number of the records at the beginning of the loop to mark extent written.

I'd appreciate it if you could test the following patch and feedback the result.

Signed-off-by: Alex Chen <alex.chen@huawei.com>
---
 fs/ocfs2/aops.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 64 insertions(+), 12 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index d151632..80725f2 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2272,6 +2272,35 @@ static int ocfs2_dio_wr_get_block(struct inode *inode, sector_t iblock,
 	return ret;
 }

+static int ocfs2_dio_should_restart(struct ocfs2_extent_tree *et,
+		struct ocfs2_alloc_context *meta_ac, int max_rec_needed)
+{
+	int status = 0, free_extents;
+
+	free_extents = ocfs2_num_free_extents(et);
+	if (free_extents < 0) {
+		status = free_extents;
+		mlog_errno(status);
+		return status;
+	}
+
+	/*
+	 * there are two cases which could cause us to EAGAIN in the
+	 * we-need-more-metadata case:
+	 * 1) we haven't reserved *any*
+	 * 2) we are so fragmented, we've needed to add metadata too
+	 *    many times.
+	 */
+	if (free_extents < max_rec_needed) {
+		if (!meta_ac || (ocfs2_alloc_context_bits_left(meta_ac)
+				< ocfs2_extend_meta_needed(et->et_root_el)))
+			status = 1;
+	}
+
+	return status;
+}
+
+#define OCFS2_MAX_REC_NEED_SPLIT 2
 static int ocfs2_dio_end_io_write(struct inode *inode,
 				  struct ocfs2_dio_write_ctxt *dwc,
 				  loff_t offset,
@@ -2281,14 +2310,13 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
 	struct ocfs2_extent_tree et;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
-	struct ocfs2_unwritten_extent *ue = NULL;
+	struct ocfs2_unwritten_extent *ue = NULL, *tmp_ue;
 	struct buffer_head *di_bh = NULL;
 	struct ocfs2_dinode *di;
-	struct ocfs2_alloc_context *data_ac = NULL;
 	struct ocfs2_alloc_context *meta_ac = NULL;
 	handle_t *handle = NULL;
 	loff_t end = offset + bytes;
-	int ret = 0, credits = 0, locked = 0;
+	int ret = 0, credits = 0, locked = 0, restart_func = 0;

 	ocfs2_init_dealloc_ctxt(&dealloc);

@@ -2330,8 +2358,9 @@ static int ocfs2_dio_end_io_write(struct inode *inode,

 	ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode), di_bh);

-	ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2,
-				    &data_ac, &meta_ac);
+restart_all:
+	ret = ocfs2_lock_allocators(inode, &et, 0, OCFS2_MAX_REC_NEED_SPLIT,
+				    NULL, &meta_ac);
 	if (ret) {
 		mlog_errno(ret);
 		goto unlock;
@@ -2343,7 +2372,7 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
 		mlog_errno(ret);
-		goto unlock;
+		goto free_ac;
 	}
 	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
 				      OCFS2_JOURNAL_ACCESS_WRITE);
@@ -2352,7 +2381,17 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
 		goto commit;
 	}

-	list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
+	list_for_each_entry_safe(ue, tmp_ue, &dwc->dw_zero_list, ue_node) {
+		ret = ocfs2_dio_should_restart(&et, meta_ac,
+				OCFS2_MAX_REC_NEED_SPLIT * 2);
+		if (ret < 0) {
+			mlog_errno(ret);
+			break;
+		} else if (ret == 1) {
+			restart_func = 1;
+			break;
+		}
+
 		ret = ocfs2_mark_extent_written(inode, &et, handle,
 						ue->ue_cpos, 1,
 						ue->ue_phys,
@@ -2361,24 +2400,37 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
 			mlog_errno(ret);
 			break;
 		}
+
+		dwc->dw_zero_count--;
+		list_del_init(&ue->ue_node);
+		spin_lock(&oi->ip_lock);
+		list_del_init(&ue->ue_ip_node);
+		spin_unlock(&oi->ip_lock);
+		kfree(ue);
 	}

-	if (end > i_size_read(inode)) {
+	if (!restart_func && end > i_size_read(inode)) {
 		ret = ocfs2_set_inode_size(handle, inode, di_bh, end);
 		if (ret < 0)
 			mlog_errno(ret);
 	}
+
 commit:
 	ocfs2_commit_trans(osb, handle);
+free_ac:
+	if (meta_ac) {
+		ocfs2_free_alloc_context(meta_ac);
+		meta_ac = NULL;
+	}
+	if (restart_func) {
+		restart_func = 0;
+		goto restart_all;
+	}
 unlock:
 	up_write(&oi->ip_alloc_sem);
 	ocfs2_inode_unlock(inode, 1);
 	brelse(di_bh);
 out:
-	if (data_ac)
-		ocfs2_free_alloc_context(data_ac);
-	if (meta_ac)
-		ocfs2_free_alloc_context(meta_ac);
 	ocfs2_run_deallocs(osb, &dealloc);
 	if (locked)
 		inode_unlock(inode);
-- 

> 
> Our setup has 10 machines sharing two OCFS2 mountpoints over fibre
> channel.
> 
> Both OCFS2 partitions have block size bits set to 12 and cluster size
> bits set to 20.
> 
> The two partitions contain around 310 files total with 200 of those
> being qcow2 files. The only inodes getting any read and write activity
> are the qcow2 files.
> 
> The qcow2 files were created as sparse files (preallocation=metadata)
> and some are reflinked copies.
> 
> It's not clear to me exactly why it's passing through
> ocfs2_lock_allocators() without allocating meta_ac. These qcow files
> wouldn't be written concurrently by different systems in the OCFS2
> cluster.
> 
> Is it possible the 2 x multiplier in the ocfs2_lock_allocators call is
> not large enough?
> 
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-24  5:46         ` alex chen
@ 2017-11-24  7:03           ` Changwei Ge
  2017-11-24 10:06             ` alex chen
  2017-11-28 14:34           ` John Lightsey
  1 sibling, 1 reply; 12+ messages in thread
From: Changwei Ge @ 2017-11-24  7:03 UTC (permalink / raw)
  To: ocfs2-devel

Hi Alex,
I just reviewed your patch and a few questions were come up with.

On 2017/11/24 13:49, alex chen wrote:
> Hi John,
> 
> I think a better method to solve this problem.
> 
> On 2017/11/22 5:05, John Lightsey wrote:
>> On Tue, 2017-11-21 at 05:58 +0000, Changwei Ge wrote:
>>
>>> Can your tell me how did you format your volume?
>>> What's your _cluster size_ and _block size_?
>>> Your can obtain such information via debugfs.ocfs2 <your volume> -R
>>> 'stats' | grep 'Cluster Size'
>>>
>>> It's better for you provide a way to reproduce this issue so that we
>>> can
>>> perform some test.
>>>
>>
>> The issue recurred in our cluster today, so at best my patch is just
>> decreasing the frequency of the crashes.
> 
> We need to check the free number of the records in each loop to mark extent written,
> because the last extent block may be changed through many times marking extent written
> and the num_free_extents also be changed. In the worst case, the num_free_extents may
> become less than at the beginning of the loop. So we should not estimate the free
> number of the records at the beginning of the loop to mark extent written.
> 

 From John's description, ::dw_zero_count was not calculated properly 
and your patch seems not to address this issue. Do you agree that it's 
possible that we get a wrong ::dw_zero_count?

> I'd appreciate it if you could test the following patch and feedback the result.
> 
> Signed-off-by: Alex Chen <alex.chen@huawei.com>
> ---
>   fs/ocfs2/aops.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 64 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index d151632..80725f2 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2272,6 +2272,35 @@ static int ocfs2_dio_wr_get_block(struct inode *inode, sector_t iblock,
>   	return ret;
>   }
> 
> +static int ocfs2_dio_should_restart(struct ocfs2_extent_tree *et,
> +		struct ocfs2_alloc_context *meta_ac, int max_rec_needed)
> +{
> +	int status = 0, free_extents;
> +
> +	free_extents = ocfs2_num_free_extents(et);
> +	if (free_extents < 0) {
> +		status = free_extents;
> +		mlog_errno(status);
> +		return status;
> +	}
> +
> +	/*
> +	 * there are two cases which could cause us to EAGAIN in the
> +	 * we-need-more-metadata case:
> +	 * 1) we haven't reserved *any*
> +	 * 2) we are so fragmented, we've needed to add metadata too
> +	 *    many times.

What does point 2 mean?

> +	 */
> +	if (free_extents < max_rec_needed) {
> +		if (!meta_ac || (ocfs2_alloc_context_bits_left(meta_ac)
> +				< ocfs2_extend_meta_needed(et->et_root_el)))
> +			status = 1;
> +	}
> +
> +	return status;
> +}
> +
> +#define OCFS2_MAX_REC_NEED_SPLIT 2

Um, I don't figure why the Maximum value is 2, I suppose this is less 
than previously estimated one.

>   static int ocfs2_dio_end_io_write(struct inode *inode,
>   				  struct ocfs2_dio_write_ctxt *dwc,
>   				  loff_t offset,
> @@ -2281,14 +2310,13 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>   	struct ocfs2_extent_tree et;
>   	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>   	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> -	struct ocfs2_unwritten_extent *ue = NULL;
> +	struct ocfs2_unwritten_extent *ue = NULL, *tmp_ue;
>   	struct buffer_head *di_bh = NULL;
>   	struct ocfs2_dinode *di;
> -	struct ocfs2_alloc_context *data_ac = NULL;
>   	struct ocfs2_alloc_context *meta_ac = NULL;
>   	handle_t *handle = NULL;
>   	loff_t end = offset + bytes;
> -	int ret = 0, credits = 0, locked = 0;
> +	int ret = 0, credits = 0, locked = 0, restart_func = 0;

I think _ restart_func_ is not named properly. Can we change it into 
realloc or something like that?

> 
>   	ocfs2_init_dealloc_ctxt(&dealloc);
> 
> @@ -2330,8 +2358,9 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> 
>   	ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode), di_bh);
> 
> -	ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2,
> -				    &data_ac, &meta_ac);
> +restart_all:
> +	ret = ocfs2_lock_allocators(inode, &et, 0, OCFS2_MAX_REC_NEED_SPLIT,
> +				    NULL, &meta_ac);
>   	if (ret) {
>   		mlog_errno(ret);
>   		goto unlock;
> @@ -2343,7 +2372,7 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>   	if (IS_ERR(handle)) {
>   		ret = PTR_ERR(handle);
>   		mlog_errno(ret);
> -		goto unlock;
> +		goto free_ac;
>   	}
>   	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
>   				      OCFS2_JOURNAL_ACCESS_WRITE);
> @@ -2352,7 +2381,17 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>   		goto commit;
>   	}
> 
> -	list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
> +	list_for_each_entry_safe(ue, tmp_ue, &dwc->dw_zero_list, ue_node) {
> +		ret = ocfs2_dio_should_restart(&et, meta_ac,
> +				OCFS2_MAX_REC_NEED_SPLIT * 2);

Why extent block's free record number would change during marking 
written, which compels code path has to re-alloc.

Thanks,
Changwei

> +		if (ret < 0) {
> +			mlog_errno(ret);
> +			break;
> +		} else if (ret == 1) {
> +			restart_func = 1;
> +			break;
> +		}
> +
>   		ret = ocfs2_mark_extent_written(inode, &et, handle,
>   						ue->ue_cpos, 1,
>   						ue->ue_phys,
> @@ -2361,24 +2400,37 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>   			mlog_errno(ret);
>   			break;
>   		}
> +
> +		dwc->dw_zero_count--;
> +		list_del_init(&ue->ue_node);
> +		spin_lock(&oi->ip_lock);
> +		list_del_init(&ue->ue_ip_node);
> +		spin_unlock(&oi->ip_lock);
> +		kfree(ue);
>   	}
> 
> -	if (end > i_size_read(inode)) {
> +	if (!restart_func && end > i_size_read(inode)) {
>   		ret = ocfs2_set_inode_size(handle, inode, di_bh, end);
>   		if (ret < 0)
>   			mlog_errno(ret);
>   	}
> +
>   commit:
>   	ocfs2_commit_trans(osb, handle);
> +free_ac:
> +	if (meta_ac) {
> +		ocfs2_free_alloc_context(meta_ac);
> +		meta_ac = NULL;
> +	}
> +	if (restart_func) {
> +		restart_func = 0;
> +		goto restart_all;
> +	}
>   unlock:
>   	up_write(&oi->ip_alloc_sem);
>   	ocfs2_inode_unlock(inode, 1);
>   	brelse(di_bh);
>   out:
> -	if (data_ac)
> -		ocfs2_free_alloc_context(data_ac);
> -	if (meta_ac)
> -		ocfs2_free_alloc_context(meta_ac);
>   	ocfs2_run_deallocs(osb, &dealloc);
>   	if (locked)
>   		inode_unlock(inode);
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-24  7:03           ` Changwei Ge
@ 2017-11-24 10:06             ` alex chen
  0 siblings, 0 replies; 12+ messages in thread
From: alex chen @ 2017-11-24 10:06 UTC (permalink / raw)
  To: ocfs2-devel

Hi Changwei,

Thanks for your reviews.

On 2017/11/24 15:03, Changwei Ge wrote:
> Hi Alex,
> I just reviewed your patch and a few questions were come up with.
> 
> On 2017/11/24 13:49, alex chen wrote:
>> Hi John,
>>
>> I think a better method to solve this problem.
>>
>> On 2017/11/22 5:05, John Lightsey wrote:
>>> On Tue, 2017-11-21 at 05:58 +0000, Changwei Ge wrote:
>>>
>>>> Can your tell me how did you format your volume?
>>>> What's your _cluster size_ and _block size_?
>>>> Your can obtain such information via debugfs.ocfs2 <your volume> -R
>>>> 'stats' | grep 'Cluster Size'
>>>>
>>>> It's better for you provide a way to reproduce this issue so that we
>>>> can
>>>> perform some test.
>>>>
>>>
>>> The issue recurred in our cluster today, so at best my patch is just
>>> decreasing the frequency of the crashes.
>>
>> We need to check the free number of the records in each loop to mark extent written,
>> because the last extent block may be changed through many times marking extent written
>> and the num_free_extents also be changed. In the worst case, the num_free_extents may
>> become less than at the beginning of the loop. So we should not estimate the free
>> number of the records at the beginning of the loop to mark extent written.
>>
> 
>  From John's description, ::dw_zero_count was not calculated properly 
> and your patch seems not to address this issue. Do you agree that it's 
> possible that we get a wrong ::dw_zero_count?

Yes, the old dw_zero_count is not calculated properly, but we don't use it because
we just need know the max needed number of splinting the one extent.
> 
>> I'd appreciate it if you could test the following patch and feedback the result.
>>
>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>> ---
>>   fs/ocfs2/aops.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 64 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index d151632..80725f2 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
>> @@ -2272,6 +2272,35 @@ static int ocfs2_dio_wr_get_block(struct inode *inode, sector_t iblock,
>>   	return ret;
>>   }
>>
>> +static int ocfs2_dio_should_restart(struct ocfs2_extent_tree *et,
>> +		struct ocfs2_alloc_context *meta_ac, int max_rec_needed)
>> +{
>> +	int status = 0, free_extents;
>> +
>> +	free_extents = ocfs2_num_free_extents(et);
>> +	if (free_extents < 0) {
>> +		status = free_extents;
>> +		mlog_errno(status);
>> +		return status;
>> +	}
>> +
>> +	/*
>> +	 * there are two cases which could cause us to EAGAIN in the
>> +	 * we-need-more-metadata case:
>> +	 * 1) we haven't reserved *any*
>> +	 * 2) we are so fragmented, we've needed to add metadata too
>> +	 *    many times.
> 
> What does point 2 mean?

The point 2 mean we have reserved metadata, but the left bit in allocate context
is not enough for allocating a new extent block.

> 
>> +	 */
>> +	if (free_extents < max_rec_needed) {
>> +		if (!meta_ac || (ocfs2_alloc_context_bits_left(meta_ac)
>> +				< ocfs2_extend_meta_needed(et->et_root_el)))
>> +			status = 1;
>> +	}
>> +
>> +	return status;
>> +}
>> +
>> +#define OCFS2_MAX_REC_NEED_SPLIT 2
> 
> Um, I don't figure why the Maximum value is 2, I suppose this is less 
> than previously estimated one.

the max needed number of splinting the one extent in the worst case - that we're writing in
the middle of the extent.

> 
>>   static int ocfs2_dio_end_io_write(struct inode *inode,
>>   				  struct ocfs2_dio_write_ctxt *dwc,
>>   				  loff_t offset,
>> @@ -2281,14 +2310,13 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>>   	struct ocfs2_extent_tree et;
>>   	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>   	struct ocfs2_inode_info *oi = OCFS2_I(inode);
>> -	struct ocfs2_unwritten_extent *ue = NULL;
>> +	struct ocfs2_unwritten_extent *ue = NULL, *tmp_ue;
>>   	struct buffer_head *di_bh = NULL;
>>   	struct ocfs2_dinode *di;
>> -	struct ocfs2_alloc_context *data_ac = NULL;
>>   	struct ocfs2_alloc_context *meta_ac = NULL;
>>   	handle_t *handle = NULL;
>>   	loff_t end = offset + bytes;
>> -	int ret = 0, credits = 0, locked = 0;
>> +	int ret = 0, credits = 0, locked = 0, restart_func = 0;
> 
> I think _ restart_func_ is not named properly. Can we change it into 
> realloc or something like that?
> 
>>
>>   	ocfs2_init_dealloc_ctxt(&dealloc);
>>
>> @@ -2330,8 +2358,9 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>>
>>   	ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode), di_bh);
>>
>> -	ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2,
>> -				    &data_ac, &meta_ac);
>> +restart_all:
>> +	ret = ocfs2_lock_allocators(inode, &et, 0, OCFS2_MAX_REC_NEED_SPLIT,
>> +				    NULL, &meta_ac);
>>   	if (ret) {
>>   		mlog_errno(ret);
>>   		goto unlock;
>> @@ -2343,7 +2372,7 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>>   	if (IS_ERR(handle)) {
>>   		ret = PTR_ERR(handle);
>>   		mlog_errno(ret);
>> -		goto unlock;
>> +		goto free_ac;
>>   	}
>>   	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
>>   				      OCFS2_JOURNAL_ACCESS_WRITE);
>> @@ -2352,7 +2381,17 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>>   		goto commit;
>>   	}
>>
>> -	list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
>> +	list_for_each_entry_safe(ue, tmp_ue, &dwc->dw_zero_list, ue_node) {
>> +		ret = ocfs2_dio_should_restart(&et, meta_ac,
>> +				OCFS2_MAX_REC_NEED_SPLIT * 2);
> 
> Why extent block's free record number would change during marking 
> written, which compels code path has to re-alloc.
> 

One way to set the last extent block is as following:
ocfs2_mark_extent_written
 ocfs2_change_extent_flag
  ocfs2_split_extent
   ocfs2_try_to_merge_extent
    ocfs2_rotate_tree_left
     __ocfs2_rotate_tree_left
      ocfs2_et_set_last_eb_blk


Thanks,
Alex

> Thanks,
> Changwei
> 
>> +		if (ret < 0) {
>> +			mlog_errno(ret);
>> +			break;
>> +		} else if (ret == 1) {
>> +			restart_func = 1;
>> +			break;
>> +		}
>> +
>>   		ret = ocfs2_mark_extent_written(inode, &et, handle,
>>   						ue->ue_cpos, 1,
>>   						ue->ue_phys,
>> @@ -2361,24 +2400,37 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>>   			mlog_errno(ret);
>>   			break;
>>   		}
>> +
>> +		dwc->dw_zero_count--;
>> +		list_del_init(&ue->ue_node);
>> +		spin_lock(&oi->ip_lock);
>> +		list_del_init(&ue->ue_ip_node);
>> +		spin_unlock(&oi->ip_lock);
>> +		kfree(ue);
>>   	}
>>
>> -	if (end > i_size_read(inode)) {
>> +	if (!restart_func && end > i_size_read(inode)) {
>>   		ret = ocfs2_set_inode_size(handle, inode, di_bh, end);
>>   		if (ret < 0)
>>   			mlog_errno(ret);
>>   	}
>> +
>>   commit:
>>   	ocfs2_commit_trans(osb, handle);
>> +free_ac:
>> +	if (meta_ac) {
>> +		ocfs2_free_alloc_context(meta_ac);
>> +		meta_ac = NULL;
>> +	}
>> +	if (restart_func) {
>> +		restart_func = 0;
>> +		goto restart_all;
>> +	}
>>   unlock:
>>   	up_write(&oi->ip_alloc_sem);
>>   	ocfs2_inode_unlock(inode, 1);
>>   	brelse(di_bh);
>>   out:
>> -	if (data_ac)
>> -		ocfs2_free_alloc_context(data_ac);
>> -	if (meta_ac)
>> -		ocfs2_free_alloc_context(meta_ac);
>>   	ocfs2_run_deallocs(osb, &dealloc);
>>   	if (locked)
>>   		inode_unlock(inode);
>>
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-24  5:46         ` alex chen
  2017-11-24  7:03           ` Changwei Ge
@ 2017-11-28 14:34           ` John Lightsey
  2017-11-29  4:37             ` alex chen
  1 sibling, 1 reply; 12+ messages in thread
From: John Lightsey @ 2017-11-28 14:34 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, 2017-11-24 at 13:46 +0800, alex chen wrote:
> We need to check the free number of the records in each loop to mark
> extent written,
> because the last extent block may be changed through many times
> marking extent written
> and the num_free_extents also be changed. In the worst case, the
> num_free_extents may
> become less than at the beginning of the loop. So we should not
> estimate the free
> number of the records at the beginning of the loop to mark extent
> written.
> 
> I'd appreciate it if you could test the following patch and feedback
> the result.

I managed to reproduce the bug in a test environment using the
following method. Some of the specific details here are definitely
irrelevant:

- Setup a 20GB iscsi lun going to a spinning disk drive.

- Configure the OCFS cluster with three KVM VMs.

- Connect the iscsi lun to all three VMs.

- Format an OCFS2 partition on the iscsi lun with block size 1k and
cluster size 4k.

- Mount the OCFS2 partition on one VM.

- Write out a 1GB file with a random pattern of 4k chunks. 4/5 of the
4k chunks are filled with nulls. 1/5 are filled with data.

- Run fallocate -d <filename> to make sure the file is sparse.

- Copy the test file so that the next step can be run repeatedly with
copies.

- Use directio to rewrite the copy of the file in 64k chunks of null
bytes.


In my test setup, the assertion failure happens on the next loop
iteration after the number of free extents drops from 59 to 0. The call
to ocfs2_split_extent() in ocfs2_change_extent_flag() is what actually
reduces the number of free extents to 0. The count drops all at once in
this case, not by 1 or 2 per loop iteration.

With your patch applied, it does handle this sudden reduction in the
number of free extents, and it's able to entirely overwrite the 1GB
file without any problems.

Is it safe to bring up a few nodes in our production OCFS2 cluster with
the patched 4.9 kernel while the remainder nodes are running a 3.16
kernel?

The downtime required to switch our cluster forward to a 4.9 kernel and
then back to a 3.16 kernel is hard to justify, but I can definitely
test one or two nodes in our production environment if it will be a
realistic test.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20171128/a8f6a070/attachment.bin 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
  2017-11-28 14:34           ` John Lightsey
@ 2017-11-29  4:37             ` alex chen
  0 siblings, 0 replies; 12+ messages in thread
From: alex chen @ 2017-11-29  4:37 UTC (permalink / raw)
  To: ocfs2-devel

Hi John,

On 2017/11/28 22:34, John Lightsey wrote:
> On Fri, 2017-11-24 at 13:46 +0800, alex chen wrote:
>> We need to check the free number of the records in each loop to mark
>> extent written,
>> because the last extent block may be changed through many times
>> marking extent written
>> and the num_free_extents also be changed. In the worst case, the
>> num_free_extents may
>> become less than at the beginning of the loop. So we should not
>> estimate the free
>> number of the records at the beginning of the loop to mark extent
>> written.
>>
>> I'd appreciate it if you could test the following patch and feedback
>> the result.
> 
> I managed to reproduce the bug in a test environment using the
> following method. Some of the specific details here are definitely
> irrelevant:
> 
> - Setup a 20GB iscsi lun going to a spinning disk drive.
> 
> - Configure the OCFS cluster with three KVM VMs.
> 
> - Connect the iscsi lun to all three VMs.
> 
> - Format an OCFS2 partition on the iscsi lun with block size 1k and
> cluster size 4k.
> 
> - Mount the OCFS2 partition on one VM.
> 
> - Write out a 1GB file with a random pattern of 4k chunks. 4/5 of the
> 4k chunks are filled with nulls. 1/5 are filled with data.
> 
> - Run fallocate -d <filename> to make sure the file is sparse.
> 
> - Copy the test file so that the next step can be run repeatedly with
> copies.
> 
> - Use directio to rewrite the copy of the file in 64k chunks of null
> bytes.
> 
> 
> In my test setup, the assertion failure happens on the next loop
> iteration after the number of free extents drops from 59 to 0. The call
> to ocfs2_split_extent() in ocfs2_change_extent_flag() is what actually
> reduces the number of free extents to 0. The count drops all at once in
> this case, not by 1 or 2 per loop iteration.
> 
> With your patch applied, it does handle this sudden reduction in the
> number of free extents, and it's able to entirely overwrite the 1GB
> file without any problems.

Thanks for your test.

> 
> Is it safe to bring up a few nodes in our production OCFS2 cluster with
> the patched 4.9 kernel while the remainder nodes are running a 3.16
> kernel?
>

IMO, it is best to ensure the kernel version of nodes in the cluster is consistent.

> The downtime required to switch our cluster forward to a 4.9 kernel and
> then back to a 3.16 kernel is hard to justify, but I can definitely
> test one or two nodes in our production environment if it will be a
> realistic test.
> 
I think this patch is only tested in one node because we lock the inode_lock
when we make the extent written.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-11-29  4:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-20 18:54 [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514! John Lightsey
2017-11-21  0:58 ` Changwei Ge
2017-11-21  2:45   ` John Lightsey
2017-11-21  5:58     ` Changwei Ge
2017-11-21 21:05       ` John Lightsey
2017-11-24  5:46         ` alex chen
2017-11-24  7:03           ` Changwei Ge
2017-11-24 10:06             ` alex chen
2017-11-28 14:34           ` John Lightsey
2017-11-29  4:37             ` alex chen
2017-11-21  3:04   ` piaojun
2017-11-21  4:24     ` John Lightsey

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.