* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
@ 2015-12-17 15:08 Gang He
  2015-12-18  9:09 ` Eric Ren
  2015-12-25  9:05 ` Eric Ren
  0 siblings, 2 replies; 8+ messages in thread
From: Gang He @ 2015-12-17 15:08 UTC (permalink / raw)
  To: ocfs2-devel

Hello Mark and all,
In the past days, Eric and I have been looking at a customer issue: the customer complains that buffered reads sometimes take too long (1 - 10 seconds) when reading and writing the same file from different nodes concurrently. Some days ago I sent a mail to the list for discussion; you can read the details via the link https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html.
This problem does not happen under SLES10 (sp1 - sp4); it appeared after the customer upgraded his Linux OS to SLES11 (sp3 or sp4). This is why the customer complains, and he hopes we can investigate how to make the OCFS2 buffered read/write behavior consistent with SLES10.
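
For reference, here is a rough sketch of the reader side of such a workload (the file path, the 1MB size, and the 1s pause are my assumptions, not taken verbatim from the customer's program); one node runs this loop while another node keeps writing the same file:

/* reader.c - time each buffered 1MB read of a shared OCFS2 file */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
        char *buf = malloc(1 << 20);
        int fd = open("/mnt/ocfs2/shared.dat", O_RDONLY);
        struct timespec t0, t1;

        if (fd < 0 || !buf)
                return 1;
        for (;;) {
                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pread(fd, buf, 1 << 20, 0) <= 0)    /* buffered read */
                        break;
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("read took %.3f s\n", (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_nsec - t0.tv_nsec) / 1e9);
                sleep(1);       /* the customer program pauses ~1s per I/O */
        }
        return 0;
}

Most iterations finish quickly, but under contention a single read occasionally stalls for seconds.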
Through code review and some testing, we found the root cause of buffer reads getting starved.
The suspicious code is in aops.c:
static int ocfs2_readpage(struct file *file, struct page *page)
{
        struct inode *inode = page->mapping->host;
        struct ocfs2_inode_info *oi = OCFS2_I(inode);
        loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
        int ret, unlock = 1;
        long delta;
        struct timespec t_enter, t_mid1, t_mid2, t_exit;

        trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
                             (page ? page->index : 0));

        ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<= here, taking the lock in a nonblocking way causes many retries and costs too much time
        if (ret != 0) {
                if (ret == AOP_TRUNCATED_PAGE)
                        unlock = 0;
                mlog_errno(ret);
                goto out;
        }

        if (down_read_trylock(&oi->ip_alloc_sem) == 0) {  <<= here, the same problem as above
                /*
                 * Unlock the page and cycle ip_alloc_sem so that we don't
                 * busyloop waiting for ip_alloc_sem to unlock
                 */
                ret = AOP_TRUNCATED_PAGE;
                unlock_page(page);
                unlock = 0;
                down_read(&oi->ip_alloc_sem);
                up_read(&oi->ip_alloc_sem);
                goto out_inode_unlock;
        }


As you can see, taking the lock in a nonblocking way causes many retries and costs too much time.
We cannot simply change the code to take the lock in a blocking way, as that introduces a deadlock; we actually tested the blocking approach here and ran into deadlocks.
In the SLES10 source code, however, no lock in the buffered read/write path is taken in a nonblocking way, which is why buffered reads and writes get I/O fairly when reading/writing the same file from multiple nodes.
Why do the deadlocks happen on SLES11? Looking at the source code, there have been some changes, especially around the inode's ip_alloc_sem lock.
On SLES11, taking ip_alloc_sem was moved into ocfs2_readpage and ocfs2_write_begin. Why did we need to do that? It introduces a deadlock factor: to avoid the deadlocks, we have to take some locks in ocfs2_readpage in a nonblocking way, which makes buffered reads unfair in getting I/O. On top of that, to avoid a CPU busy loop, code was added to take the lock in a blocking way whenever the nonblocking attempt fails, which also wastes a lot of time.
Finally, I want to discuss with you how to fix this issue. Could we move the ip_alloc_sem locking back to ocfs2_file_aio_read/ocfs2_file_aio_write, taking it before calling into ocfs2_readpage/ocfs2_write_begin, just like SLES10?
Since I do not have enough background on these code changes, I hope you can give some comments.
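
To make the deadlock factor concrete, here is a minimal sketch of the inversion as I read the code paths (illustrative only, not verbatim kernel code): the VFS enters ->readpage with the page already locked, while the write path takes ip_alloc_sem before locking pages.

/* Reader: the VFS has already done lock_page(page) before ->readpage runs */
static void reader_side(struct page *page, struct ocfs2_inode_info *oi)
{
        down_read(&oi->ip_alloc_sem);   /* blocks while a writer holds it */
        /* ... fill the page ... */
        up_read(&oi->ip_alloc_sem);
}

/* Writer: ocfs2_write_begin takes ip_alloc_sem first, then page locks */
static void writer_side(struct page *page, struct ocfs2_inode_info *oi)
{
        down_write(&oi->ip_alloc_sem);
        lock_page(page);                /* blocks while the reader holds it */
        /* ... ABBA: each side waits for the lock the other holds ... */
        unlock_page(page);
        up_write(&oi->ip_alloc_sem);
}

This is why a plain down_read() in ocfs2_readpage can deadlock, and why the trylock + AOP_TRUNCATED_PAGE retry was introduced instead.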

Thanks a lot.
Gang


* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
  2015-12-17 15:08 [Ocfs2-devel] The root cause analysis about buffer read getting starvation Gang He
@ 2015-12-18  9:09 ` Eric Ren
  2015-12-18 23:28   ` Mark Fasheh
  2015-12-25  9:05 ` Eric Ren
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Ren @ 2015-12-18  9:09 UTC (permalink / raw)
  To: ocfs2-devel

Hi all,

On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote: 
> Hello Mark and all,
> [ snip ]
> In the SLES10 source code, however, no lock in the buffered read/write path is taken in a nonblocking way, which is why buffered reads and writes get I/O fairly when reading/writing the same file from multiple nodes.
SLES10, with a kernel version around 2.6.16.x, used the blocking way, i.e. down_read(), which has a
potential deadlock between the page lock and ip_alloc_sem when one node gets the cluster lock and
does writing and reading on the same file. This deadlock was fixed by this commit:
---
commit e9dfc0b2bc42761410e8db6c252c6c5889e178b8
Author: Mark Fasheh <mark.fasheh@oracle.com>
Date:   Mon May 14 11:38:51 2007 -0700

    ocfs2: trylock in ocfs2_readpage()

    Similarly to the page lock / cluster lock inversion in ocfs2_readpage, we
    can deadlock on ip_alloc_sem. We can down_read_trylock() instead and just
    return AOP_TRUNCATED_PAGE if the operation fails.

    Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 8e7cafb..3030670 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -222,7 +222,10 @@ static int ocfs2_readpage(struct file *file, struct page *page)
                goto out;
        }

-       down_read(&OCFS2_I(inode)->ip_alloc_sem);
+       if (down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem) == 0) {
+               ret = AOP_TRUNCATED_PAGE;
+               goto out_meta_unlock;
+       }

        /*
         * i_size might have just been updated as we grabed the meta lock.  We
@@ -258,6 +261,7 @@ static int ocfs2_readpage(struct file *file, struct page *page)
        ocfs2_data_unlock(inode, 0);
 out_alloc:
        up_read(&OCFS2_I(inode)->ip_alloc_sem);
+out_meta_unlock:
        ocfs2_meta_unlock(inode, 0);
 out:
        if (unlock)
---

But somehow, with this patch, performance in this scenario becomes very bad. I don't see how this could happen: the reading node
has only one thread reading the shared file, so down_read_trylock() should always get ip_alloc_sem successfully, right? If not, who else may be racing for ip_alloc_sem?

Thanks,
Eric


* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
  2015-12-18  9:09 ` Eric Ren
@ 2015-12-18 23:28   ` Mark Fasheh
  2015-12-21  2:53     ` Gang He
  2015-12-21 11:23     ` Eric Ren
  0 siblings, 2 replies; 8+ messages in thread
From: Mark Fasheh @ 2015-12-18 23:28 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Dec 18, 2015 at 05:09:39PM +0800, Eric Ren wrote:
> Hi all,
> 
> On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote: 
> > Hello Mark and all,
> > In the past days, I and Eric were looking at a customer issue, the customer is complaining that buffer reading sometimes lasts too much time ( 1 - 10 seconds) in case reading/writing the same file from different nodes concurrently, some day ago I sent a mail to the list for some discussions, you can read some details via the link https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html.
> > But, this problem does not happen under SLES10 (sp1 - sp4), the customer upgraded his Linux OS to SLES11(sp3 or sp4), the problem happened, this is why the customer complains, he hope we can give a investigation, to see how to make OCFS2 buffer reading/writing behavior be consistent with SLES10.
> > According to our code reviewing and some testings, we found that the root cause to let buffer read get starvation.
> > The suspicious code in aops.c 
> >  274 static int ocfs2_readpage(struct file *file, struct page *page)
> >  275 {
> >  276         struct inode *inode = page->mapping->host;
> >  277         struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >  278         loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
> >  279         int ret, unlock = 1;
> >  280         long delta;
> >  281         struct timespec t_enter, t_mid1, t_mid2, t_exit;
> >  282
> >  283         trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
> >  284                              (page ? page->index : 0));
> >  285
> >  286         ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<= here, using nonblock way to get lock will bring many times retry, spend too much time
> >  287         if (ret != 0) {
> >  288                 if (ret == AOP_TRUNCATED_PAGE)
> >  289                         unlock = 0;
> >  290                 mlog_errno(ret);
> >  291                 goto out;
> >  292         }
> >  293
> >  294         if (down_read_trylock(&oi->ip_alloc_sem) == 0) {  <<= here, the same problem with above
> >  295                 /*
> >  296                  * Unlock the page and cycle ip_alloc_sem so that we don't
> >  297                  * busyloop waiting for ip_alloc_sem to unlock
> >  298                  */
> >  299                 ret = AOP_TRUNCATED_PAGE;
> >  300                 unlock_page(page);
> >  301                 unlock = 0;
> >  302                 down_read(&oi->ip_alloc_sem);
> >  303                 up_read(&oi->ip_alloc_sem);
> >  304                 goto out_inode_unlock;
> >  305         }
> > 
> > 
> > As you can see, using nonblock way to get lock will bring many time retry, spend too much time.
> > We can't modify the code to using block way to get the lock, as this will bring a dead lock.
> > Actually, we did some testing when trying to use block way to get the lock here, the deadlock problems were encountered.
> > But, in SLES10 source code, there is not any using nonblock way to get lock in buffer reading/writing, this is why buffer reading/writing are very fair to get IO when reading/writing the same file from multiple nodes. 
> SLES10, with a kernel version around 2.6.16.x, used the blocking way, i.e. down_read(), which has a
> potential deadlock between the page lock and ip_alloc_sem when one node gets the cluster lock and
> does writing and reading on the same file. This deadlock was fixed by this commit:

You are correct here - the change was introduced to solve a deadlock between
the page lock and ip_alloc_sem. Basically, ->readpage is going to be called
with the page lock held and we need to be aware of that.
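
To spell out why returning AOP_TRUNCATED_PAGE is safe here: the generic read path holds the page lock across ->readpage and simply retries on that return code. Roughly, quoting the 2.6-era do_generic_file_read() in mm/filemap.c from memory, so treat this as a sketch:

readpage:
        error = mapping->a_ops->readpage(filp, page);
        if (unlikely(error)) {
                if (error == AOP_TRUNCATED_PAGE) {
                        page_cache_release(page);
                        goto find_page;         /* look the page up and retry */
                }
                goto readpage_error;
        }

So the aop can drop its locks, return AOP_TRUNCATED_PAGE, and the caller re-enters it with a freshly locked page.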


> [ snip ]
> But somehow, with this patch, performance in this scenario becomes very bad. I don't see how this could happen: the reading node
> has only one thread reading the shared file, so down_read_trylock() should always get ip_alloc_sem successfully, right? If not, who else may be racing for ip_alloc_sem?

Hmm, there's only one thread and it can't get the lock? Any chance you might
put some debug prints around where we acquire ip_alloc_sem? It would be
interesting to see where it gets taken in a way that prevents this from happening.
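
Something of this sort around the trylock in ocfs2_readpage() would do (illustrative only; the printed message is my suggestion, the surrounding lines are the existing function body):

        if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
                printk(KERN_INFO "ocfs2: ip_alloc_sem busy, ino %llu, comm %s\n",
                       (unsigned long long)oi->ip_blkno, current->comm);
                ret = AOP_TRUNCATED_PAGE;
                unlock_page(page);
                unlock = 0;
                down_read(&oi->ip_alloc_sem);
                up_read(&oi->ip_alloc_sem);
                goto out_inode_unlock;
        }

A matching print just before down_write(&oi->ip_alloc_sem) in the write path would show who was holding it.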
	--Mark

--
Mark Fasheh


* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
  2015-12-18 23:28   ` Mark Fasheh
@ 2015-12-21  2:53     ` Gang He
  2015-12-21  7:36       ` Junxiao Bi
  2015-12-21 11:23     ` Eric Ren
  1 sibling, 1 reply; 8+ messages in thread
From: Gang He @ 2015-12-21  2:53 UTC (permalink / raw)
  To: ocfs2-devel

Hello Mark and all,


> [ snip ]
> 
> You are correct here - the change was introduced to solve a deadlock between
> the page lock and ip_alloc_sem. Basically, ->readpage is going to be called
> with the page lock held and we need to be aware of that.
Hello guys, my main question is: why did we change the ip_alloc_sem lock/unlock position from SLES10 to SLES11?
In SLES10, we take ip_alloc_sem before calling generic_file_read() or generic_file_write_nolock() in file.c,
but in SLES11 we take ip_alloc_sem inside ocfs2_readpage() in aops.c; moreover, the order of taking/releasing the page lock and ip_alloc_sem is NOT consistent between the read and write paths.
I just want to know the background behind this code evolution. If we keep taking ip_alloc_sem before calling generic_file_aio_read() in SLES11, can the deadlock be avoided?
Then we would not need to take the lock in a nonblocking way in readpage(), buffered reads would not get starved, and the read/write I/O behavior would be the same as SLES10.

Thanks
Gang



* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
  2015-12-21  2:53     ` Gang He
@ 2015-12-21  7:36       ` Junxiao Bi
  2015-12-21  9:12         ` Gang He
  0 siblings, 1 reply; 8+ messages in thread
From: Junxiao Bi @ 2015-12-21  7:36 UTC (permalink / raw)
  To: ocfs2-devel

On 12/21/2015 10:53 AM, Gang He wrote:
> Hello Mark and all,
> 
> 
[ snip ]

>> > 
>> > You are correct here - the change was introduced to solve a deadlock between
>> > the page lock and ip_alloc_sem. Basically, ->readpage is going to be called
>> > with the page lock held and we need to be aware of that.
> Hello guys, my main question is: why did we change the ip_alloc_sem lock/unlock position from SLES10 to SLES11?
> In SLES10, we take ip_alloc_sem before calling generic_file_read() or generic_file_write_nolock() in file.c,
> but in SLES11 we take ip_alloc_sem inside ocfs2_readpage() in aops.c; moreover, the order of taking/releasing the page lock and ip_alloc_sem is NOT consistent between the read and write paths.
> I just want to know the background behind this code evolution. If we keep taking ip_alloc_sem before calling generic_file_aio_read() in SLES11, can the deadlock be avoided?
> Then we would not need to take the lock in a nonblocking way in readpage(), buffered reads would not get starved, and the read/write I/O behavior would be the same as SLES10.
Holding locks across generic_file_read() stops the reader and writer from
running in parallel.
For ip_alloc_sem, losing that parallelism is bad, as the reader and writer may
be touching different pages.
For the inode lock it looks acceptable: a reader and writer running in parallel
cause a lock ping-pong issue, repeatedly truncating and flushing the page
caches, which gives bad performance anyway. Of course, the recursive locking
issue needs fixing, or it will be very easy to run into a deadlock.

Thanks,
Junxiao.
> 
> Thanks
> Gang
> 


* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
  2015-12-21  7:36       ` Junxiao Bi
@ 2015-12-21  9:12         ` Gang He
  0 siblings, 0 replies; 8+ messages in thread
From: Gang He @ 2015-12-21  9:12 UTC (permalink / raw)
  To: ocfs2-devel

Hello guys,


> On 12/21/2015 10:53 AM, Gang He wrote:
> [ snip ]
> Holding locks across generic_file_read() stops the reader and writer from
> running in parallel.
> For ip_alloc_sem, losing that parallelism is bad, as the reader and writer may
> be touching different pages.
> For the inode lock it looks acceptable: a reader and writer running in parallel
> cause a lock ping-pong issue, repeatedly truncating and flushing the page
> caches, which gives bad performance anyway. Of course, the recursive locking
> issue needs fixing, or it will be very easy to run into a deadlock.
The main concern is: why do we not use the related locks in read/write the SLES10 way?
The customer runs a program that reads/writes the same file from different nodes, and
the read behavior is not the same as on SLES10; an occasional long read I/O keeps
the customer's program from running normally. I want to discuss whether we can do something to fix this problem.
Otherwise, if this read behavior under parallel read/write I/O is considered a by-design issue, we will tell the customer so.
Here, the read code in SLES10 looks like this:
static ssize_t ocfs2_file_read(struct file *filp,
                               char __user *buf,
                               size_t count,
                               loff_t *ppos)
{
   ...
  
        ret = ocfs2_lock_buffer_inodes(&ctxt, NULL);  <<= cluster meta/data lock
        if (ret < 0) {
                mlog_errno(ret);
                goto bail_unlock;
        }

        down_read(&OCFS2_I(inode)->ip_alloc_sem); <<= the lock is taken before calling generic_file_read, so we need not take any lock in a nonblocking, retrying way inside that function, and reads get I/O fairly.

        ret = generic_file_read(filp, buf, count, ppos);

        up_read(&OCFS2_I(inode)->ip_alloc_sem);
   ...
}
   
But the read code in SLES11 looks like this:
there is no cluster meta lock in ocfs2_file_aio_read(); the locks were moved into the readpage function.

static int ocfs2_readpage(struct file *file, struct page *page)
{
        ...
        ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page); <<= here, taking the lock in a nonblocking way is not fair when there is a parallel write from another node; the subsequent blocking lock plus retries (several hundred times) can cost a long time. This is why, in the customer's program, an occasional read costs far too long (1 - 15 secs).
        if (ret != 0) {
                if (ret == AOP_TRUNCATED_PAGE)
                        unlock = 0;
                mlog_errno(ret);
                goto out;
        }

        if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
                /*
                 * Unlock the page and cycle ip_alloc_sem so that we don't
                 * busyloop waiting for ip_alloc_sem to unlock
                 */
                ret = AOP_TRUNCATED_PAGE;
                unlock_page(page);
                unlock = 0;
                down_read(&oi->ip_alloc_sem);
                up_read(&oi->ip_alloc_sem);
                goto out_inode_unlock;
        }
  ...
}
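
For reference, the retry mechanism lives in ocfs2_inode_lock_with_page() in dlmglue.c; quoting it from memory (so minor details may differ from the exact SLES11 source), it tries the cluster lock nonblockingly, and on -EAGAIN drops the page lock, cycles the cluster lock blockingly, and returns AOP_TRUNCATED_PAGE so the VFS re-calls ->readpage:

int ocfs2_inode_lock_with_page(struct inode *inode,
                               struct buffer_head **ret_bh,
                               int ex,
                               struct page *page)
{
        int ret;

        ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
        if (ret == -EAGAIN) {
                unlock_page(page);
                /* take and drop the lock blockingly to avoid busy-looping */
                if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
                        ocfs2_inode_unlock(inode, ex);
                ret = AOP_TRUNCATED_PAGE;
        }

        return ret;
}

Each AOP_TRUNCATED_PAGE round trip goes all the way back through the page cache lookup, which is where the hundreds of retries come from.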

Thanks
Gang

> 
> Thanks,
> Junxiao.
>> 
>> Thanks
>> Gang
>> 


* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
  2015-12-18 23:28   ` Mark Fasheh
  2015-12-21  2:53     ` Gang He
@ 2015-12-21 11:23     ` Eric Ren
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Ren @ 2015-12-21 11:23 UTC (permalink / raw)
  To: ocfs2-devel

Hello Mark,

...snip..
> > SLES10, with a kernel version around 2.6.16.x, used the blocking way, i.e. down_read(), which has a
> > potential deadlock between the page lock and ip_alloc_sem when one node gets the cluster lock and
> > does writing and reading on the same file. This deadlock was fixed by this commit:
> 
> You are correct here - the change was introduced to solve a deadlock between
> the page lock and ip_alloc_sem. Basically, ->readpage is going to be called
> with the page lock held and we need to be aware of that.
...snip..
> > But somehow, with this patch, performance in this scenario becomes very bad. I don't see how this could happen: the reading node
> > has only one thread reading the shared file, so down_read_trylock() should always get ip_alloc_sem successfully, right? If not, who else may be racing for ip_alloc_sem?
> 
> Hmm, there's only one thread and it can't get the lock? Any chance you might
No, it can always get the lock in this case. Sorry, I reported a false testing
result. There are probably two main factors:

1. a non-isolated testing environment, including the nodes, network, and shared disk;
2. the customer's testing program sleeps for 1s after finishing each ~1M read/write,
   so the overlap time of the read/write on the two nodes is random; the shorter the
   overlap time is, the better the performance looks.

Sorry again for taking up your time.
--Eric
> put some debug prints around where we acquire ip_alloc_sem? It would be
> interesting to see where it gets taken in a way that prevents this from happening.
> 	--Mark
> 
> --
> Mark Fasheh
> 


* [Ocfs2-devel] The root cause analysis about buffer read getting starvation
  2015-12-17 15:08 [Ocfs2-devel] The root cause analysis about buffer read getting starvation Gang He
  2015-12-18  9:09 ` Eric Ren
@ 2015-12-25  9:05 ` Eric Ren
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Ren @ 2015-12-25  9:05 UTC (permalink / raw)
  To: ocfs2-devel

Hi Mark and Gang,

> In the past days, Eric and I have been looking at a customer issue: the customer complains that buffered reads sometimes take too long (1 - 10 seconds) when reading and writing the same file from different nodes concurrently. Some days ago I sent a mail to the list for discussion; you can read the details via the link https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html.
> This problem does not happen under SLES10 (sp1 - sp4); it appeared after the customer upgraded his Linux OS to SLES11 (sp3 or sp4). This is why the customer complains, and he hopes we can investigate how to make the OCFS2 buffered read/write behavior consistent with SLES10.
Sorry for the previous false testing report, which must have misled you somehow. Actually, SLES10
sp4 (not sure about sp2/sp3) is free of the problem we're discussing, according to both code
analysis and new, more careful testing. As Gang said, the cluster lock and local lock handling
has changed a lot. The two biggest changes, in my head, are:
1. For read, the cluster lock was pushed down into *_readpage(s), leaving the rest of read(), such as
copying data from the page cache to user space, outside of it. This means the page cache can be
truncated by the downconvert thread, on behalf of another node, before the data has been copied to
user space, especially under fierce contention for the cluster lock.

2. The order of taking the cluster lock and the local locks changed, and to avoid
deadlock, the nonblocking way of getting the cluster lock was introduced. Gang is more concerned
about this, so I won't repeat it ;-)

Here I want to focus on the 1st case, because there is only one thread, either reading or
writing, on each node in this case, so the local locks may not matter too much, apart from the lock
inversion problem.

OK, just some thoughts on the 1st issue. Please correct me if anything is wrong.
1. To be convenient for discussion, I'd like to give the read() and write() paths here:
 vfs_read()
  ocfs2_file_aio_read()
    ocfs2_inode_lock_atime
    ocfs2_inode_unlock()
      generic_file_aio_read()
        do_generic_file_read()
           page_cache_sync_readahead() or page_cache_async_readahead()
              ocfs2_readpages()
                  ocfs2_inode_lock_full(NOBLOCK)
                  down_read_trylock(ip_alloc_sem)
              ocfs2_readpage() /* retried until it succeeds */
                  ocfs2_inode_lock_with_page(NOBLOCK)
                  down_read_trylock(ip_alloc_sem)
                  ocfs2_inode_unlock
              PAGE OK: COPY PAGE TO USER SPACE

 vfs_write()
    ocfs2_file_aio_write()
      ocfs2_rw_lock(EX)
        generic_file_buffered_write()
        generic_perform_write()
         ocfs2_write_begin()
              ocfs2_inode_lock_full()
              down_write(ip_alloc_sem)
          COPY FROM USER SPACE
          ocfs2_write_end()
               up_write(ip_alloc_sem)
               ocfs2_inode_unlock
      ocfs2_rw_unlock()

2. There is a race window in which the page cache gets truncated, making the effort of the readahead and
*_readpage stuff go to waste!

For read(), the cluster lock is only held while doing the readpage() stuff, to get data from
disk into the page cache. When contention on the cluster lock for the same file is rather fierce,
it is likely that an EX lock request from another node has already been queued. Once readpage()
returns, the cluster lock is released, which kicks the downconvert thread into action. Then ocfs2dc will
truncate all cached pages:
---
static int ocfs2_data_convert_worker(struct ocfs2_lock_res *lockres,
                                     int blocking)
{
 ...
        sync_mapping_buffers(mapping);
        if (blocking == DLM_LOCK_EX) {
                truncate_inode_pages(mapping, 0);
        } else {
   ....
---

3. testing
I'd like to elaborate on how I tested, and the resulting data, if you have interest.
Not here though, or this thread would get too long. Overall, it looks like the fiercer
the contention is, the more unfairly read() is treated...

4. kernel version info
Linux sle10sp1 2.6.16.46-0.12-smp #1 SMP Thu May 17 14:00:09 UTC 2007 x86_64 x86_64 x86_64
GNU/Linux

sle10sp4:~ # uname -a
Linux sle10sp4 2.6.16.60-0.85.1-smp #1 SMP Thu Mar 17 11:45:06 UTC 2011 x86_64 x86_64
x86_64 GNU/Linux

5. readpage() on SLES10 sp1 looks like:
---
static int ocfs2_readpage(struct file *file, struct page *page)
{
        int ret;

        mlog_entry("(0x%p, %lu)\n", file, (page ? page->index : 0));

        ret = block_read_full_page(page, ocfs2_get_block);

        mlog_exit(ret);

        return ret;
}
---

I haven't been working on ocfs2 for long, so I'm not really sure whether I'm reporting a stupid
issue again ;-) Looking forward to your comments on this!

Thanks,
Eric

