* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
@ 2015-12-08  3:21 Gang He
  2015-12-08  3:55 ` Joseph Qi
  0 siblings, 1 reply; 8+ messages in thread
From: Gang He @ 2015-12-08  3:21 UTC (permalink / raw)
  To: ocfs2-devel

Hello Guys,

There is an issue from a customer, who complains that a buffered read sometimes takes too long (1 - 10 seconds) when reading and writing the same file from different nodes concurrently.
With the customer's demo code we can also reproduce this issue in-house (running the test program on a SLES11SP4 OCFS2 cluster); it can also be reproduced on openSUSE 13.2 (newer code), but in direct-io mode the issue disappears.
Based on my investigation, the root cause is that buffered I/O uses the cluster lock differently from direct I/O; I do not know why buffered I/O uses the cluster lock this way.
The code details are as below,
in aops.c:
 281 static int ocfs2_readpage(struct file *file, struct page *page)
 282 {
 283         struct inode *inode = page->mapping->host;
 284         struct ocfs2_inode_info *oi = OCFS2_I(inode);
 285         loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
 286         int ret, unlock = 1;
 287
 288         trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
 289                              (page ? page->index : 0));
 290
 291         ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<== this line
 292         if (ret != 0) {
 293                 if (ret == AOP_TRUNCATED_PAGE)
 294                         unlock = 0;
 295                 mlog_errno(ret);
 296                 goto out;
 297         } 
 
in dlmglue.c file,
2442 int ocfs2_inode_lock_with_page(struct inode *inode,
2443                               struct buffer_head **ret_bh,
2444                               int ex,
2445                               struct page *page)
2446 {
2447         int ret;
2448
2449         ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK); <<== here, why use NONBLOCK mode to take the cluster lock? This way the reading I/O can get starved. 
2450         if (ret == -EAGAIN) {
2451                 unlock_page(page);
2452                 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
2453                         ocfs2_inode_unlock(inode, ex);
2454                 ret = AOP_TRUNCATED_PAGE;
2455         }
2456
2457         return ret;
2458 }

If you know the background behind this code, please tell us: why not take the lock in a blocking way when reading a page, so that the reading I/O gets the page fairly when there is concurrent writing I/O from another node?
Second, I tried changing that line from OCFS2_LOCK_NONBLOCK to 0 (switching to the blocking way); the reading I/O is then no longer blocked for so long (which would address the customer's complaint), but a new problem arises: sometimes the reading I/O and writing I/O deadlock (why a deadlock? I am still looking into it).
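
For reference, our in-house reproducer is essentially of the following shape (a hypothetical sketch, not the customer's actual demo code; a second node runs a matching buffered write loop over the same file):

/* reader.c - hypothetical sketch of the read-latency reproducer, not the
 * customer's actual demo code. Build with: gcc -o reader reader.c -lrt
 * Run on node 1 against a file on the shared OCFS2 volume while node 2
 * keeps overwriting the same file with buffered writes. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char buf[4096];
        struct timespec t0, t1;
        double ms;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file-on-ocfs2>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);           /* buffered I/O, no O_DIRECT */
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (;;) {
                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pread(fd, buf, sizeof(buf), 0) < 0) {
                        perror("pread");
                        break;
                }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                     (t1.tv_nsec - t0.tv_nsec) / 1000000.0;
                if (ms > 1000.0)                /* flag reads slower than 1s */
                        printf("slow buffered read: %.0f ms\n", ms);
                usleep(1000);
        }
        close(fd);
        return 0;
}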


Thanks
Gang  

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
  2015-12-08  3:21 [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently Gang He
@ 2015-12-08  3:55 ` Joseph Qi
  2015-12-08  4:51   ` Gang He
  2015-12-08  4:51   ` Eric Ren
  0 siblings, 2 replies; 8+ messages in thread
From: Joseph Qi @ 2015-12-08  3:55 UTC (permalink / raw)
  To: ocfs2-devel

Hi Gang,
Eric and I have discussed this case before.
NONBLOCK is used here because there is a lock inversion between the inode
lock and the page lock. You can refer to the comments above
ocfs2_inode_lock_with_page for details.
Actually, I have found that NONBLOCK mode is only used in such lock
inversion cases.
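
Roughly, the two lock orderings that collide look like this (a simplified
sketch of the call paths, only to illustrate the inversion):

/* thread A on node 1: the VFS calls ->readpage() with the page already
 * locked, so A holds the page lock and then wants the cluster inode lock */
ocfs2_readpage()
 ocfs2_inode_lock_with_page()

/* ocfs2dc on node 1: downconverting the inode lock for a writer on node 2,
 * so it is working on the inode lock and then wants each page lock */
ocfs2_downconvert_thread()
 ocfs2_data_convert_worker()
  truncate_inode_pages()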

Thanks,
Joseph

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
  2015-12-08  3:55 ` Joseph Qi
@ 2015-12-08  4:51   ` Gang He
  2015-12-08  4:51   ` Eric Ren
  1 sibling, 0 replies; 8+ messages in thread
From: Gang He @ 2015-12-08  4:51 UTC (permalink / raw)
  To: ocfs2-devel

Hello Joseph,


> Hi Gang,
> Eric and I have discussed this case before.
> Using NONBLOCK here is because there is a lock inversion between inode
> lock and page lock. You can refer to the comments of
> ocfs2_inode_lock_with_page for details.
> Actually I have found that NONBLOCK mode is only used in lock inversion
> cases.
Could you give more details? For an inode object there are three types of cluster locks:
ip_open_lockres is used to protect the open/delete POSIX semantics,
ip_rw_lockres looks to be used to protect writes in the direct-io case (the name "rw" is a bit confusing),
and ip_inode_lockres is used more widely. Could you explain further why there has to be a lock inversion here, rather than taking the lock in a blocking way in the read/write code path?
Second, when the read path cannot get the lock in NONBLOCK mode, why does the next piece of code call ocfs2_inode_lock(inode, ret_bh, ex) in a blocking way?
Both lock wrappers use the same cluster lock (ip_inode_lockres), and that second, blocking lock attempt can cost a lot of time.
 
Thanks a lot.
Gang  


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
  2015-12-08  3:55 ` Joseph Qi
  2015-12-08  4:51   ` Gang He
@ 2015-12-08  4:51   ` Eric Ren
  2015-12-08  6:15     ` Joseph Qi
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Ren @ 2015-12-08  4:51 UTC (permalink / raw)
  To: ocfs2-devel

Hi,

On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote: 
> Hi Gang,
> Eric and I have discussed this case before.
> Using NONBLOCK here is because there is a lock inversion between inode
> lock and page lock. You can refer to the comments of
> ocfs2_inode_lock_with_page for details.
> Actually I have found that NONBLOCK mode is only used in lock inversion
> cases.
Yes, those comments help ;-) Let me try to explain it; please correct me if anything is wrong.

On node 1, when calling ocfs2_inode_lock_with_page(), thread A (likely doing the read) has
locked this page. Before thread A steps into ocfs2_inode_lock_full(), thread B on node 2 (doing
the write) requests an EX lock on the same inode; if lockres->l_level is PR, the lockres needs
to be downconverted PR->NL on node 1, i.e. ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().

On node 1, the ocfs2dc thread then proceeds like this:
ocfs2_downconvert_thread()
 ocfs2_downconvert_thread_do_work()
  ocfs2_process_blocked_lock()
   ocfs2_unblock_lock()
    lockres->l_ops->downconvert_worker(lockres, blocking)
      ocfs2_data_convert_worker() {
        ...
        3557         if (blocking == DLM_LOCK_EX) {
        3558                 truncate_inode_pages(mapping, 0);
        3559         } else {
        3560                 /* We only need to wait on the I/O if we're not also
        3561                  * truncating pages because truncate_inode_pages waits
        3562                  * for us above. We don't truncate pages if we're
        3563                  * blocking anything < EXMODE because we want to keep
        3564                  * them around in that case. */
        3565                 filemap_fdatawait(mapping);
        3566         }
        ...
      }
We can see truncate_inode_pages()
            truncate_inode_pages_range()
              trylock_page()/lock_page()/find_lock_page()

If thread A calls ocfs2_inode_lock_full() in a BLOCKing way, there is a window in which the flag
OCFS2_LOCK_BLOCKED and ->l_blocking = EX have already been set by the BAST, i.e. ocfs2_blocking_ast()->
ocfs2_generic_handle_bast(), so thread A may block there.

Now ocfs2dc wants the page lock, but thread A holds the page lock and is blocked because of
thread B. A deadlock occurs, right? Again, please correct me if I got anything wrong.

So the author had to use the NONBLOCK way here, which in turn creates a livelock problem. But I cannot
understand very clearly how the livelock comes up. Joseph or anyone else, could you please elaborate on it?

Thanks,
Eric


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
  2015-12-08  4:51   ` Eric Ren
@ 2015-12-08  6:15     ` Joseph Qi
  2015-12-08  8:26       ` Eric Ren
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Qi @ 2015-12-08  6:15 UTC (permalink / raw)
  To: ocfs2-devel

On 2015/12/8 12:51, Eric Ren wrote:
> Hi,
> 
> On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote: 
>> Hi Gang,
>> Eric and I have discussed this case before.
>> Using NONBLOCK here is because there is a lock inversion between inode
>> lock and page lock. You can refer to the comments of
>> ocfs2_inode_lock_with_page for details.
>> Actually I have found that NONBLOCK mode is only used in lock inversion
>> cases.
> Yes, that comments can helps;-) I try to explain it, if any problem please correct me.
> 
> On node 1, when calling ocfs2_inode_lock_with_page(), thread A(likely doing reading) has
> loced this page. Before thread A stepping into ocfs2_inode_lock_full(), on node 2, thread B(doing
> writing) required EX lock on the same inode, if lockres->l_level is PR, so lockres needs to
> been downconverted PR->NL on node1, i.e. ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().
> 
> On node 1, if tread ocfs2dc proceeds like this:
> ocfs2_downconvert_thread()
>  ocfs2_downconvert_thread_do_work()
>   ocfs2_process_blocked_lock()
>    ocfs2_unblock_lock()
>     lockres->l_ops->downconvert_worker(lockres, blocking)
>       ocfs2_data_convert_worker() {
>         ...
>         3557         if (blocking == DLM_LOCK_EX) {
>         3558                 truncate_inode_pages(mapping, 0);
>         3559         } else {
>         3560                 /* We only need to wait on the I/O if we're not also
>         3561                  * truncating pages because truncate_inode_pages waits
>         3562                  * for us above. We don't truncate pages if we're
>         3563                  * blocking anything < EXMODE because we want to keep
>         3564                  * them around in that case. */
>         3565                 filemap_fdatawait(mapping);
>         3566         }
>         ...
>       }
> We can see truncate_inode_pages()
>             truncate_inode_pages_range()
>               trylock_page()/lock_page()/find_lock_page()
> 
> if thread A call ocfs2_inode_lock_full() in BLOCK way, there's a window that flag
> OCFS2_LOCK_BLOCED and ->l_blocking=EX have been set by BAST like ocfs2_blocking_ast()->
> ocfs2_generic_handle_bast(), so thread A may be blocked.
> 
> Now, ocfs2dc wants page lock, but thread A hold page lock and has been blocked because of
> thread B. Deadlock occurs, right? Again, please correct me if any.
Yes, you are right.

> 
> So, the author had to use UNBLOCK way here thus create livelock problem. But, I cannot understand
> very clearly how livelock come up. Joseph or anyone else, could you please elaborate it?
First it takes the inode lock in NONBLOCK mode; if the request cannot be
fulfilled right away and returns -EAGAIN, it unlocks the page and lets the
downconvert go on. This breaks the deadlock.
But we do have to hold both the page lock and the inode lock to safely
accomplish the request, so it has to return to the VFS and retry with the
page lock. Taking the inode lock and then unlocking it immediately, in a
BLOCKing way, is meant to cache the inode lock locally and increase the
chance that the next retry succeeds. But if another node takes the inode
lock again immediately after this node unlocks it, the flow you described
above happens again and this node still isn't served. And so forth.
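
From the caller's side the retry looks roughly like this (a simplified
sketch, not the exact VFS code):

/* simplified sketch of the generic read path's handling of AOP_TRUNCATED_PAGE */
retry:
        page = find_or_create_page(mapping, index, GFP_KERNEL); /* returned locked */
        err = mapping->a_ops->readpage(file, page);             /* ocfs2_readpage() */
        if (err == AOP_TRUNCATED_PAGE) {
                /* ocfs2 has already unlocked the page; the blocking
                 * lock/unlock in ocfs2_inode_lock_with_page() leaves the
                 * cluster lock cached locally, so the retry is expected to
                 * get the inode lock without waiting */
                page_cache_release(page);
                goto retry;
        }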


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
  2015-12-08  6:15     ` Joseph Qi
@ 2015-12-08  8:26       ` Eric Ren
  2015-12-09  6:07         ` Joseph Qi
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Ren @ 2015-12-08  8:26 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joseph,
On Tue, Dec 08, 2015 at 02:15:17PM +0800, joseph wrote: 
> On 2015/12/8 12:51, Eric Ren wrote:
> > Hi,
> > 
> > On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote: 
> >> Hi Gang,
> >> Eric and I have discussed this case before.
> >> Using NONBLOCK here is because there is a lock inversion between inode
> >> lock and page lock. You can refer to the comments of
> >> ocfs2_inode_lock_with_page for details.
> >> Actually I have found that NONBLOCK mode is only used in lock inversion
> >> cases.
> > Yes, that comments can helps;-) I try to explain it, if any problem please correct me.
> > 
> > On node 1, when calling ocfs2_inode_lock_with_page(), thread A(likely doing reading) has
> > loced this page. Before thread A stepping into ocfs2_inode_lock_full(), on node 2, thread B(doing
> > writing) required EX lock on the same inode, if lockres->l_level is PR, so lockres needs to
> > been downconverted PR->NL on node1, i.e. ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().
> > 
> > On node 1, if tread ocfs2dc proceeds like this:
> > ocfs2_downconvert_thread()
> >  ocfs2_downconvert_thread_do_work()
> >   ocfs2_process_blocked_lock()
> >    ocfs2_unblock_lock()
> >     lockres->l_ops->downconvert_worker(lockres, blocking)
> >       ocfs2_data_convert_worker() {
> >         ...
> >         3557         if (blocking == DLM_LOCK_EX) {
> >         3558                 truncate_inode_pages(mapping, 0);
> >         3559         } else {
> >         3560                 /* We only need to wait on the I/O if we're not also
> >         3561                  * truncating pages because truncate_inode_pages waits
> >         3562                  * for us above. We don't truncate pages if we're
> >         3563                  * blocking anything < EXMODE because we want to keep
> >         3564                  * them around in that case. */
> >         3565                 filemap_fdatawait(mapping);
> >         3566         }
> >         ...
> >       }
> > We can see truncate_inode_pages()
> >             truncate_inode_pages_range()
> >               trylock_page()/lock_page()/find_lock_page()
> > 
> > if thread A call ocfs2_inode_lock_full() in BLOCK way, there's a window that flag
> > OCFS2_LOCK_BLOCED and ->l_blocking=EX have been set by BAST like ocfs2_blocking_ast()->
> > ocfs2_generic_handle_bast(), so thread A may be blocked.
> > 
> > Now, ocfs2dc wants page lock, but thread A hold page lock and has been blocked because of
> > thread B. Deadlock occurs, right? Again, please correct me if any.
> Yes, you are right.
> > 
> > So, the author had to use UNBLOCK way here thus create livelock problem. But, I cannot understand
> > very clearly how livelock come up. Joseph or anyone else, could you please elaborate it?
> Firstly it takes inode lock in NONBLOCK way, if it couldn't be fulfilled
> currently and returns -EAGAIN, then unlock page and make downconvert go
> on. This is to release the deadlock.
> But we do have to take page lock and inode lock to safely accomplish the
> request. So it has to return to VFS and retry with page lock. Here
> taking inode lock and then unlocking it immediately in BLOCK way wants
Thanks! But unfortunately, in our case it often takes a long time waiting here
for the inode lock, for reasons I don't know yet. This is likely why
ocfs2_readpage() takes so long.

I really want to know why we lock & unlock the inode here. I'm glad you mention a reason,
but is it the only one?
> to cache the inode and increase the chance of the next retry. But if
Any reason to cache the inode? And how does this increase the chance of the retry succeeding?
It's really interesting... please elaborate more ;-)

Thanks,
Eric
> another node successfully takes the inode lock again immediately after
> this node unlocks inode, the flow you described above happens again and
> this node still couldn't be fulfilled. And so forth.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
  2015-12-08  8:26       ` Eric Ren
@ 2015-12-09  6:07         ` Joseph Qi
  2015-12-10  6:31           ` Eric Ren
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Qi @ 2015-12-09  6:07 UTC (permalink / raw)
  To: ocfs2-devel

On 2015/12/8 16:26, Eric Ren wrote:
> Hi Joseph,
> On Tue, Dec 08, 2015 at 02:15:17PM +0800, joseph wrote: 
>> On 2015/12/8 12:51, Eric Ren wrote:
>>> Hi,
>>>
>>> On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote: 
>>>> Hi Gang,
>>>> Eric and I have discussed this case before.
>>>> Using NONBLOCK here is because there is a lock inversion between inode
>>>> lock and page lock. You can refer to the comments of
>>>> ocfs2_inode_lock_with_page for details.
>>>> Actually I have found that NONBLOCK mode is only used in lock inversion
>>>> cases.
>>> Yes, that comments can helps;-) I try to explain it, if any problem please correct me.
>>>
>>> On node 1, when calling ocfs2_inode_lock_with_page(), thread A(likely doing reading) has
>>> loced this page. Before thread A stepping into ocfs2_inode_lock_full(), on node 2, thread B(doing
>>> writing) required EX lock on the same inode, if lockres->l_level is PR, so lockres needs to
>>> been downconverted PR->NL on node1, i.e. ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().
>>>
>>> On node 1, if tread ocfs2dc proceeds like this:
>>> ocfs2_downconvert_thread()
>>>  ocfs2_downconvert_thread_do_work()
>>>   ocfs2_process_blocked_lock()
>>>    ocfs2_unblock_lock()
>>>     lockres->l_ops->downconvert_worker(lockres, blocking)
>>>       ocfs2_data_convert_worker() {
>>>         ...
>>>         3557         if (blocking == DLM_LOCK_EX) {
>>>         3558                 truncate_inode_pages(mapping, 0);
>>>         3559         } else {
>>>         3560                 /* We only need to wait on the I/O if we're not also
>>>         3561                  * truncating pages because truncate_inode_pages waits
>>>         3562                  * for us above. We don't truncate pages if we're
>>>         3563                  * blocking anything < EXMODE because we want to keep
>>>         3564                  * them around in that case. */
>>>         3565                 filemap_fdatawait(mapping);
>>>         3566         }
>>>         ...
>>>       }
>>> We can see truncate_inode_pages()
>>>             truncate_inode_pages_range()
>>>               trylock_page()/lock_page()/find_lock_page()
>>>
>>> if thread A call ocfs2_inode_lock_full() in BLOCK way, there's a window that flag
>>> OCFS2_LOCK_BLOCED and ->l_blocking=EX have been set by BAST like ocfs2_blocking_ast()->
>>> ocfs2_generic_handle_bast(), so thread A may be blocked.
>>>
>>> Now, ocfs2dc wants page lock, but thread A hold page lock and has been blocked because of
>>> thread B. Deadlock occurs, right? Again, please correct me if any.
>> Yes, you are right.
>>>
>>> So, the author had to use UNBLOCK way here thus create livelock problem. But, I cannot understand
>>> very clearly how livelock come up. Joseph or anyone else, could you please elaborate it?
>> Firstly it takes inode lock in NONBLOCK way, if it couldn't be fulfilled
>> currently and returns -EAGAIN, then unlock page and make downconvert go
>> on. This is to release the deadlock.
>> But we do have to take page lock and inode lock to safely accomplish the
>> request. So it has to return to VFS and retry with page lock. Here
>> taking inode lock and then unlocking it immediately in BLOCK way wants
> Thanks! But unfortunately in our case, it oftern takes long time waiting here 
> for inode lock, for some reasons I don't know yet. This is likely the cause why
> ocfs2_readpage() takes long.
Does a single attempt take a long time, or does it retry many times? The
former may be because the downconvert takes too much time, or because there
are many requests ahead of this node. The latter may be because this node
still loses the race on each retry.

> 
> I really want to know why to lock & unlock inode here? I'm glad you mention the reason,
> but is it the only one?
>> to cache the inode and increase the chance of the next retry. But if
> Any reason to cache the inode? And how this can increase the chance of retry? 
> It's really interesting... Please elaborate more;-)
IMO, this is because the unlock is not truly an unlock. This node's retry is
a somewhat 'local' operation, while another node has to send messages, so
this node has a better chance of winning the race.
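
To illustrate what I mean by "not truly an unlock" (a rough sketch of the
idea, not the exact dlmglue code):

/* sketch: the "unlock" only drops a local holder count; the DLM-level lock
 * stays granted ("cached") until a blocking AST from another node makes
 * ocfs2dc downconvert it */
static void sketch_cluster_unlock(struct ocfs2_lock_res *lockres, int level)
{
        spin_lock(&lockres->l_lock);
        if (level == DLM_LOCK_EX)
                lockres->l_ex_holders--;
        else
                lockres->l_ro_holders--;
        /* no DLM call here: the next local lock request at or below the
         * cached level can succeed without a cluster round-trip */
        spin_unlock(&lockres->l_lock);
}

So a retry on this node can often re-take the lock locally, while the writer
on the other node has to go through the DLM again.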

Thanks,
Joseph

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently
  2015-12-09  6:07         ` Joseph Qi
@ 2015-12-10  6:31           ` Eric Ren
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Ren @ 2015-12-10  6:31 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joseph,

On Wed, Dec 09, 2015 at 02:07:56PM +0800, joseph wrote: 
> On 2015/12/8 16:26, Eric Ren wrote:
> > Hi Joseph,
> > On Tue, Dec 08, 2015 at 02:15:17PM +0800, joseph wrote: 
> >> On 2015/12/8 12:51, Eric Ren wrote:
> >>> Hi,
> >>>
> >>> On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote: 
> >>>> Hi Gang,
> >>>> Eric and I have discussed this case before.
> >>>> Using NONBLOCK here is because there is a lock inversion between inode
> >>>> lock and page lock. You can refer to the comments of
> >>>> ocfs2_inode_lock_with_page for details.
> >>>> Actually I have found that NONBLOCK mode is only used in lock inversion
> >>>> cases.
> >>> Yes, that comments can helps;-) I try to explain it, if any problem please correct me.
> >>>
> >>> On node 1, when calling ocfs2_inode_lock_with_page(), thread A(likely doing reading) has
> >>> loced this page. Before thread A stepping into ocfs2_inode_lock_full(), on node 2, thread B(doing
> >>> writing) required EX lock on the same inode, if lockres->l_level is PR, so lockres needs to
> >>> been downconverted PR->NL on node1, i.e. ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().
> >>>
> >>> On node 1, if tread ocfs2dc proceeds like this:
> >>> ocfs2_downconvert_thread()
> >>>  ocfs2_downconvert_thread_do_work()
> >>>   ocfs2_process_blocked_lock()
> >>>    ocfs2_unblock_lock()
> >>>     lockres->l_ops->downconvert_worker(lockres, blocking)
> >>>       ocfs2_data_convert_worker() {
> >>>         ...
> >>>         3557         if (blocking == DLM_LOCK_EX) {
> >>>         3558                 truncate_inode_pages(mapping, 0);
> >>>         3559         } else {
> >>>         3560                 /* We only need to wait on the I/O if we're not also
> >>>         3561                  * truncating pages because truncate_inode_pages waits
> >>>         3562                  * for us above. We don't truncate pages if we're
> >>>         3563                  * blocking anything < EXMODE because we want to keep
> >>>         3564                  * them around in that case. */
> >>>         3565                 filemap_fdatawait(mapping);
> >>>         3566         }
> >>>         ...
> >>>       }
> >>> We can see truncate_inode_pages()
> >>>             truncate_inode_pages_range()
> >>>               trylock_page()/lock_page()/find_lock_page()
> >>>
> >>> if thread A call ocfs2_inode_lock_full() in BLOCK way, there's a window that flag
> >>> OCFS2_LOCK_BLOCED and ->l_blocking=EX have been set by BAST like ocfs2_blocking_ast()->
> >>> ocfs2_generic_handle_bast(), so thread A may be blocked.
> >>>
> >>> Now, ocfs2dc wants page lock, but thread A hold page lock and has been blocked because of
> >>> thread B. Deadlock occurs, right? Again, please correct me if any.
> >> Yes, you are right.
> >>>
> >>> So, the author had to use UNBLOCK way here thus create livelock problem. But, I cannot understand
> >>> very clearly how livelock come up. Joseph or anyone else, could you please elaborate it?
> >> Firstly it takes inode lock in NONBLOCK way, if it couldn't be fulfilled
> >> currently and returns -EAGAIN, then unlock page and make downconvert go
> >> on. This is to release the deadlock.
> >> But we do have to take page lock and inode lock to safely accomplish the
> >> request. So it has to return to VFS and retry with page lock. Here
> >> taking inode lock and then unlocking it immediately in BLOCK way wants
> > Thanks! But unfortunately in our case, it oftern takes long time waiting here 
> > for inode lock, for some reasons I don't know yet. This is likely the cause why
> > ocfs2_readpage() takes long.
> One time takes long time or retry many times? The former may be because
A single attempt takes a long time; that's what I observed with ftrace ;-)
> downconvert take too much time, or there are many requests ahead of this
> node. And the latter may be this node still fails the race when retry.
Thanks very much for your info!
> 
> > 
> > I really want to know why to lock & unlock inode here? I'm glad you mention the reason,
> > but is it the only one?
> >> to cache the inode and increase the chance of the next retry. But if
> > Any reason to cache the inode? And how this can increase the chance of retry? 
> > It's really interesting... Please elaborate more;-)
> IMO? this is because unlock is not truly unlock. This node retry is
> somewhat 'local' operation, and another node has to send messages, so
> this node has more chance to win the race.
Sounds reasonable, thanks!

 Eric

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-12-10  6:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-08  3:21 [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently Gang He
2015-12-08  3:55 ` Joseph Qi
2015-12-08  4:51   ` Gang He
2015-12-08  4:51   ` Eric Ren
2015-12-08  6:15     ` Joseph Qi
2015-12-08  8:26       ` Eric Ren
2015-12-09  6:07         ` Joseph Qi
2015-12-10  6:31           ` Eric Ren
