ocfs2-devel.lists.linux.dev archive mirror
* [Ocfs2-devel] [PATCH V2] ocfs2: initialize ip_next_orphan
@ 2020-11-09 17:17 Wengang Wang
  2020-11-10  1:33 ` Joseph Qi
  0 siblings, 1 reply; 5+ messages in thread
From: Wengang Wang @ 2020-11-09 17:17 UTC (permalink / raw)
  To: ocfs2-devel

Though the problem was found on an older 4.1.12 kernel, I think upstream
has the same issue.

On one node in the cluster, the following call trace was seen:

# cat /proc/21473/stack
[<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
[<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
[<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
[<ffffffff8122b36e>] evict+0xae/0x1a0
[<ffffffff8122bd26>] iput+0x1c6/0x230
[<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
[<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
[<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
[<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
[<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
[<ffffffff810a1399>] process_one_work+0x169/0x4a0
[<ffffffff810a1bcb>] worker_thread+0x5b/0x560
[<ffffffff810a7a2b>] kthread+0xcb/0xf0
[<ffffffff816f5d21>] ret_from_fork+0x61/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

The above stack is not reasonable; the final iput should not happen inside
ocfs2_orphan_filldir(). Looking at the code:

2067         /* Skip inodes which are already added to recover list, since dio may
2068          * happen concurrently with unlink/rename */
2069         if (OCFS2_I(iter)->ip_next_orphan) {
2070                 iput(iter);
2071                 return 0;
2072         }
2073

The logic assumes the inode is already in the recovery list when it sees a
non-NULL ip_next_orphan, so it skips this inode after dropping the
reference that was taken in ocfs2_iget().

However, if the inode were really in the recovery list, it would hold
another reference, and the iput() at line 2070 would not be the final iput
(dropping the last reference). So I don't think the inode is actually in
the recovery list (no vmcore to confirm).
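
For reference, here is a minimal sketch of how ocfs2_orphan_filldir() is
meant to chain inodes through ip_next_orphan. This is not the verbatim
upstream code; the function name and the 'head' parameter stand in for the
recovery-list head the real filldir callback keeps in its private data:

static int orphan_filldir_sketch(struct inode *iter, struct inode **head)
{
	struct ocfs2_inode_info *oi = OCFS2_I(iter);

	/* With a stale, never-initialized ip_next_orphan this branch is
	 * taken by mistake: the inode is treated as already queued and
	 * the iput() below may be the final one, evicting the inode
	 * right here. */
	if (oi->ip_next_orphan) {
		iput(iter);
		return 0;
	}

	/* Intended path: push the inode onto the singly linked recovery
	 * list; the reference taken in ocfs2_iget() is kept and dropped
	 * later by ocfs2_recover_orphans(). */
	oi->ip_next_orphan = *head;
	*head = iter;
	return 0;
}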

Note that ocfs2_queue_orphans(), though it does not show up in the call
trace, holds the cluster lock on the orphan directory while looking up
unlinked inodes. Evicting the on-disk inode can involve a lot of I/O and
may take a long time to finish. That means this node could hold the
cluster lock for a very long time, which can leave lock requests (from
other nodes) for the orphan directory hanging for a long time.
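
The locking order in ocfs2_queue_orphans() can be outlined roughly as
below (a simplified sketch with an assumed function name and error
handling trimmed, not the upstream code):

static int queue_orphans_outline(struct inode *orphan_dir_inode,
				 struct dir_context *ctx)
{
	int status;

	/* Cluster lock on the orphan directory, held across the scan. */
	status = ocfs2_inode_lock(orphan_dir_inode, NULL, 0);
	if (status < 0)
		return status;

	/* Walks every orphan entry; each hit goes through
	 * ocfs2_orphan_filldir() and its iget()/iput() pair, so a heavy
	 * eviction triggered there runs with the lock still held. */
	status = ocfs2_dir_foreach(orphan_dir_inode, ctx);

	ocfs2_inode_unlock(orphan_dir_inode, 0);
	return status;
}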

Looking further at ip_next_orphan, I found it is not initialized when a
new ocfs2_inode_info structure is allocated.
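
ocfs2_inode_init_once() matters here because it is the constructor of the
ocfs2 inode slab cache, i.e. the one place that gives these fields a known
starting value. The registration looks roughly like this (a sketch, with
the flag list abbreviated):

	/* Any field the constructor does not set (like ip_next_orphan
	 * before this patch) starts out with whatever the underlying
	 * memory happened to contain. */
	ocfs2_inode_cachep = kmem_cache_create("ocfs2_inode_cache",
					       sizeof(struct ocfs2_inode_info),
					       0,
					       SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT,
					       ocfs2_inode_init_once);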

Fix:
	initialize ip_next_orphan as NULL.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
v1 -> v2: move the initialization of ip_next_orphan earlier.
---
 fs/ocfs2/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 1d91dd1e8711..2febc76e9de7 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1713,6 +1713,7 @@ static void ocfs2_inode_init_once(void *data)
 
 	oi->ip_blkno = 0ULL;
 	oi->ip_clusters = 0;
+	oi->ip_next_orphan = NULL;
 
 	ocfs2_resv_init_once(&oi->ip_la_data_resv);
 
-- 
2.21.0


* [Ocfs2-devel] [PATCH V2] ocfs2: initialize ip_next_orphan
  2020-11-09 17:17 [Ocfs2-devel] [PATCH V2] ocfs2: initialize ip_next_orphan Wengang Wang
@ 2020-11-10  1:33 ` Joseph Qi
  2020-11-10  1:56   ` Wengang Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Qi @ 2020-11-10  1:33 UTC (permalink / raw)
  To: ocfs2-devel



On 11/10/20 1:17 AM, Wengang Wang wrote:
> Though the problem was found on an older 4.1.12 kernel, I think upstream
> has the same issue.
>
> On one node in the cluster, the following call trace was seen:
>
> # cat /proc/21473/stack
> [<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
> [<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
> [<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
> [<ffffffff8122b36e>] evict+0xae/0x1a0
> [<ffffffff8122bd26>] iput+0x1c6/0x230
> [<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
> [<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
> [<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
> [<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
> [<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
> [<ffffffff810a1399>] process_one_work+0x169/0x4a0
> [<ffffffff810a1bcb>] worker_thread+0x5b/0x560
> [<ffffffff810a7a2b>] kthread+0xcb/0xf0
> [<ffffffff816f5d21>] ret_from_fork+0x61/0x90
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The above stack is not reasonable; the final iput should not happen inside
> ocfs2_orphan_filldir(). Looking at the code:
>
> 2067         /* Skip inodes which are already added to recover list, since dio may
> 2068          * happen concurrently with unlink/rename */
> 2069         if (OCFS2_I(iter)->ip_next_orphan) {
> 2070                 iput(iter);
> 2071                 return 0;
> 2072         }
> 2073
>
> The logic assumes the inode is already in the recovery list when it sees a
> non-NULL ip_next_orphan, so it skips this inode after dropping the
> reference that was taken in ocfs2_iget().
>
> However, if the inode were really in the recovery list, it would hold
> another reference, and the iput() at line 2070 would not be the final iput
> (dropping the last reference). So I don't think the inode is actually in
> the recovery list (no vmcore to confirm).
>
> Note that ocfs2_queue_orphans(), though it does not show up in the call
> trace, holds the cluster lock on the orphan directory while looking up
> unlinked inodes. Evicting the on-disk inode can involve a lot of I/O and
> may take a long time to finish. That means this node could hold the
> cluster lock for a very long time, which can leave lock requests (from
> other nodes) for the orphan directory hanging for a long time.
>
> Looking further at ip_next_orphan, I found it is not initialized when a
> new ocfs2_inode_info structure is allocated.
>
> Fix:
> 	initialize ip_next_orphan as NULL.
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> ---
> v1 -> v2: move the initialization of ip_next_orphan earlier.
> ---
>   fs/ocfs2/super.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index 1d91dd1e8711..2febc76e9de7 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -1713,6 +1713,7 @@ static void ocfs2_inode_init_once(void *data)
>   
>   	oi->ip_blkno = 0ULL;
>   	oi->ip_clusters = 0;
> +	oi->ip_next_orphan = NULL;
>   
>   	ocfs2_resv_init_once(&oi->ip_la_data_resv);
>   


* [Ocfs2-devel] [PATCH V2] ocfs2: initialize ip_next_orphan
  2020-11-10  1:33 ` Joseph Qi
@ 2020-11-10  1:56   ` Wengang Wang
  2020-11-10 21:04     ` Andrew Morton
  0 siblings, 1 reply; 5+ messages in thread
From: Wengang Wang @ 2020-11-10  1:56 UTC (permalink / raw)
  To: ocfs2-devel


On 11/9/20 5:33 PM, Joseph Qi wrote:
>
>
> On 11/10/20 1:17 AM, Wengang Wang wrote:
>> Though the problem was found on an older 4.1.12 kernel, I think upstream
>> has the same issue.
>>
>> On one node in the cluster, the following call trace was seen:
>>
>> # cat /proc/21473/stack
>> [<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
>> [<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
>> [<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
>> [<ffffffff8122b36e>] evict+0xae/0x1a0
>> [<ffffffff8122bd26>] iput+0x1c6/0x230
>> [<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
>> [<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
>> [<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
>> [<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
>> [<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
>> [<ffffffff810a1399>] process_one_work+0x169/0x4a0
>> [<ffffffff810a1bcb>] worker_thread+0x5b/0x560
>> [<ffffffff810a7a2b>] kthread+0xcb/0xf0
>> [<ffffffff816f5d21>] ret_from_fork+0x61/0x90
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> The above stack is not reasonable; the final iput should not happen inside
>> ocfs2_orphan_filldir(). Looking at the code:
>>
>> 2067         /* Skip inodes which are already added to recover list, since dio may
>> 2068          * happen concurrently with unlink/rename */
>> 2069         if (OCFS2_I(iter)->ip_next_orphan) {
>> 2070                 iput(iter);
>> 2071                 return 0;
>> 2072         }
>> 2073
>>
>> The logic assumes the inode is already in the recovery list when it sees a
>> non-NULL ip_next_orphan, so it skips this inode after dropping the
>> reference that was taken in ocfs2_iget().
>>
>> However, if the inode were really in the recovery list, it would hold
>> another reference, and the iput() at line 2070 would not be the final iput
>> (dropping the last reference). So I don't think the inode is actually in
>> the recovery list (no vmcore to confirm).
>>
>> Note that ocfs2_queue_orphans(), though it does not show up in the call
>> trace, holds the cluster lock on the orphan directory while looking up
>> unlinked inodes. Evicting the on-disk inode can involve a lot of I/O and
>> may take a long time to finish. That means this node could hold the
>> cluster lock for a very long time, which can leave lock requests (from
>> other nodes) for the orphan directory hanging for a long time.
>>
>> Looking further at ip_next_orphan, I found it is not initialized when a
>> new ocfs2_inode_info structure is allocated.
>>
>> Fix:
>> 	initialize ip_next_orphan as NULL.
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Thank you Joseph!

AKPM, could you please pull this patch to your tree?

thanks,
wengang

>> ---
>> v1 -> v2: move the initialization of ip_next_orphan earlier.
>> ---
>>   fs/ocfs2/super.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>> index 1d91dd1e8711..2febc76e9de7 100644
>> --- a/fs/ocfs2/super.c
>> +++ b/fs/ocfs2/super.c
>> @@ -1713,6 +1713,7 @@ static void ocfs2_inode_init_once(void *data)
>>  
>>  	oi->ip_blkno = 0ULL;
>>  	oi->ip_clusters = 0;
>> +	oi->ip_next_orphan = NULL;
>>  
>>  	ocfs2_resv_init_once(&oi->ip_la_data_resv);
>


* [Ocfs2-devel] [PATCH V2] ocfs2: initialize ip_next_orphan
  2020-11-10  1:56   ` Wengang Wang
@ 2020-11-10 21:04     ` Andrew Morton
  2020-11-11  0:55       ` Wengang Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2020-11-10 21:04 UTC (permalink / raw)
  To: ocfs2-devel

On Mon, 9 Nov 2020 17:56:41 -0800 Wengang Wang <wen.gang.wang@oracle.com> wrote:

> >> Looking further at ip_next_orphan, I found it is not initialized when
> >> a new ocfs2_inode_info structure is allocated.
> >>
> >> Fix:
> >> 	initialize ip_next_orphan as NULL.
> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >
> > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> 
> Thank you Joseph!
> 
> AKPM, could you please pull this patch to your tree?

I'm trying to figure out whether we should add cc:stable to this.  But
to know that, I need to know the end-user visible effects of the bug,
and that isn't described in this changelog (it should have been!).

Please describe this?


* [Ocfs2-devel] [PATCH V2] ocfs2: initialize ip_next_orphan
  2020-11-10 21:04     ` Andrew Morton
@ 2020-11-11  0:55       ` Wengang Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Wengang Wang @ 2020-11-11  0:55 UTC (permalink / raw)
  To: ocfs2-devel


On 11/10/20 1:04 PM, Andrew Morton wrote:
> On Mon, 9 Nov 2020 17:56:41 -0800 Wengang Wang <wen.gang.wang@oracle.com> wrote:
>
>>>> Looking further at ip_next_orphan, I found it is not initialized when
>>>> a new ocfs2_inode_info structure is allocated.
>>>>
>>>> Fix:
>>>> 	initialize ip_next_orphan as NULL.
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
>> Thank you Joseph!
>>
>> AKPM, could you please pull this patch to your tree?
> I'm trying to figure out whether we should add cc:stable to this.  But
> to know that, I need to know the end-user visible effects of the bug,
> and that isn't described in this changelog (it should have been!).
>
> Please describe this?

The observed end-user visible effect is that reflink operations from some
nodes hang for a very long time waiting for the cluster lock on the orphan
directory.

So yes, we should add cc:stable. Sorry, I missed that part.

Thanks,
Wengang

