* Another JFFS2 deadlock, kernel 3.4.11
@ 2015-10-26 10:53 wangzaiwei
2015-11-09 17:42 ` Thomas.Betker
0 siblings, 1 reply; 7+ messages in thread
From: wangzaiwei @ 2015-10-26 10:53 UTC (permalink / raw)
To: linux-mtd; +Cc: lizhenwei, 'Li Jiaxin'
Hello all,
First, sorry for my poor English and my clumsy use of email.
I have encountered another deadlock between several JFFS2 threads.
(It is different from the one posted at
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html )
The target system is an SoC with a BCM6838 (dual core, MIPS32, Broadcom BMIPS4350 V8.0).
We made a jffs2 partition on flash (S34ML01G200TFI000, 1 Gbit SLC NAND chip),
then mounted this jffs2 partition on both /app and /data:
mtd:data on /data type jffs2 (rw,relatime)
mtd:data on /app type jffs2 (rw,relatime)
We first hit a jffs2 deadlock when we ran the command "reboot" in the
shell: "reboot" got stuck. We then checked the following:
1, Neither /app nor /data could be accessed any more;
"ls" and "touch" got stuck too.
2, [sync_supers] was in state D.
3, We compiled a kernel module that can read a process's kernel stack, and found
[sync_supers] stuck at lock_page(), the same as described at
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html .
4, "reboot" called sync() and then got stuck.
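The state check in step 2 can also be done programmatically. As a rough sketch only (the helper name proc_stat_state is made up for illustration and is not part of any tool mentioned in this thread), the state letter can be pulled out of one line of /proc/<pid>/stat, where 'D' means uninterruptible sleep:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Extract the state letter from one line of /proc/<pid>/stat.
 * The line has the form "pid (comm) state ...". comm may itself contain
 * parentheses, so search for the LAST ')' rather than the first.
 * Returns the state character (e.g. 'D'), or '\0' if the line is malformed.
 */
char proc_stat_state(const char *line)
{
    const char *close;

    assert(line != NULL);
    close = strrchr(line, ')');

    /* Expect ") " followed by the state letter. */
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}
```

A caller would read the stat file of each task of interest and report those for which the helper returns 'D'.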
So we patched our kernel with the following commit from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
SHA-1: 5ffd3412ae5536a4c57469cb8ea31887121dcb2e
* jffs2: Fix lock acquisition order bug in jffs2_write_begin
But recently we encountered another deadlock:
our process got stuck in the system call 'unlink()' when deleting a file.
The scripts below can be used to reproduce this new issue.
Script 1: testr.sh
#!/bin/sh
while [ 1 ]
do
    cat /app/test.file >/dev/null
done
Script 2: testr2.sh
#!/bin/sh
while [ 1 ]
do
    ls /app -al >/dev/null
    ls /app/test.file -al >/dev/null
done
Script 3: tests.sh
#!/bin/sh
while [ 1 ]
do
    sync
    sleep 1
done
Script 4: testw.sh
#!/bin/sh
while [ 1 ]
do
    cat /etc/inittab >> /app/test.file
    sleep 1
done
Script 5: testw2.sh
#!/bin/sh
while [ 1 ]
do
    cat /etc/config > /app/test.file
    sleep 10
done
/etc/config is an ASCII text file (size: 10785 bytes).
Run the scripts like this:
# ./testw2.sh &
# ./testw2.sh &
# ./testw2.sh &
# ./testw2.sh &
# ./testw.sh &
# ./testw.sh &
# ./testw.sh &
# ./testw.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./tests.sh &
# ./tests.sh &
About 10 minutes later, these test scripts are blocked in state 'D'.
We analyzed this issue again.
For [sync_supers]:
  jffs2_garbage_collect_live
    mutex_lock(&f->sem)                          (A)
    jffs2_garbage_collect_dnode
      jffs2_gc_fetch_page
        read_cache_page_async
          do_read_cache_page
            lock_page(page)                      (B)
For other tasks:
  generic_file_aio_read
    do_generic_file_read
      lock_page_killable(page)                   (B)
      mapping->a_ops->readpage (jffs2_readpage)
        mutex_lock(&f->sem)                      (A)
We noticed that jffs2_readpage is always called with the page lock already
held (lock_page(page)), but most other functions in the jffs2 module take
mutex_lock(&f->sem) first and lock_page(page) second. The same is true in the latest kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Is this intended, or is my understanding wrong?
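The inversion between the two call chains above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: toy_lock, toy_trylock and both_tasks_blocked are hypothetical names, the locks are plain owner fields rather than kernel mutexes or page locks, and the "consistent order" branch is an imagined alternative, not something the read path can simply do today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: owner is a task id, or 0 when the lock is free. */
struct toy_lock { int owner; };

enum { TASK_GC = 1, TASK_READER = 2 };

/* Non-blocking acquire; returns true if 'task' now owns the lock. */
bool toy_trylock(struct toy_lock *l, int task)
{
    assert(l != NULL);
    if (l->owner != 0)
        return false;
    l->owner = task;
    return true;
}

/*
 * Replays one interleaving of the GC path and the read path and reports
 * whether both end up waiting for a lock the other one holds.
 */
bool both_tasks_blocked(bool reader_takes_page_lock_first)
{
    struct toy_lock f_sem = { 0 };     /* stands in for f->sem (A) */
    struct toy_lock page_lock = { 0 }; /* stands in for the page lock (B) */
    bool gc_blocked, reader_blocked;

    /* GC path: jffs2_garbage_collect_live takes f->sem (A). */
    toy_trylock(&f_sem, TASK_GC);

    if (reader_takes_page_lock_first) {
        /* Read path as in the trace: page lock (B), then f->sem (A). */
        toy_trylock(&page_lock, TASK_READER);
        reader_blocked = !toy_trylock(&f_sem, TASK_READER);
    } else {
        /* Hypothetical consistent order: f->sem (A) before page lock (B). */
        reader_blocked = !toy_trylock(&f_sem, TASK_READER);
        if (!reader_blocked)
            toy_trylock(&page_lock, TASK_READER);
    }

    /* GC path continues: do_read_cache_page wants the page lock (B). */
    gc_blocked = !toy_trylock(&page_lock, TASK_GC);

    return gc_blocked && reader_blocked;
}
```

With the reader taking B before A, each task holds the lock the other needs and neither can continue. With a single agreed order, the reader merely waits for A while the GC path completes.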
More info:
We have tested this on another target system, an SoC with a Freescale MPC8308
(PowerPC arch, e300c3), Linux version 2.6.29.6. It has a jffs2 partition on
NOR flash (S29GL512P11TFI010).
We never reproduced any jffs2 deadlock on this PowerPC target,
even though its Linux source code has the problem mentioned in
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html.
What is the main difference between the two targets?
We noticed that the backing-dev code (bdi_sync_supers) came into the kernel between the two versions,
and that the two targets use different flash types (one NAND, one NOR).
----------------------------------
From: wangzaiwei
15th R&D Department
Sumavision Technologies Co., Ltd.
* Re: Another JFFS2 deadlock, kernel 3.4.11
2015-10-26 10:53 Another JFFS2 deadlock, kernel 3.4.11 wangzaiwei
@ 2015-11-09 17:42 ` Thomas.Betker
2015-11-11 8:16 ` wangzaiwei
0 siblings, 1 reply; 7+ messages in thread
From: Thomas.Betker @ 2015-11-09 17:42 UTC (permalink / raw)
To: wangzaiwei
Cc: 'Li Jiaxin',
linux-mtd, linux-mtd, lizhenwei, Deng Chao, Ming Liu,
Joakim Tjernlund
Hello wangzaiwei:
> So we patched our kernel refer to
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> SHA-1: 5ffd3412ae5536a4c57469cb8ea31887121dcb2e
> * jffs2: Fix lock acquisition order bug in jffs2_write_begin
>
> But these days, we encountered another deadlock .
> our process stucked at system call 'unlink()' when we delete a file.
>
> Enclosed scripts can be used to reproduce this new issue.
[snip]
> about 10 minutes later, these test scripts will be blocked in state 'D'
>
> We parsed this issue again.
> for [sync_supers]
> jffs2_garbage_collect_live
> mutex_lock(&f->sem) (A)
> jffs2_garbage_collect_dnode
> jffs2_gc_fetch_page
> read_cache_page_async
> do_read_cache_page
> lock_page(page) (B)
> For other tasks
> generic_file_aio_read
> do_generic_file_read
> lock_page_killable(page); (B)
> mapping->a_ops->readpage (jffs2_readpage )
> mutex_lock(&f->sem) (A)
>
> We noticed that jffs2_readpage always be called with lock_page(page) hold,
> but most of other functions in jffs2 module call mutex_lock(&f->sem) first,
> lock_page(page) second. It is the same in latest kernel:
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> Is this logical? Or is it just my understanding wrong?
This looks suspiciously like a deadlock reported by Ming Liu
(22-Aug-2013). This deadlock, and another one reported by Deng Chao
(23-Jul-2013), were introduced by my patch, "jffs2: Fix lock acquisition
order bug in jffs2_write_begin".
Deng Chao has created a patch which a) removes the deadlock I wanted to
get rid of originally, without b) introducing the new deadlocks; see
http://lists.infradead.org/pipermail/linux-mtd/2013-August/048352.html.
However, his patch modifies mm/filemap.c, and we were hoping to find a
more light-weight solution -- which never came to be.
I do use his patch around here, though, and so far, it has worked fine. I
will try to run your test scripts on one of our devices, and see if it
holds up.
Anyway, I think I should revert my patch (and should have done so a long
time ago) even if this means that my original deadlock will come back.
This is necessary in any case to clear the way for Deng Chao's patch, or
perhaps for some other solution. Joakim, what's your take on this?
Best regards,
Thomas Betker
* Re: Re: Another JFFS2 deadlock, kernel 3.4.11
2015-11-09 17:42 ` Thomas.Betker
@ 2015-11-11 8:16 ` wangzaiwei
2015-11-11 12:20 ` Thomas.Betker
0 siblings, 1 reply; 7+ messages in thread
From: wangzaiwei @ 2015-11-11 8:16 UTC (permalink / raw)
To: Thomas.Betker
Cc: 'Li Jiaxin',
linux-mtd, linux-mtd, lizhenwei, Deng Chao, Ming Liu,
Joakim Tjernlund
Hello Thomas:
> This looks suspiciously like a deadlock reported by Ming Liu
> (22-Aug-2013). This deadlock, and another one reported by Deng Chao
> (23-Jul-2013), were introduced by my patch, "jffs2: Fix lock acquisition
> order bug in jffs2_write_begin".
> Deng Chao has created a patch which a) removes the deadlock I wanted to
> get rid of originally, without b) introducing the new deadlocks; see
> http://lists.infradead.org/pipermail/linux-mtd/2013-August/048352.html.
> However, his patch modifies mm/filemap.c, and we were hoping to find a
> more light-weight solution -- which never came to be.
> I do use his patch here around, though, and so far, it has worked fine. I
> will try to run your test scripts on one of our devices, and see if it
> holds up.
Although I did not know that Deng Chao and Ming Liu had already reported the issue,
I had the same idea for a patch.
Yes, the deadlocks we have found always occurred between the gc thread (which may
be activated by the sync system call) and other user tasks.
The gc thread looks like this:
> for [sync_supers]
> jffs2_garbage_collect_live
> mutex_lock(&f->sem) (A)
> jffs2_garbage_collect_dnode
> jffs2_gc_fetch_page
> read_cache_page_async
> do_read_cache_page
> lock_page(page) (B)
If we change lock_page(page) above to a non-blocking lock_page_try(page) (the kernel's
trylock_page()), the deadlock goes away.
But I am worried about this workaround: the behaviour of jffs2_garbage_collect_live would change,
and jffs2_garbage_collect_live is called not only by the gc thread.
Is it OK to return an error rather than block?
Can the 'sync' syscall still achieve its goal?
wangzaiwei
* Re: Another JFFS2 deadlock, kernel 3.4.11
2015-11-11 8:16 ` wangzaiwei
@ 2015-11-11 12:20 ` Thomas.Betker
2015-11-12 2:26 ` deng.chao1
0 siblings, 1 reply; 7+ messages in thread
From: Thomas.Betker @ 2015-11-11 12:20 UTC (permalink / raw)
To: wangzaiwei
Cc: Deng Chao, Joakim Tjernlund, 'Li Jiaxin',
linux-mtd, linux-mtd, Ming Liu, lizhenwei
Hello wangzaiwei:
> > Deng Chao has created a patch which a) removes the deadlock I wanted to
> > get rid of originally, without b) introducing the new deadlocks; see
> > http://lists.infradead.org/pipermail/linux-mtd/2013-August/048352.html.
> > However, his patch modifies mm/filemap.c, and we were hoping to find a
> > more light-weight solution -- which never came to be.
>
> > I do use his patch here around, though, and so far, it has worked fine. I
> > will try to run your test scripts on one of our devices, and see if it
> > holds up.
I have run your scripts on my device (with Deng Chao's patch) for three
hours. None of the scripts got into state 'D', and the system is still
alive (2 CPUs, so if a deadlock had occurred, the system would have
stopped dead).
> Though I didn't know about that Deng Chao and Ming Liu had reported the issue,
> I have had the same patch thinking.
>
> Yes,these deadlock issues which we have found always occured between
> gc thread(may
> actived by sync system call) and other user tasks
>
> gc thread just like :
> > for [sync_supers]
> > jffs2_garbage_collect_live
> > mutex_lock(&f->sem) (A)
> > jffs2_garbage_collect_dnode
> > jffs2_gc_fetch_page
> > read_cache_page_async
> > do_read_cache_page
> > lock_page(page) (B)
>
>
> if we change lock_page(page) above to lock_page_try(page),deadlock
> will go away.
>
> But i worry about this workaround. jffs2_garbage_collect_live action
> will changed.
> and jffs2_garbage_collect_live is called not only by gc thread.
> Is it ok to return an error rather than blocking .
> Can syscall 'sync' still reach its goal ?
I think that Deng Chao has discussed this in his patch. We have been using
the patch for almost two years now, and I haven't seen any bad effects yet.
Best regards,
Thomas Betker
* Re: Another JFFS2 deadlock, kernel 3.4.11
2015-11-11 12:20 ` Thomas.Betker
@ 2015-11-12 2:26 ` deng.chao1
2015-11-12 9:47 ` Thomas.Betker
0 siblings, 1 reply; 7+ messages in thread
From: deng.chao1 @ 2015-11-12 2:26 UTC (permalink / raw)
To: Thomas.Betker
Cc: Joakim Tjernlund, 'Li Jiaxin',
linux-mtd, linux-mtd, Ming Liu, lizhenwei, wangzaiwei, chao.deng
Hi all:
My patch makes jffs2_garbage_collect_pass return 0, NOT an error, when it cannot get the page lock. This means "try again".
No matter where jffs2_garbage_collect_pass is called from, it will keep being retried until it reaches its goal.
However, wangzaiwei's doubt is reasonable. jffs2_garbage_collect_pass is called not only by the gc thread but also by jffs2_reserve_space, and this can introduce a livelock in a rare situation.
Consider this:
The disk is almost full, which means jffs2_reserve_space may call jffs2_garbage_collect_pass to get free space when performing a write operation.
Thread A has low priority and has acquired the page lock because it is writing.
Thread B is an rt thread with higher priority than A, and is also writing. If B preempts A while A holds the page lock, and at the same time B calls jffs2_reserve_space->jffs2_garbage_collect_pass to acquire the page lock, a livelock occurs: B loops forever waiting for A to release the page lock, but A cannot run because B has preempted it.
To solve this, I make jffs2_reserve_space sleep for a while when it finds that jffs2_garbage_collect_pass cannot get the page lock.
Still, I agree with Thomas that my patch is too heavy. It would be much better if we found a way to modify only jffs2_garbage_collect_pass to avoid the original deadlock.
But the fix is too tricky for me; I have not got any idea yet.
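The retry-and-sleep behaviour described above can be sketched with a toy model. This is not the patch itself: toy_page, toy_gc_pass, toy_sleep and toy_reserve_space are hypothetical stand-ins, the boolean return value is a simplification of the real return convention, and the "sleep" only simulates giving the lock holder a time slice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page lock: 'held' is true while another writer owns the page;
 * 'holder_work_left' counts the time slices that writer still needs. */
struct toy_page {
    bool held;
    int holder_work_left;
};

/* Stand-in for one GC pass: returns true if it got the page lock and made
 * progress, false for "page busy, try again". It never blocks. */
bool toy_gc_pass(struct toy_page *p)
{
    assert(p != NULL);
    if (p->held)
        return false;
    /* lock the page, move the data, unlock the page */
    return true;
}

/* Stand-in for sleeping: gives the lock holder one time slice. */
void toy_sleep(struct toy_page *p)
{
    assert(p != NULL);
    if (p->held && --p->holder_work_left <= 0)
        p->held = false;
}

/* Stand-in for the reserve-space loop. Returns the number of GC passes
 * needed, or -1 if max_tries was exhausted. With sleep_between == false
 * the holder never gets to run, which models the livelock. */
int toy_reserve_space(struct toy_page *p, int max_tries, bool sleep_between)
{
    for (int tries = 1; tries <= max_tries; tries++) {
        if (toy_gc_pass(p))
            return tries;
        if (sleep_between)
            toy_sleep(p);
    }
    return -1;
}
```

In this model the loop without the sleep spins until max_tries runs out, because the lower-priority holder is never scheduled. With the sleep, the holder finishes and the next pass succeeds.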
Thanks
Dengchao
* Re: Another JFFS2 deadlock, kernel 3.4.11
2015-11-12 2:26 ` deng.chao1
@ 2015-11-12 9:47 ` Thomas.Betker
2015-11-13 2:31 ` deng.chao1
0 siblings, 1 reply; 7+ messages in thread
From: Thomas.Betker @ 2015-11-12 9:47 UTC (permalink / raw)
To: deng.chao1
Cc: chao.deng, Joakim Tjernlund, 'Li Jiaxin',
linux-mtd, linux-mtd, Ming Liu, lizhenwei, wangzaiwei
Hello Deng:
> My patch makes jffs2_garbage_collect_pass return 0 NOT error when
> it can not get page lock. This means "try again".
> No matter where jffs2_garbage_collect_pass is called, it will
> always loop until it gets its goal.
> However wangzaiwei's doubt is somehow reasonable.
> jffs2_garbage_collect_pass is not only called by gc thread but also
> by jffs2_reserve_space, this will introduce a living lock in a rare situation.
> Consider this:
> The disk is almost full, this means jffs2_reserve_space may call
> jffs2_garbage_collect_pass to get free space when performing write operation.
> Then Thread A has acquired the page lock when it is now writing,
> and its priority is low.
> Thread B is a rt thread, and its priority is higher than A, is
> also writing. If B preempts A when A is holding the page lock, and
> in the same time B calls jffs2_reserve_space->jffs2_collect_pass to
> acquire the page lock, living lock occurs: B will always loop to
> wait A to release the page lock which is preemptted by B itself.
>
> To solve this, I make jffs2_reserve_space to sleep a while when it
> finds jffs2_garbage_collect_pass cannot fetch the page lock.
>
> Still,I agree with Thomas that my patch is too heavy. It will be
> much better if we find way to just modify jffs2_garbage_collect_pass
> to avoid the original deadlock.
> But I think the fix is too tricky to me, I have not got any idea yet.
I would still think it's a good idea if you sent your current patch to
linux-mtd and linux-mm. At the moment, it's the only solution we have, and
perhaps somebody on linux-mm will find a better way. And your (old) patch
has been running happily on our devices for almost two years now, so I can
provide a Tested-by:.
@wangzaiwei: Your test scripts have actually run for 24 hours on my device
without any problems (I had to stop it this morning).
Best regards,
Thomas Betker
* Re: Another JFFS2 deadlock, kernel 3.4.11
2015-11-12 9:47 ` Thomas.Betker
@ 2015-11-13 2:31 ` deng.chao1
0 siblings, 0 replies; 7+ messages in thread
From: deng.chao1 @ 2015-11-13 2:31 UTC (permalink / raw)
To: Thomas.Betker
Cc: chao.deng, Joakim Tjernlund, 'Li Jiaxin',
linux-mtd, linux-mtd, Ming Liu, lizhenwei, wangzaiwei
Hi Thomas:
OK. I'll redo the patch and post it to linux-mtd and linux-mm in a few days,
then wait for comments.
Thanks
Dengchao