From: Al Viro <viro@zeniv.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "zhengbin (A)" <zhengbin13@huawei.com>, Jan Kara <jack@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
"zhangyi (F)" <yi.zhang@huawei.com>,
renxudong1@huawei.com, Hou Tao <houtao1@huawei.com>
Subject: Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
Date: Sun, 15 Sep 2019 01:50:46 +0100 [thread overview]
Message-ID: <20190915005046.GV1131@ZenIV.linux.org.uk> (raw)
In-Reply-To: <CAHk-=whpoQ_hX2KeqjQs3DeX6Wb4Tmb8BkHa5zr-Xu=S55+ORg@mail.gmail.com>
On Sat, Sep 14, 2019 at 03:57:15PM -0700, Linus Torvalds wrote:
> But it is also possible that we could avoid some of that O(n**2)
> behavior by simply not adding the corsor to the end of the dentry
> child list at all. Right now your patch *always* sets the cursor at a
> valid point - even if it's at the end of the directory. But we could
> skip the "end of the directory" case entirely and just set a flag in
> the file for "at eof" instead.
Yeah.
> That way the cursor at least wouldn't exist for the common cases when
> we return to user space (at the beginning of the readdir and at the
> end). Which might make the cursors simply not be quite as common, even
> when you have a _lot_ of concurrent readdir() users.
>
> There may be other similar things we could do to minimize the pressure
> on the parent dentry lock. For example, maybe we could insert a cursor
> only _between_ readdir() calls, but then in the readdir() loop itself,
> we know we hold the inode lock shared for the directory, so during
> _that_ loop we can use the existing positive dentries that we keep a
> refcount to as cursors.
>
> Because if the only reason to not use existing dentry pointers as
> cursors is the concurrent rename() issue, then _that_ is certainly
> something that the parent inode shared lock should protect against,
> even if the child dentry list can change in other ways).
>
> So if the main 'reaim' problem was that the dentry child list itself
> grew due to the cursor pressure (and that is likely) and that in turn
> then caused the O(n**2) bad locking behavior due to having to walk
> much longer dentry child chains, then I suspect that we can do a few
> tweaks to simply not make that happen in practice.
>
> Yes, I realize that I'm handwaving a bit on the two above suggestions,
> but don't you think that sounds doable?
I think I have a stronger solution. It includes the "cursors past
the EOF are marked", but that's not all.
Look: the obvious problem with adding an hlist of cursors anchored in
dentry is that we don't want to blow dentry size. OK, but... we have
d_subdirs/d_child. And the same crawl through the tree has shown
exactly one place where we do anything to the end of the list - right
in dcache_readdir(). So we could bloody well turn that thing into
hlist. Voila - a pointer saved.
So, fuck using struct dentry for cursors. What we need in there
is
* pointer to struct dentry we would be about to read
* hlist_node linking them together
* indicator of pointing before / after the entire directory
contents.
* some way to get to dentry of directory and/or struct file.
Which can be done in much smaller space than struct dentry; details
of representation really don't matter.
d_subdirs/d_child become an hlist_head/hlist_node list; no cursors
in there at any time.
d_cursors is a new hlist_head, anchoring the set of cursor that
point to this sucker. The list is protected by ->d_lock of
that dentry.
d_cursors being non-empty contributes 1 to d_count.
dcache_readdir()/dcache_dir_lseek() have exclusion with simple_rename()
on parent's inode. On per-cursor basis they have exclusion with
each other (already; ->f_lock_pos).
dcache_dir_close() locks directory shared, giving it exclusion with
simple_rename(); per-cursor exclusion is, of course, already there.
Under that rwsem it removes the cursor from the hlist it's on (if
any), using ->d_lock of whatever it's point on for protection.
If that was the last cursor in hlist, drop the target after removal.
In any case, cursor gets freed.
In simple_rename():
if there are cursors
grab the next sibling (if any)
take ->d_lock on source
rip the list out of it
drop ->d_lock on source
if there's no sibling
go through the list
point cursors post-EOF, dissolving hlist
else
go through the list
point cursors to the sibling
find the last element of the list, while we are at it
if the sibling's list hadn't been empty to start with
drop our reference to sibling
take ->d_lock on the sibling
splice the list in
drop ->d_lock on sibling
drop the reference to source (from now-empty ->d_cursors)
Note that it's O(1) under any ->d_lock and O(cursors moved) without
holding any ->d_lock.
walk_cursor(cursor, count):
while count && cursor is not post-EOF
drop_old = NULL
if cursor points to anything
take ->d_lock on the target
remove cursor from hlist
if target's ->d_cursor is now empty
drop_old = target
drop ->d_lock on the target
p = target's ->d_child
else
p = parent's ->d_subdirs
take ->d_lock on directory
drop_new = NULL
eof = true
for d in list, starting after p
if d is positive && !--count || need_resched
eof = false
drop_new = dget(d)
point cursor to d
take ->d_lock on d
if its ->d_cursors is empty
drop_new = NULL
insert cursor into its ->d_cursors
drop ->d_lock on d
break
drop ->d_lock on directory
dput(drop_new)
dput(drop_old)
if eof
mark cursor post-EOF
else if count
cond_resched()
dcache_dir_lseek(), "changing position" case:
walk_cursor(cursor, where)
dcache_readdir() loop:
while cursor not post-EOF
if dir_emit the cursor's target fails
break
walk_cursor(cursor, 1)
ctx->pos++
dentry_unlink(): none of the cursor shit in the list, TYVM, and we can't
be called with cursors attached - d_count would've been positive.
What it should, AFAICS, give:
* no loops under ->d_lock in dentry_unlist()
* no cursors polluting the lists
* dentry size unchanged
* cursors considerably smaller than now
* no looping for hell knows how long under ->d_lock
And it might be possible to be clever enough to get lockless walk_cursor()
(lockless in terms of directory ->d_lock, that is), but even without that
this scheme looks interesting, IMO.
I haven't even started trying to implement that, so I might very well have
missed obvious problems. Comments?
next prev parent reply other threads:[~2019-09-15 0:51 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-03 14:44 Possible FS race condition between iterate_dir and d_alloc_parallel zhengbin (A)
2019-09-03 15:40 ` Al Viro
2019-09-03 15:41 ` Al Viro
2019-09-04 6:15 ` zhengbin (A)
2019-09-05 17:47 ` Al Viro
2019-09-06 0:55 ` Jun Li
2019-09-06 2:00 ` Al Viro
2019-09-06 2:32 ` zhengbin (A)
2019-09-09 14:10 ` zhengbin (A)
2019-09-09 14:59 ` Al Viro
2019-09-09 15:10 ` zhengbin (A)
[not found] ` <7e32cda5-dc89-719d-9651-cf2bd06ae728@huawei.com>
2019-09-10 21:53 ` Al Viro
2019-09-10 22:17 ` Al Viro
2019-09-14 16:16 ` [PATCH] " Al Viro
2019-09-14 16:49 ` Linus Torvalds
2019-09-14 17:01 ` Al Viro
2019-09-14 17:15 ` Linus Torvalds
2019-09-14 20:04 ` Al Viro
2019-09-14 22:57 ` Linus Torvalds
2019-09-15 0:50 ` Al Viro [this message]
2019-09-15 1:41 ` Linus Torvalds
2019-09-15 16:02 ` Al Viro
2019-09-15 17:58 ` Linus Torvalds
2019-09-21 14:07 ` Al Viro
2019-09-21 16:21 ` Linus Torvalds
2019-09-21 17:18 ` Al Viro
2019-09-21 17:38 ` Linus Torvalds
2019-09-24 2:52 ` Al Viro
2019-09-24 13:30 ` Josef Bacik
2019-09-24 14:51 ` Al Viro
2019-09-24 15:01 ` Josef Bacik
2019-09-24 15:11 ` Al Viro
2019-09-24 15:26 ` Josef Bacik
2019-09-24 16:33 ` Al Viro
[not found] ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
2019-09-29 5:29 ` Al Viro
2019-09-25 11:59 ` Amir Goldstein
2019-09-25 12:22 ` Al Viro
2019-09-25 12:34 ` Amir Goldstein
2019-09-22 21:29 ` Al Viro
2019-09-23 3:32 ` zhengbin (A)
2019-09-23 5:08 ` Al Viro
[not found] ` <20190916020434.tutzwipgs4f6o3di@inn2.lkp.intel.com>
2019-09-16 2:58 ` 266a9a8b41: WARNING:possible_recursive_locking_detected Al Viro
2019-09-16 3:03 ` Al Viro
2019-09-16 3:44 ` Linus Torvalds
2019-09-16 17:16 ` Al Viro
2019-09-16 17:29 ` Al Viro
[not found] ` <bd707e64-9650-e9ed-a820-e2cabd02eaf8@huawei.com>
2019-09-17 12:01 ` Al Viro
2019-09-19 3:36 ` zhengbin (A)
2019-09-19 3:55 ` Al Viro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190915005046.GV1131@ZenIV.linux.org.uk \
--to=viro@zeniv.linux.org.uk \
--cc=akpm@linux-foundation.org \
--cc=houtao1@huawei.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=renxudong1@huawei.com \
--cc=torvalds@linux-foundation.org \
--cc=yi.zhang@huawei.com \
--cc=zhengbin13@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).