linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: haosdent <haosdent@gmail.com>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	zhengyu.duan@shopee.com, Haosong Huang <huangh@sea.com>
Subject: Re: NULL pointer dereference when access /proc/net
Date: Mon, 3 May 2021 23:31:48 +0800	[thread overview]
Message-ID: <CAFt=ROOnU-rVoLoP=bgR3Y3WpJmhTerAQZqFd-v=eZ306Vrobg@mail.gmail.com> (raw)
In-Reply-To: <YIb4xIU6x5BLe0wd@zeniv-ca.linux.org.uk>

Hi, Alexander, thanks a lot for your detailed explanations. I take a
look at the code again and the thread and I realize there are some
incorrect representations in my previous word.

>>   struct inode *d_inode = 0x0         <======= d_inode is NULL and cause Oops!

Actually it is Oops at `inode->i_flags` directly instead of `d_inode`,
so the code would not further to change the access time.

```
bool __atime_needs_update(const struct path *path, struct inode *inode,
  bool rcu)
{
struct vfsmount *mnt = path->mnt;
struct timespec now;

if (inode->i_flags & S_NOATIME)    <======= Oops at here according to
`[19521409.372016] IP: __atime_needs_update+0x5/0x190`.
return false;
```

But it looks impossible if we take a look at "walk_component > lookup_fast".

Let me introduce why it goes through "walk_component > lookup_fast"
instead of "walk_component > lookup_slow" first,

in "walk_component > step_into > pick_link", the code would set
`nameidata->stack->seq` to `seq` which is comes from the passed in
parameters.
If the code goes through "walk_component > lookup_slow", `seq` would
be 0 and then `nameidata->stack->seq` would be 0.
If the code goes through "walk_component > lookup_slow", `seq` would
be `dentry->d_seq->sequence`.
According to the contents in attachment files "nameidata.txt" and
"dentry.txt", `dentry->d_seq->sequence` is 4, and
`nameidata->stack->seq` is 4 as well.
So looks like the code goes through "walk_component > lookup_fast" and
"walk_component > step_into > pick_link".

The `inode` parameter in `__atime_needs_update` comes from
`nameidata->link_inode`. But in attachment file "nameidata.txt", we
could found `nameidata->link_inode` is NULL already.
Because the code goes through "walk_component > lookup_fast" and
"walk_component > step_into > pick_link", the `inode` assign to
`nameidata->link_inode` must comes from `lookup_fast`.

So it looks like something wrong in `lookup_fast`. Let me continue to
explain why this looks impossible.

In `walk_component`, `lookup_fast` have to return 1 (> 0), otherwise
it would fallback to `lookup_slow`.

```
err = lookup_fast(nd, &path, &inode, &seq);
if (unlikely(err <= 0)) {
  if (err < 0)
    return err;
  path.dentry = lookup_slow(&nd->last, nd->path.dentry,
        nd->flags);
}

return step_into
```

Because for our case, it looks like the code goes through
"walk_component > lookup_fast" and "walk_component > step_into >
pick_link",
This infers `lookup_fast` return 1 in this Oops.

Because `lookup_fast` return 1, it looks like the code goes through
the following path.

```
  if (nd->flags & LOOKUP_RCU) {
    ...

    *inode = d_backing_inode(dentry);
    negative = d_is_negative(dentry);

    ...
      ...
      if (negative)
        return -ENOENT;
      path->mnt = mnt;
      path->dentry = dentry;
      if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
        return 1;
```

As we see, if `*inode` is NULL, it should return `-ENOENT` because `if
(negative)` would be true, which is conflict with "`lookup_fast`
return 1".

And in `__d_clear_type_and_inode`, it always sets the dentry to
negative first and then sets d_inode to NULL.

```
static inline void __d_clear_type_and_inode(struct dentry *dentry)
{`
  unsigned flags = READ_ONCE(dentry->d_flags);

  flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);   // Set dentry to
negative first.
  WRITE_ONCE(dentry->d_flags, flags);
      // memory barrier
  dentry->d_inode = NULL;
               // Then set d_inode to NULL.
}
```

So looks like `inode` in `lookup_fast` should not be NULL if it could
skip `if (negative)` check even in the RCU case. Unless

```
# in lookup_fast method
  *inode = d_backing_inode(dentry);
  negative = d_is_negative(dentry);
```

is reorder to

```
# in lookup_fast method
  negative = d_is_negative(dentry);
  *inode = d_backing_inode(dentry);
```

when CPU executing the code. But is this possible in RCU?

I diff my local ubuntu's code with upstream tag v4.15.18, it looks
like no different in `fs/namei.c`, `fs/dcache.c`, `fs/proc`.
So possible the problem may happen to upstream tag v4.15.18 as well,
sadly my script still could not reproduce the issue on the server so
far,
would like to see if any insights from you then I could continue to
check what's wrong in this Oops, thank you in advance!

On Tue, Apr 27, 2021 at 1:30 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Apr 27, 2021 at 01:16:44AM +0800, haosdent wrote:
> > > really should not assume ->d_inode stable
> >
> > Hi, Alexander, sorry to disturb you again. Today I try to check what
> > `dentry->d_inode` and `nd->link_inode` looks like when `dentry` is
> > already been killed in `__dentry_kill`.
> >
> > ```
> > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > ```
> >
> > It looks like `dentry->d_inode` could be NULL while `nd->link_inode`
> > is always has value.
> > But this make me confuse, by right `nd->link_inode` is get from
> > `dentry->d_inode`, right?
>
> It's sampled from there, yes.  And in RCU mode there's nothing to
> prevent a previously positive dentry from getting negative and/or
> killed.  ->link_inode (used to - it's gone these days) go with
> ->seq, which had been sampled from dentry->d_seq before fetching
> ->d_inode and then verified to have ->d_seq remain unchanged.
> That gives you "dentry used to have this inode at the time it
> had this d_seq", and that's what gets used to validate the sucker
> when we switch to non-RCU mode (look at legitimize_links()).
>
> IOW, we know that
>         * at some point during the pathwalk that sucker had this inode
>         * the inode won't get freed until we drop out of RCU mode
>         * if we need to go to non-RCU (and thus grab dentry references)
> while we still need that inode, we will verify that nothing has happened
> to that link (same ->d_seq, so it still refers to the same inode) and
> grab dentry reference, making sure it won't go away or become negative
> under us.  Or we'll fail (in case something _has_ happened to dentry)
> and repeat the entire thing in non-RCU mode.



-- 
Best Regards,
Haosdent Huang

  reply	other threads:[~2021-05-03 15:32 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAFt=RON+KYYf5yt9vM3TdOSn4zco+3XtFyi3VDRr1vbQUBPZ0g@mail.gmail.com>
2021-04-25 16:50 ` NULL pointer dereference when access /proc/net Al Viro
2021-04-25 17:04   ` haosdent
2021-04-25 17:14     ` haosdent
2021-04-25 17:22     ` Al Viro
2021-04-25 18:00       ` haosdent
2021-04-25 18:15         ` haosdent
2021-04-26 17:16       ` haosdent
2021-04-26 17:30         ` Al Viro
2021-05-03 15:31           ` haosdent [this message]
2021-05-06 10:21             ` haosdent
2021-04-25 15:47 haosdent

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAFt=ROOnU-rVoLoP=bgR3Y3WpJmhTerAQZqFd-v=eZ306Vrobg@mail.gmail.com' \
    --to=haosdent@gmail.com \
    --cc=huangh@sea.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=zhengyu.duan@shopee.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).