From: Ilya Dryomov <idryomov@gmail.com>
To: Chris Dunlop <chris@onthe.net.au>
Cc: ceph-devel@vger.kernel.org
Subject: Re: rbd unmap fails with "Device or resource busy"
Date: Wed, 21 Sep 2022 12:40:54 +0200
Message-ID: <CAOi1vP__Mj9Qyb=WsUxo7ja5koTS+0eavsnWH=X+DTest4spaQ@mail.gmail.com>
In-Reply-To: <20220921013629.GA1583272@onthe.net.au>

On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi Ilya,
>
> On Mon, Sep 19, 2022 at 12:14:06PM +0200, Ilya Dryomov wrote:
> > On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>>> What can make a "rbd unmap" fail, assuming the device is not
> >>>> mounted and not (obviously) open by any other processes?
> >>
> >> E.g. maybe there's some way of using ebpf or similar to look at the
> >> 'rbd_dev->open_count' in the live kernel?
> >>
> >> And/or maybe there's some way, again using ebpf or similar, to record
> >> sufficient info (e.g. a stack trace?) from rbd_open() and
> >> rbd_release() to try to identify something that's opening the device
> >> and not releasing it?
> >
> > Attaching kprobes to rbd_open() and rbd_release() is probably the
> > fastest option.  I don't think you even need a stack trace; PID and
> > comm (process name) should do.  I would start with something like:
> >
> > # bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid,
> > comm) } kprobe:rbd_release { printf("release pid %d comm %s\n", pid,
> > comm) }'
> >
> > Fetching the actual rbd_dev->open_count value is more involved but
> > also doable.
>
> Excellent! Thanks!
>
> tl;dr there's something other than the open_count causing the unmap
> failures - or something's elevating and decrementing open_count without
> going through rbd_open and rbd_release. Or perhaps there's some situation
> whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
>
> FYI, the production process is:
>
> - create snapshot of rbd
> - map
> - mount with ro,norecovery,nouuid (the original live fs is still mounted)
> - export via NFS
> - mount on Windows NFS client
> - process on Windows
> - remove Windows NFS mount
> - unexport from NFS
> - unmount
> - unmap
>
> (I haven't mentioned the NFS export previously because I thought the
> issue was replicable without it - but that might simply have been due to
> the 'pvs' issue which has been resolved.)
>
> I now have a script that mimics the above production sequence in a loop
> and left it running all night. Out of 288 iterations there were 13 instances
> where the unmap kept failing for some time (i.e. in all cases it
> eventually succeeded, unlike the 51 rbd devices I can't seem to unmap at
> all without using --force). In the failing cases the unmap was retried
> at 1 second intervals. The shortest time taken to eventually unmap was
> 521 seconds, the longest was 793 seconds.
>
> Note, in the below I'm using "successful" for the tests where the first
> unmap succeeded, and "failed" for the tests where the first unmap
> failed, although in all cases the unmap eventually succeeded.
>
> I ended up with a bpftrace script (see below) that logs the timestamp,
> open or release (O/R), pid, device name, open_count (at entry to the
> function), and process name.
>
> A successful iteration of that process mostly looks like this:
>
> Timestamp     O/R Pid    Device Count Process
> 18:21:18.235870 O 3269426 rbd29 0 mapper
> 18:21:20.088873 R 3269426 rbd29 1 mapper
> 18:21:20.089346 O 3269447 rbd29 0 systemd-udevd
> 18:21:20.105281 O 3269457 rbd29 1 blkid
> 18:21:31.858621 R 3269457 rbd29 2 blkid
> 18:21:31.861762 R 3269447 rbd29 1 systemd-udevd
> 18:21:31.882235 O 3269475 rbd29 0 mount
> 18:21:38.241808 R 3269475 rbd29 1 mount
> 18:21:38.242174 O 3269475 rbd29 0 mount
> 18:22:49.646608 O 2364320 rbd29 1 rpc.mountd
> 18:22:58.715634 R 2364320 rbd29 2 rpc.mountd
> 18:23:55.564512 R 3270060 rbd29 1 umount
>
> Or occasionally it looks like this, with "rpc.mountd" disappearing:
>
> 18:35:49.539224 O 3277664 rbd29 0 mapper
> 18:35:50.515777 R 3277664 rbd29 1 mapper
> 18:35:50.516224 O 3277685 rbd29 0 systemd-udevd
> 18:35:50.531978 O 3277694 rbd29 1 blkid
> 18:35:57.361799 R 3277694 rbd29 2 blkid
> 18:35:57.365263 R 3277685 rbd29 1 systemd-udevd
> 18:35:57.384316 O 3277713 rbd29 0 mount
> 18:36:01.234337 R 3277713 rbd29 1 mount
> 18:36:01.234849 O 3277713 rbd29 0 mount
> 18:37:21.304270 R 3289527 rbd29 1 umount
>
> Of the 288 iterations, only 20 didn't include the rpc.mountd lines.
>
> An unsuccessful iteration looks like this:
>
> 18:37:31.885408 O 3294108 rbd29 0 mapper
> 18:37:33.181607 R 3294108 rbd29 1 mapper
> 18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
> 18:37:33.197982 O 3294691 rbd29 1 blkid
> 18:37:42.712870 R 3294691 rbd29 2 blkid
> 18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
> 18:37:42.738469 O 3298073 rbd29 0 mount
> 18:37:49.339012 R 3298073 rbd29 1 mount
> 18:37:49.339352 O 3298073 rbd29 0 mount
> 18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
> 18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
> 18:53:56.054685 R 3313923 rbd29 1 init
>
> According to my script log, the first unmap attempt was at 18:39:42,
> i.e. 42 seconds after rpc.mountd released the device. At that point the
> open_count was (or should have been?) 1 again, allowing the unmap to
> succeed - but it didn't. The unmap was retried every second until it

Hi Chris,

For unmap to go through, open_count must be 0.  rpc.mountd at
18:39:00.989050 just decremented it from 2 to 1; it didn't release
the device.
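
If you want to confirm that it really is EBUSY coming back out of the
unmap path (rather than something going wrong in userspace), tracing
the return value of the sysfs remove handler is another option.  A
rough sketch, assuming do_rbd_remove() isn't inlined on your kernel:

# bpftrace -e 'kretprobe:do_rbd_remove { printf("remove pid %d comm %s rc %d\n", pid, comm, retval) }'

A return value of -16 (-EBUSY) there should correspond to open_count
still being non-zero at the moment the remove was attempted.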

> eventually succeeded at 18:53:56, the same time as the mysterious "init"
> process ran - but also note there is NO "umount" process in there so I
> don't know if the name of the process recorded by bpftrace is simply
> incorrect (but how would that happen??) or what else could be going on.

I would suggest adding the PID and the kernel stack trace at this
point.
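
Something along these lines for the release probe should do (the stack
will at least show whether that final release comes in through a normal
blkdev close or through some deferred/asynchronous path):

# bpftrace -e 'kprobe:rbd_release { printf("release pid %d comm %s\n%s\n", pid, comm, kstack) }'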

>
> All 13 of the failed iterations recorded that weird "init" instead of
> "umount".

Yeah, that seems to be the culprit.

>
> 12 of the 13 failed iterations included rpc.mountd in the trace, but one
> didn't (it went directly from mount to init/umount, like the 2nd
> successful example above) - around the same proportion as in the
> successful iterations.
>
> So it seems there's something other than the open_count causing the unmap
> failures - or something's elevating and decrementing open_count without
> going through rbd_open and rbd_release. Or perhaps there's some situation
> whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
>
>
> The bpftrace script looks like this:
> --------------------------------------------------------------------
> //
> // bunches of defines and structure definitions extracted from
> // drivers/block/rbd.c elided here...
> //

It would be good to attach the entire script, just in case someone runs
into a similar issue in the future and tries to debug the same way.
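
For anyone who does, I imagine the probes themselves look roughly like
the below.  This is just a sketch: it assumes the struct rbd_device
definition (and the types it refers to) is pasted in above the probes
from drivers/block/rbd.c, since it isn't available through exported
headers, and a bpftrace new enough to have strftime() (the %f
microseconds specifier is a bpftrace extension - drop it if yours
doesn't support it):

kprobe:rbd_open
{
        // rbd_open(struct block_device *bdev, fmode_t mode)
        $bdev = (struct block_device *)arg0;
        $rbd = (struct rbd_device *)$bdev->bd_disk->private_data;
        printf("%s O %d rbd%d %lu %s\n", strftime("%H:%M:%S.%f", nsecs),
               pid, $rbd->dev_id, $rbd->open_count, comm);
}

kprobe:rbd_release
{
        // rbd_release(struct gendisk *disk, fmode_t mode)
        $disk = (struct gendisk *)arg0;
        $rbd = (struct rbd_device *)$disk->private_data;
        printf("%s R %d rbd%d %lu %s\n", strftime("%H:%M:%S.%f", nsecs),
               pid, $rbd->dev_id, $rbd->open_count, comm);
}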

Thanks,

                Ilya
