From: Ilya Dryomov <idryomov@gmail.com>
To: Chris Dunlop <chris@onthe.net.au>
Cc: ceph-devel@vger.kernel.org
Subject: Re: rbd unmap fails with "Device or resource busy"
Date: Wed, 21 Sep 2022 12:40:54 +0200
Message-ID: <CAOi1vP__Mj9Qyb=WsUxo7ja5koTS+0eavsnWH=X+DTest4spaQ@mail.gmail.com>
In-Reply-To: <20220921013629.GA1583272@onthe.net.au>

On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi Ilya,
>
> On Mon, Sep 19, 2022 at 12:14:06PM +0200, Ilya Dryomov wrote:
> > On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>>> What can make a "rbd unmap" fail, assuming the device is not
> >>>> mounted and not (obviously) open by any other processes?
> >>
> >> E.g. maybe there's some way of using ebpf or similar to look at the
> >> 'rbd_dev->open_count' in the live kernel?
> >>
> >> And/or maybe there's some way, again using ebpf or similar, to record
> >> sufficient info (e.g. a stack trace?) from rbd_open() and
> >> rbd_release() to try to identify something that's opening the device
> >> and not releasing it?
> >
> > Attaching kprobes to rbd_open() and rbd_release() is probably the
> > fastest option.  I don't think you even need a stack trace; PID and
> > comm (process name) should do.  I would start with something like:
> >
> > # bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid,
> > comm) } kprobe:rbd_release { printf("release pid %d comm %s\n", pid,
> > comm) }'
> >
> > Fetching the actual rbd_dev->open_count value is more involved but
> > also doable.
>
> Excellent! Thanks!
>
> tl;dr there's something other than the open_count causing the unmap
> failures - or something's elevating and decrementing open_count without
> going through rbd_open and rbd_release. Or perhaps there's some situation
> whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
>
> FYI, the production process is:
>
> - create snapshot of rbd
> - map
> - mount with ro,norecovery,nouuid (the original live fs is still mounted)
> - export via NFS
> - mount on Windows NFS client
> - process on Windows
> - remove Windows NFS mount
> - unexport from NFS
> - unmount
> - unmap
>
> (I haven't mentioned the NFS export previously because I thought the
> issue was replicable without it - but that might simply have been due to
> the 'pvs' issue which has been resolved.)
>
> I now have a script that mimics the above production sequence in a loop
> and left it running all night. Out of 288 iterations there were 13 instances
> where the unmap kept failing for some time (i.e. in all cases it
> eventually succeeded, unlike the 51 rbd devices I can't seem to unmap at
> all without using --force). In the failing cases the unmap was retried
> at 1 second intervals. The shortest time taken to eventually unmap was
> 521 seconds, the longest was 793 seconds.
>
> Note, in the below I'm using "successful" for the tests where the first
> unmap succeeded, and "failed" for the tests where the first unmap
> failed, although in all cases the unmap eventually succeeded.
>
> I ended up with a bpftrace script (see below) that logs the timestamp,
> open or release (O/R), pid, device name, open_count (at entry to the
> function), and process name.
>
> A successful iteration of that process mostly looks like this:
>
> Timestamp     O/R Pid    Device Count Process
> 18:21:18.235870 O 3269426 rbd29 0 mapper
> 18:21:20.088873 R 3269426 rbd29 1 mapper
> 18:21:20.089346 O 3269447 rbd29 0 systemd-udevd
> 18:21:20.105281 O 3269457 rbd29 1 blkid
> 18:21:31.858621 R 3269457 rbd29 2 blkid
> 18:21:31.861762 R 3269447 rbd29 1 systemd-udevd
> 18:21:31.882235 O 3269475 rbd29 0 mount
> 18:21:38.241808 R 3269475 rbd29 1 mount
> 18:21:38.242174 O 3269475 rbd29 0 mount
> 18:22:49.646608 O 2364320 rbd29 1 rpc.mountd
> 18:22:58.715634 R 2364320 rbd29 2 rpc.mountd
> 18:23:55.564512 R 3270060 rbd29 1 umount
>
> Or occasionally it looks like this, with "rpc.mountd" disappearing:
>
> 18:35:49.539224 O 3277664 rbd29 0 mapper
> 18:35:50.515777 R 3277664 rbd29 1 mapper
> 18:35:50.516224 O 3277685 rbd29 0 systemd-udevd
> 18:35:50.531978 O 3277694 rbd29 1 blkid
> 18:35:57.361799 R 3277694 rbd29 2 blkid
> 18:35:57.365263 R 3277685 rbd29 1 systemd-udevd
> 18:35:57.384316 O 3277713 rbd29 0 mount
> 18:36:01.234337 R 3277713 rbd29 1 mount
> 18:36:01.234849 O 3277713 rbd29 0 mount
> 18:37:21.304270 R 3289527 rbd29 1 umount
>
> Of the 288 iterations, only 20 didn't include the rpc.mountd lines.
>
> An unsuccessful iteration looks like this:
>
> 18:37:31.885408 O 3294108 rbd29 0 mapper
> 18:37:33.181607 R 3294108 rbd29 1 mapper
> 18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
> 18:37:33.197982 O 3294691 rbd29 1 blkid
> 18:37:42.712870 R 3294691 rbd29 2 blkid
> 18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
> 18:37:42.738469 O 3298073 rbd29 0 mount
> 18:37:49.339012 R 3298073 rbd29 1 mount
> 18:37:49.339352 O 3298073 rbd29 0 mount
> 18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
> 18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
> 18:53:56.054685 R 3313923 rbd29 1 init
>
> According to my script log, the first unmap attempt was at 18:39:42,
> i.e. 42 seconds after rpc.mountd released the device. At that point the
> open_count was (or should have been?) 1 again, allowing the unmap to
> succeed - but it didn't. The unmap was retried every second until it

Hi Chris,

For unmap to go through, open_count must be 0.  rpc.mountd at
18:39:00.989050 just decremented it from 2 to 1; it didn't release
the device.
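
If you want to confirm that it really is EBUSY coming back out of the
unmap path (rather than something going wrong in userspace), tracing
the return value of the sysfs remove handler is another option.  A
rough sketch, assuming do_rbd_remove() isn't inlined on your kernel:

# bpftrace -e 'kretprobe:do_rbd_remove { printf("remove pid %d comm %s rc %d\n", pid, comm, retval) }'

A return value of -16 (-EBUSY) there should correspond to open_count
still being non-zero at the moment the remove was attempted.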

> eventually succeeded at 18:53:56, the same time as the mysterious "init"
> process ran - but also note there is NO "umount" process in there so I
> don't know if the name of the process recorded by bpftrace is simply
> incorrect (but how would that happen??) or what else could be going on.

I would suggest adding the PID and the kernel stack trace at this
point.
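
Something along these lines for the release probe should do (the stack
will at least show whether that final release comes in through a normal
blkdev close or through some deferred/asynchronous path):

# bpftrace -e 'kprobe:rbd_release { printf("release pid %d comm %s\n%s\n", pid, comm, kstack) }'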

>
> All 13 of the failed iterations recorded that weird "init" instead of
> "umount".

Yeah, that seems to be the culprit.

>
> 12 of the 13 failed iterations included rpc.mountd in the trace, but one
> didn't (it went directly from mount to init/umount, like the 2nd
> successful example above) - around the same proportion as in the
> successful iterations.
>
> So it seems there's something other than the open_count causing the unmap
> failures - or something's elevating and decrementing open_count without
> going through rbd_open and rbd_release. Or perhaps there's some situation
> whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
>
>
> The bpftrace script looks like this:
> --------------------------------------------------------------------
> //
> // bunches of defines and structure definitions extracted from
> // drivers/block/rbd.c elided here...
> //

It would be good to attach the entire script, just in case someone runs
into a similar issue in the future and tries to debug the same way.
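
For anyone who does, I imagine the probes themselves look roughly like
the below.  This is just a sketch: it assumes the struct rbd_device
definition (and the types it refers to) is pasted in above the probes
from drivers/block/rbd.c, since it isn't available through exported
headers, and a bpftrace new enough to have strftime() (the %f
microseconds specifier is a bpftrace extension - drop it if yours
doesn't support it):

kprobe:rbd_open
{
        // rbd_open(struct block_device *bdev, fmode_t mode)
        $bdev = (struct block_device *)arg0;
        $rbd = (struct rbd_device *)$bdev->bd_disk->private_data;
        printf("%s O %d rbd%d %lu %s\n", strftime("%H:%M:%S.%f", nsecs),
               pid, $rbd->dev_id, $rbd->open_count, comm);
}

kprobe:rbd_release
{
        // rbd_release(struct gendisk *disk, fmode_t mode)
        $disk = (struct gendisk *)arg0;
        $rbd = (struct rbd_device *)$disk->private_data;
        printf("%s R %d rbd%d %lu %s\n", strftime("%H:%M:%S.%f", nsecs),
               pid, $rbd->dev_id, $rbd->open_count, comm);
}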

Thanks,

                Ilya
