* rbd unmap fails with "Device or resource busy"
@ 2022-09-13  1:20 Chris Dunlop
  2022-09-13 11:43 ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-09-13  1:20 UTC (permalink / raw)
  To: ceph-devel

Hi,

What can make a "rbd unmap" fail, assuming the device is not mounted and not 
(obviously) open by any other processes?

linux-5.15.58
ceph-16.2.9

I have multiple XFS on rbd filesystems, and often create rbd snapshots, map 
and read-only mount the snapshot, perform some work on the fs, then unmount 
and unmap. The unmap regularly (about 1 in 10 times) fails like:

$ sudo rbd unmap /dev/rbd29
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

I've double checked the device is no longer mounted, and, using "lsof" etc., 
nothing has the device open.

A "rbd unmap -f" can unmap the "busy" device but I'm concerned this may have 
undesirable consequences, e.g. ceph resource leakage, or even potential data 
corruption on non-read-only mounts.

I've found that waiting "a while", e.g. 5-30 minutes, will usually allow the 
"busy" device to be unmapped without the -f flag.

A simple "map/mount/read/unmount/unmap" test sees the unmap fail about 1 in 10 
times. When it fails it often takes 30 min or more for the unmap to finally 
succeed. E.g.:

----------------------------------------
#!/bin/bash

set -e

rbdname=pool/name

for ((i=0; ++i<=50; )); do
   dev=$(rbd map "${rbdname}")
   mount -oro,norecovery,nouuid "${dev}" /mnt/test

   dd if="/mnt/test/big-file" of=/dev/null bs=1G count=1
   umount /mnt/test
   # blockdev --flushbufs "${dev}"
   for ((j=0; ++j; )); do
     rbd unmap "${dev}" && break
     sleep 5m
   done
done
----------------------------------------

Running "blockdev --flushbufs" prior to the unmap doesn't change the unmap 
failures.

What can I look at to see what's causing these unmaps to fail?

Chris


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-13  1:20 rbd unmap fails with "Device or resource busy" Chris Dunlop
@ 2022-09-13 11:43 ` Ilya Dryomov
  2022-09-14  3:49   ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-13 11:43 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi,
>
> What can make a "rbd unmap" fail, assuming the device is not mounted and not
> (obviously) open by any other processes?
>
> linux-5.15.58
> ceph-16.2.9
>
> I have multiple XFS on rbd filesystems, and often create rbd snapshots, map
> and read-only mount the snapshot, perform some work on the fs, then unmount
> and unmap. The unmap regularly (about 1 in 10 times) fails like:
>
> $ sudo rbd unmap /dev/rbd29
> rbd: sysfs write failed
> rbd: unmap failed: (16) Device or resource busy
>
> I've double checked the device is no longer mounted, and, using "lsof" etc.,
> nothing has the device open.

Hi Chris,

One thing that "lsof" is oblivious to is multipath, see
https://tracker.ceph.com/issues/12763.
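
A quick way to check for a device-mapper/multipath holder, which "lsof"
won't show, is something like this (only a sketch, using /dev/rbd29 from
your example):

  ls /sys/block/rbd29/holders/   # any dm-* entries mean a dm holder
  dmsetup deps -o devname        # which devices each dm target sits on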

>
> A "rbd unmap -f" can unmap the "busy" device but I'm concerned this may have
> undesirable consequences, e.g. ceph resource leakage, or even potential data
> corruption on non-read-only mounts.
>
> I've found that waiting "a while", e.g. 5-30 minutes, will usually allow the
> "busy" device to be unmapped without the -f flag.

"Device or resource busy" error from "rbd unmap" clearly indicates
that the block device is still open by something.  In this case -- you
are mounting a block-level snapshot of an XFS filesystem whose "HEAD"
is already mounted -- perhaps it could be some background XFS worker
thread?  I'm not sure if "nouuid" mount option solves all issues there.

>
> A simple "map/mount/read/unmount/unmap" test sees the unmap fail about 1 in 10
> times. When it fails it often takes 30 min or more for the unmap to finally
> succeed. E.g.:
>
> ----------------------------------------
> #!/bin/bash
>
> set -e
>
> rbdname=pool/name
>
> for ((i=0; ++i<=50; )); do
>    dev=$(rbd map "${rbdname}")
>    mount -oro,norecovery,nouuid "${dev}" /mnt/test
>
>    dd if="/mnt/test/big-file" of=/dev/null bs=1G count=1
>    umount /mnt/test
>    # blockdev --flushbufs "${dev}"
>    for ((j=0; ++j; )); do
>      rbd unmap "${dev}" && break
>      sleep 5m
>    done
> done
> ----------------------------------------
>
> Running "blockdev --flushbufs" prior to the unmap doesn't change the unmap
> failures.

Yeah, I wouldn't expect that to affect anything there.

Have you encountered this error in other scenarios, i.e. without
mounting snapshots this way or with ext4 instead of XFS?

Thanks,

                Ilya


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-13 11:43 ` Ilya Dryomov
@ 2022-09-14  3:49   ` Chris Dunlop
  2022-09-14  8:41     ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-09-14  3:49 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

Hi Ilya,

On Tue, Sep 13, 2022 at 01:43:16PM +0200, Ilya Dryomov wrote:
> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
>> What can make a "rbd unmap" fail, assuming the device is not mounted 
>> and not (obviously) open by any other processes?
>>
>> linux-5.15.58
>> ceph-16.2.9
>>
>> I have multiple XFS on rbd filesystems, and often create rbd snapshots, 
>> map and read-only mount the snapshot, perform some work on the fs, then 
>> unmount and unmap. The unmap regularly (about 1 in 10 times) fails 
>> like:
>>
>> $ sudo rbd unmap /dev/rbd29
>> rbd: sysfs write failed
>> rbd: unmap failed: (16) Device or resource busy
>>
>> I've double checked the device is no longer mounted, and, using "lsof" 
>> etc., nothing has the device open.
>
> One thing that "lsof" is oblivious to is multipath, see
> https://tracker.ceph.com/issues/12763.

The server is not using multipath - e.g. there's no multipathd, and:

$ find /dev/mapper/ -name '*mpath*'

...finds nothing.

>> I've found that waiting "a while", e.g. 5-30 minutes, will usually 
>> allow the "busy" device to be unmapped without the -f flag.
>
> "Device or resource busy" error from "rbd unmap" clearly indicates
> that the block device is still open by something.  In this case -- you
> are mounting a block-level snapshot of an XFS filesystem whose "HEAD"
> is already mounted -- perhaps it could be some background XFS worker
> thread?  I'm not sure if "nouuid" mount option solves all issues there.

Good suggestion, I should have considered that first. I've now tried it 
without the mount at all, i.e. with no XFS or other filesystem:

------------------------------------------------------------------------------
#!/bin/bash
set -e
rbdname=pool/name
for ((i=0; ++i<=50; )); do
   dev=$(rbd map "${rbdname}")
   ts "${i}: ${dev}"
   dd if="${dev}" of=/dev/null bs=1G count=1
   for ((j=0; ++j; )); do
     rbd unmap "${dev}" && break
     sleep 1m
   done
   (( j > 1 )) && echo "$j minutes to unmap"
done
------------------------------------------------------------------------------

This failed at about the same rate, i.e. around 1 in 10. This time it only 
took 2 minutes each time to successfully unmap after the initial unmap 
failed - I'm not sure if this is due to the test change (no mount), or 
related to how busy the machine is otherwise.

The upshot is, it definitely looks like there's something related to the 
underlying rbd that's preventing the unmap.

> Have you encountered this error in other scenarios, i.e. without
> mounting snapshots this way or with ext4 instead of XFS?

I've seen the same issue after unmounting r/w filesystems, but I don't do 
that nearly as often so it hasn't been a pain point. However, per the test 
above, the issue is unrelated to the mount.

Cheers,

Chris


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-14  3:49   ` Chris Dunlop
@ 2022-09-14  8:41     ` Ilya Dryomov
  2022-09-15  8:29       ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-14  8:41 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Wed, Sep 14, 2022 at 5:49 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi Ilya,
>
> On Tue, Sep 13, 2022 at 01:43:16PM +0200, Ilya Dryomov wrote:
> > On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >> What can make a "rbd unmap" fail, assuming the device is not mounted
> >> and not (obviously) open by any other processes?
> >>
> >> linux-5.15.58
> >> ceph-16.2.9
> >>
> >> I have multiple XFS on rbd filesystems, and often create rbd snapshots,
> >> map and read-only mount the snapshot, perform some work on the fs, then
> >> unmount and unmap. The unmap regularly (about 1 in 10 times) fails
> >> like:
> >>
> >> $ sudo rbd unmap /dev/rbd29
> >> rbd: sysfs write failed
> >> rbd: unmap failed: (16) Device or resource busy
> >>
> >> I've double checked the device is no longer mounted, and, using "lsof"
> >> etc., nothing has the device open.
> >
> > One thing that "lsof" is oblivious to is multipath, see
> > https://tracker.ceph.com/issues/12763.
>
> The server is not using multipath - e.g. there's no multipathd, and:
>
> $ find /dev/mapper/ -name '*mpath*'
>
> ...finds nothing.
>
> >> I've found that waiting "a while", e.g. 5-30 minutes, will usually
> >> allow the "busy" device to be unmapped without the -f flag.
> >
> > "Device or resource busy" error from "rbd unmap" clearly indicates
> > that the block device is still open by something.  In this case -- you
> > are mounting a block-level snapshot of an XFS filesystem whose "HEAD"
> > is already mounted -- perhaps it could be some background XFS worker
> > thread?  I'm not sure if "nouuid" mount option solves all issues there.
>
> Good suggestion, I should have considered that first. I've now tried it
> without the mount at all, i.e. with no XFS or other filesystem:
>
> ------------------------------------------------------------------------------
> #!/bin/bash
> set -e
> rbdname=pool/name
> for ((i=0; ++i<=50; )); do
>    dev=$(rbd map "${rbdname}")
>    ts "${i}: ${dev}"
>    dd if="${dev}" of=/dev/null bs=1G count=1
>    for ((j=0; ++j; )); do
>      rbd unmap "${dev}" && break
>      sleep 1m
>    done
>    (( j > 1 )) && echo "$j minutes to unmap"
> done
> ------------------------------------------------------------------------------
>
> This failed at about the same rate, i.e. around 1 in 10. This time it only
> took 2 minutes each time to successfully unmap after the initial unmap
> failed - I'm not sure if this is due to the test change (no mount), or
> related to how busy the machine is otherwise.

I would suggest repeating this test with "sleep 1s" to get a better
idea of how long it really takes.

>
> The upshot is, it definitely looks like there's something related to the
> underlying rbd that's preventing the unmap.

I don't think so.  To confirm, now that there is no filesystem in the
mix, replace "rbd unmap" with "rbd unmap -o force".  If that fixes the
issue, RBD is very unlikely to have anything to do with it because all
"force" does is it overrides the "is this device still open" check
at the very top of "rbd unmap" handler in the kernel.
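
In other words (sketch only, with the device path from your earlier
output):

  rbd unmap /dev/rbd29             # fails with EBUSY while the device is held open
  rbd unmap -o force /dev/rbd29    # skips that open-count check in the kernel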

systemd-udevd may open block devices behind your back.  "rbd unmap"
command actually does a retry internally to work around that:

  /*
   * On final device close(), kernel sends a block change event, in
   * response to which udev apparently runs blkid on the device.  This
   * makes unmap fail with EBUSY, if issued right after final close().
   * Try to circumvent this with a retry before turning to udev.
   */
  for (int tries = 0; ; tries++) {
    int sysfs_r = sysfs_write_rbd_remove(buf);
    if (sysfs_r == -EBUSY && tries < 2) {
      if (!tries) {
        usleep(250 * 1000);
      } else if (!(flags & KRBD_CTX_F_NOUDEV)) {
        /*
         * libudev does not provide the "wait until the queue is empty"
         * API or the sufficient amount of primitives to build it from.
         */
        std::string err = run_cmd("udevadm", "settle", "--timeout", "10",
                                  (char *)NULL);
        if (!err.empty())
          std::cerr << "rbd: " << err << std::endl;
      }

Perhaps it is hitting "udevadm settle" timeout on your system?
"strace -f" might be useful here.

Thanks,

                Ilya


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-14  8:41     ` Ilya Dryomov
@ 2022-09-15  8:29       ` Chris Dunlop
  2022-09-19  7:43         ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-09-15  8:29 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

On Wed, Sep 14, 2022 at 10:41:05AM +0200, Ilya Dryomov wrote:
> On Wed, Sep 14, 2022 at 5:49 AM Chris Dunlop <chris@onthe.net.au> wrote:
>> On Tue, Sep 13, 2022 at 01:43:16PM +0200, Ilya Dryomov wrote:
>>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
>>>> What can make a "rbd unmap" fail, assuming the device is not mounted
>>>> and not (obviously) open by any other processes?
>>>>
>>>> linux-5.15.58
>>>> ceph-16.2.9
>>>>
>>>> I have multiple XFS on rbd filesystems, and often create rbd snapshots,
>>>> map and read-only mount the snapshot, perform some work on the fs, then
>>>> unmount and unmap. The unmap regularly (about 1 in 10 times) fails
>>>> like:
>>>>
>>>> $ sudo rbd unmap /dev/rbd29
>>>> rbd: sysfs write failed
>>>> rbd: unmap failed: (16) Device or resource busy

tl;dr problem solved: there WAS a process holding the rbd device open.

The culprit was a 'pvs' command being run periodically by 'ceph-volume'. 
When the 'rbd unmap' was attempted while the 'pvs' command was running,
the unmap would fail.

It turns out the 'dd' command in my test script was only instrumental
insofar as it made the test run long enough to intersect with the
periodic 'pvs'. I had been thinking the 'dd' was causing the rbd data
to be buffered in the kernel, and that perhaps the buffers would
sometimes not be cleared immediately, causing the rbd unmap to fail.

The conflicting 'pvs' command was a bit tricky to catch because it was 
only running for a very short time, so the 'pvs' would be gone by the 
time I'd run 'lsof'. The key to finding the problem was to look through 
the processes as quickly as possible upon an unmap failure, e.g.:

----------------------------------------------------------------------
if ! rbd device unmap "${dev}"; then
   while read -r p; do
     p=${p#/proc/}; p=${p%%/*}
     (( p == prevp )) && continue
     prevp=$p

     printf '%(%F %T)T %d\t%s\n' -1 "${p}" "$(tr '\0' ' ' < /proc/${p}/cmdline)"

     pp=$(awk '$1=="PPid:"{print $2}' /proc/${p}/status)
     printf '+ %d\t%s\n' "${pp}" "$(tr '\0' ' ' < /proc/${pp}/cmdline)"

     ppp=$(awk '$1=="PPid:"{print $2}' /proc/${pp}/status)
     printf '+ %d\t%s\n' "${ppp}" "$(tr '\0' ' ' < /proc/${ppp}/cmdline)"
   done < <(
     find /proc/[0-9]*/fd -lname "${dev}" 2> /dev/null
   )
fi
----------------------------------------------------------------------
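
A simpler variant of the same check, assuming the psmisc 'fuser' tool is
available, is to print any process that still has the device node open
as soon as the unmap fails, e.g. (sketch only):

----------------------------------------------------------------------
if ! rbd device unmap "${dev}"; then
   fuser -v "${dev}"
fi
----------------------------------------------------------------------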

Note that 'pvs' normally does NOT scan rbd devices: you have to 
explicitly add "rbd" to the lvm.conf element for "List of additional 
acceptable block device types", e.g.:

/etc/lvm/lvm.conf
--
devices {
         types = [ "rbd", 1024 ]
}
--

I'd previously enabled the rbd scanning when testing some lvm-on-rbd 
stuff.

After removing rbd from the lvm.conf I was able to run through my unmap 
test 150 times without a single unmap failure.
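
For anyone wanting to reproduce the two behaviours without editing
lvm.conf, a --config override on the command line should do it (sketch
only):

----------------------------------------------------------------------
pvs --config 'devices { types = [ "rbd", 1024 ] }'   # scans /dev/rbd*
pvs                                                  # default: rbd devices ignored
----------------------------------------------------------------------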

>> ---------------------------------------------------------------------
>> #!/bin/bash
>> set -e
>> rbdname=pool/name
>> for ((i=0; ++i<=50; )); do
>>    dev=$(rbd map "${rbdname}")
>>    ts "${i}: ${dev}"
>>    dd if="${dev}" of=/dev/null bs=1G count=1
>>    for ((j=0; ++j; )); do
>>      rbd unmap "${dev}" && break
>>      sleep 1m
>>    done
>>    (( j > 1 )) && echo "$j minutes to unmap"
>> done
>> ---------------------------------------------------------------------
>>
>> This failed at about the same rate, i.e. around 1 in 10. This time it 
>> only took 2 minutes each time to successfully unmap after the initial 
>> unmap failed - I'm not sure if this is due to the test change (no 
>> mount), or related to how busy the machine is otherwise.
>
> I would suggest repeating this test with "sleep 1s" to get a better 
> idea of how long it really takes.

With "sleep 1s" it was generally successful the 2nd time around. I'm a 
bit puzzled at this because I'm certain, before I started scripting this 
test, I was doing many unmap attempts before finally successfully 
unmapping. I was convinced it was a matter of waiting for "something" to 
time out before the device was released, and in the meantime 'lsof' 
wasn't showing anything with the device open. It's implausible I was 
running into the 'pvs' command each of those times, so what was actually 
going on there is a bit of a mystery.

> I don't think so.  To confirm, now that there is no filesystem in the
> mix, replace "rbd unmap" with "rbd unmap -o force".  If that fixes the
> issue, RBD is very unlikely to have anything to do with it because all
> "force" does is it overrides the "is this device still open" check
> at the very top of "rbd unmap" handler in the kernel.

I'd already confirmed "-o force" (or --force) would remove the device 
but I was concerned that could possibly cause data corruption if/when 
using a writable rbd so I wanted to get to the bottom of the problem.

> systemd-udevd may open block devices behind your back.  "rbd unmap"
> command actually does a retry internally to work around that:

Huh, interesting.

> Perhaps it is hitting "udevadm settle" timeout on your system?
> "strace -f" might be useful here.

A good suggestion although using 'strace' wasn't necessary in the end.


Thanks for your help!

Chris


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-15  8:29       ` Chris Dunlop
@ 2022-09-19  7:43         ` Chris Dunlop
  2022-09-19 10:14           ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-09-19  7:43 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

On Thu, Sep 15, 2022 at 06:29:20PM +1000, Chris Dunlop wrote:
> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
>> What can make a "rbd unmap" fail, assuming the device is not mounted 
>> and not (obviously) open by any other processes?
>>
>> linux-5.15.58
>> ceph-16.2.9
>>
>> I have multiple XFS on rbd filesystems, and often create rbd 
>> snapshots, map and read-only mount the snapshot, perform some work on 
>> the fs, then unmount and unmap. The unmap regularly (about 1 in 10 
>> times) fails like:
>>
>> $ sudo rbd unmap /dev/rbd29
>> rbd: sysfs write failed
>> rbd: unmap failed: (16) Device or resource busy
>
> tl;dr problem solved: there WAS a process holding the rbd device open.

Sigh. It turns out the problem is NOT solved.

I've stopped 'pvs' from scanning the rbd devices. This was sufficient to 
allow my minimal test script to work without unmap failures, but my full 
production process is still suffering from the unmap failures.

I now have 51 rbd devices which I haven't been able to unmap for the 
last three days (in contrast to my earlier statement where I said I'd 
always been able to unmap eventually, generally after 30 minutes or so).  
That's out of maybe 80-90 mapped rbds over that time.

I've no idea why the unmap failures are so common this time, and why, 
this time, I haven't been able to unmap them in 3 days.

I had been trying an unmap of one specific rbd (randomly selected) every 
second for 3 hours whilst simultaneously, in a tight loop, looking for 
any other processes that have the device open. The unmaps continued to 
fail and I haven't caught any other process with the device open.

I also tried a back-off strategy by linearly increasing a sleep between 
unmap attempts.  By the time the sleep was up to 4 hours I gave up, with 
unmaps of that device still failing. Unmap attempts at random times 
since then on that particular device and all of the other 51 
un-unmappable devices continue to fail.

I'm sure I can unmap the devices using '--force' but at this point I'd 
rather try to work out WHY the unmap is failing: it seems to be pointing 
to /something/ going wrong, somewhere. Given no user processes can be 
seen to have the device open, it seems that "something" might be in the 
kernel somewhere.

I'm trying to put together a test using a cut down version of the 
production process to see if I can make the unmap failures happen a 
little more repeatably.

I'm open to suggestions as to what I can look at.

E.g. maybe there's some way of using ebpf or similar to look at the 
'rbd_dev->open_count' in the live kernel?

And/or maybe there's some way, again using ebpf or similar, to record 
sufficient info (e.g. a stack trace?) from rbd_open() and rbd_release() 
to try to identify something that's opening the device and not releasing 
it?

If anyone knows how that could be done that would be great, otherwise 
it's going to take me a bit of time to try to work out how that might be 
done.

Chris


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-19  7:43         ` Chris Dunlop
@ 2022-09-19 10:14           ` Ilya Dryomov
  2022-09-21  1:36             ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-19 10:14 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> On Thu, Sep 15, 2022 at 06:29:20PM +1000, Chris Dunlop wrote:
> > On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >> What can make a "rbd unmap" fail, assuming the device is not mounted
> >> and not (obviously) open by any other processes?
> >>
> >> linux-5.15.58
> >> ceph-16.2.9
> >>
> >> I have multiple XFS on rbd filesystems, and often create rbd
> >> snapshots, map and read-only mount the snapshot, perform some work on
> >> the fs, then unmount and unmap. The unmap regularly (about 1 in 10
> >> times) fails like:
> >>
> >> $ sudo rbd unmap /dev/rbd29
> >> rbd: sysfs write failed
> >> rbd: unmap failed: (16) Device or resource busy
> >
> > tl;dr problem solved: there WAS a process holding the rbd device open.
>
> Sigh. It turns out the problem is NOT solved.
>
> I've stopped 'pvs' from scanning the rbd devices. This was sufficient to
> allow my minimal test script to work without unmap failures, but my full
> production process is still suffering from the unmap failures.
>
> I now have 51 rbd devices which I haven't been able to unmap for the
> last three days (in contrast to my earlier statement where I said I'd
> always been able to unmap eventually, generally after 30 minutes or so).
> That's out of maybe 80-90 mapped rbds over that time.
>
> I've no idea why the unmap failures are so common this time, and why,
> this time, I haven't been able to unmap them in 3 days.
>
> I had been trying an unmap of one specific rbd (randomly selected) every
> second for 3 hours whilst simultaneously, in a tight loop, looking for
> any other processes that have the device open. The unmaps continued to
> fail and I haven't caught any other process with the device open.
>
> I also tried a back-off strategy by linearly increasing a sleep between
> unmap attempts.  By the time the sleep was up to 4 hours I gave up, with
> unmaps of that device still failing. Unmap attempts at random times
> since then on that particular device and all of the other 51
> un-unmappable devices continue to fail.
>
> I'm sure I can unmap the devices using '--force' but at this point I'd
> rather try to work out WHY the unmap is failing: it seems to be pointing
> to /something/ going wrong, somewhere. Given no user processes can be
> seen to have the device open, it seems that "something" might be in the
> kernel somewhere.
>
> I'm trying to put together a test using a cut down version of the
> production process to see if I can make the unmap failures happen a
> little more repeatably.
>
> I'm open to suggestions as to what I can look at.
>
> E.g. maybe there's some way of using ebpf or similar to look at the
> 'rbd_dev->open_count' in the live kernel?
>
> And/or maybe there's some way, again using ebpf or similar, to record
> sufficient info (e.g. a stack trace?) from rbd_open() and rbd_release()
> to try to identify something that's opening the device and not releasing
> it?

Hi Chris,

Attaching kprobes to rbd_open() and rbd_release() is probably the
fastest option.  I don't think you even need a stack trace, PID and
comm (process name) should do.  I would start with something like:

# bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid,
comm) } kprobe:rbd_release { printf("release pid %d comm %s\n", pid,
comm) }'

Fetching the actual rbd_dev->open_count value is more involved but
also doable.

Thanks,

                Ilya


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-19 10:14           ` Ilya Dryomov
@ 2022-09-21  1:36             ` Chris Dunlop
  2022-09-21 10:40               ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-09-21  1:36 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

Hi Ilya,

On Mon, Sep 19, 2022 at 12:14:06PM +0200, Ilya Dryomov wrote:
> On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@onthe.net.au> wrote:
>>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
>>>> What can make a "rbd unmap" fail, assuming the device is not 
>>>> mounted and not (obviously) open by any other processes?
>>
>> E.g. maybe there's some way of using ebpf or similar to look at the 
>> 'rbd_dev->open_count' in the live kernel?
>>
>> And/or maybe there's some way, again using ebpf or similar, to record 
>> sufficient info (e.g. a stack trace?) from rbd_open() and 
>> rbd_release() to try to identify something that's opening the device 
>> and not releasing it?
>
> Attaching kprobes to rbd_open() and rbd_release() is probably the 
> fastest option.  I don't think you even need a stack trace, PID and 
> comm (process name) should do.  I would start with something like:
>
> # bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid, 
> comm) } kprobe:rbd_release { printf("release pid %d comm %s\n", pid, 
> comm) }'
>
> Fetching the actual rbd_dev->open_count value is more involved but 
> also doable.

Excellent! Thanks!

tl;dr there's something other than the open_count causing the unmap 
failures - or something's elevating and decrementing open_count without 
going through rbd_open and rbd_release. Or perhaps there's some situation 
whereby bpftrace "misses" recording calls to rbd_open and rbd_release.

FYI, the production process is:

- create snapshot of rbd
- map
- mount with ro,norecovery,nouuid (the original live fs is still mounted)
- export via NFS
- mount on Windows NFS client
- process on Windows
- remove Windows NFS mount
- unexport from NFS
- unmount
- unmap

(I haven't mentioned the NFS export previously because I thought the 
issue was replicable without it - but that might simply have been due to 
the 'pvs' issue which has been resolved.)

I now have a script that mimics the above production sequence in a loop 
and left it running all night. Out of 288 iterations it had 13 instances 
where the unmap was failing for some time (i.e. in all cases it 
eventually succeeded, unlike the 51 rbd devices I can't seem to unmap at 
all without using --force). In the failing cases the unmap was retried 
at 1 second intervals. The shortest time taken to eventually unmap was 
521 seconds, the longest was 793 seconds.

Note, in the below I'm using "successful" for the tests where the first 
unmap succeeded, and "failed" for the tests where the first unmap 
failed, although in all cases the unmap eventually succeeded.

I ended up with a bpftrace script (see below) that logs the timestamp, 
open or release (O/R), pid, device name, open_count (at entry to the 
function), and process name.

A successful iteration of that process mostly looks like this:

Timestamp     O/R Pid    Device Count Process
18:21:18.235870 O 3269426 rbd29 0 mapper
18:21:20.088873 R 3269426 rbd29 1 mapper
18:21:20.089346 O 3269447 rbd29 0 systemd-udevd
18:21:20.105281 O 3269457 rbd29 1 blkid
18:21:31.858621 R 3269457 rbd29 2 blkid
18:21:31.861762 R 3269447 rbd29 1 systemd-udevd
18:21:31.882235 O 3269475 rbd29 0 mount
18:21:38.241808 R 3269475 rbd29 1 mount
18:21:38.242174 O 3269475 rbd29 0 mount
18:22:49.646608 O 2364320 rbd29 1 rpc.mountd
18:22:58.715634 R 2364320 rbd29 2 rpc.mountd
18:23:55.564512 R 3270060 rbd29 1 umount

Or occasionally it looks like this, with "rpc.mountd" disappearing:

18:35:49.539224 O 3277664 rbd29 0 mapper
18:35:50.515777 R 3277664 rbd29 1 mapper
18:35:50.516224 O 3277685 rbd29 0 systemd-udevd
18:35:50.531978 O 3277694 rbd29 1 blkid
18:35:57.361799 R 3277694 rbd29 2 blkid
18:35:57.365263 R 3277685 rbd29 1 systemd-udevd
18:35:57.384316 O 3277713 rbd29 0 mount
18:36:01.234337 R 3277713 rbd29 1 mount
18:36:01.234849 O 3277713 rbd29 0 mount
18:37:21.304270 R 3289527 rbd29 1 umount

Of the 288 iterations, only 20 didn't include the rpc.mountd lines.

An unsuccessful iteration looks like this:

18:37:31.885408 O 3294108 rbd29 0 mapper
18:37:33.181607 R 3294108 rbd29 1 mapper
18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
18:37:33.197982 O 3294691 rbd29 1 blkid
18:37:42.712870 R 3294691 rbd29 2 blkid
18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
18:37:42.738469 O 3298073 rbd29 0 mount
18:37:49.339012 R 3298073 rbd29 1 mount
18:37:49.339352 O 3298073 rbd29 0 mount
18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
18:53:56.054685 R 3313923 rbd29 1 init

According to my script log, the first unmap attempt was at 18:39:42, 
i.e. 42 seconds after rpc.mountd released the device. At that point 
the open_count was (or should have been?) 1 again, allowing the unmap to 
succeed - but it didn't. The unmap was retried every second until it 
eventually succeeded at 18:53:56, the same time as the mysterious "init" 
process ran - but also note there is NO "umount" process in there so I 
don't know if the name of the process recorded by bpftrace is simply 
incorrect (but how would that happen??) or what else could be going on.

All 13 of the failed iterations recorded that weird "init" instead of 
"umount".

12 of the 13 failed iterations included rpc.mountd in the trace, but one 
didn't (i.e. it went direct from mount to init/umount, like the 2nd 
successful example above), i.e. around the same proportion as the 
successful iterations.

So it seems there's something other than the open_count causing the unmap 
failures - or something's elevating and decrementing open_count without 
going through rbd_open and rbd_release. Or perhaps there's some situation 
whereby bpftrace "misses" recording calls to rbd_open and rbd_release.


The bpftrace script looks like this:
--------------------------------------------------------------------
//
// bunches of defines and structure definitions extracted from 
// drivers/block/rbd.c elided here...
//
kprobe:rbd_open {
   $bdev = (struct block_device *)arg0;
   $rbd_dev = (struct rbd_device *)($bdev->bd_disk->private_data);

   printf("%s O %d %s %lu %s\n",
     strftime("%T.%f", nsecs), pid, $rbd_dev->name,
     $rbd_dev->open_count, comm
   );
}
kprobe:rbd_release {
   $disk = (struct gendisk *)arg0;
   $rbd_dev = (struct rbd_device *)($disk->private_data);

   printf("%s R %d %s %lu %s\n",
     strftime("%T.%f", nsecs), pid, $rbd_dev->name,
     $rbd_dev->open_count, comm
   );
}
--------------------------------------------------------------------


Cheers,

Chris


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-21  1:36             ` Chris Dunlop
@ 2022-09-21 10:40               ` Ilya Dryomov
  2022-09-23  3:58                 ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-21 10:40 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi Ilya,
>
> On Mon, Sep 19, 2022 at 12:14:06PM +0200, Ilya Dryomov wrote:
> > On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>>> What can make a "rbd unmap" fail, assuming the device is not
> >>>> mounted and not (obviously) open by any other processes?
> >>
> >> E.g. maybe there's some way of using ebpf or similar to look at the
> >> 'rbd_dev->open_count' in the live kernel?
> >>
> >> And/or maybe there's some way, again using ebpf or similar, to record
> >> sufficient info (e.g. a stack trace?) from rbd_open() and
> >> rbd_release() to try to identify something that's opening the device
> >> and not releasing it?
> >
> > Attaching kprobes to rbd_open() and rbd_release() is probably the
> > fastest option.  I don't think you even need a stack trace, PID and
> > comm (process name) should do.  I would start with something like:
> >
> > # bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid,
> > comm) } kprobe:rbd_release { printf("release pid %d comm %s\n", pid,
> > comm) }'
> >
> > Fetching the actual rbd_dev->open_count value is more involved but
> > also doable.
>
> Excellent! Thanks!
>
> tl;dr there's something other than the open_count causing the unmap
> failures - or something's elevating and decrementing open_count without
> going through rbd_open and rbd_release. Or perhaps there's some situation
> whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
>
> FYI, the production process is:
>
> - create snapshot of rbd
> - map
> - mount with ro,norecovery,nouuid (the original live fs is still mounted)
> - export via NFS
> - mount on Windows NFS client
> - process on Windows
> - remove Windows NFS mount
> - unexport from NFS
> - unmount
> - unmap
>
> (I haven't mentioned the NFS export previously because I thought the
> issue was replicable without it - but that might simply have been due to
> the 'pvs' issue which has been resolved.)
>
> I now have a script that mimics the above production sequence in a loop
> and left it running all night. Out of 288 iterations it had 13 instances
> where the unmap was failing for some time (i.e. in all cases it
> eventually succeeded, unlike the 51 rbd devices I can't seem to unmap at
> all without using --force). In the failing cases the unmap was retried
> at 1 second intervals. The shortest time taken to eventually unmap was
> 521 seconds, the longest was 793 seconds.
>
> Note, in the below I'm using "successful" for the tests where the first
> unmap succeeded, and "failed" for the tests where the first unmap
> failed, although in all cases the unmap eventually succeeded.
>
> I ended up with a bpftrace script (see below) that logs the timestamp,
> open or release (O/R), pid, device name, open_count (at entry to the
> function), and process name.
>
> A successful iteration of that process mostly looks like this:
>
> Timestamp     O/R Pid    Device Count Process
> 18:21:18.235870 O 3269426 rbd29 0 mapper
> 18:21:20.088873 R 3269426 rbd29 1 mapper
> 18:21:20.089346 O 3269447 rbd29 0 systemd-udevd
> 18:21:20.105281 O 3269457 rbd29 1 blkid
> 18:21:31.858621 R 3269457 rbd29 2 blkid
> 18:21:31.861762 R 3269447 rbd29 1 systemd-udevd
> 18:21:31.882235 O 3269475 rbd29 0 mount
> 18:21:38.241808 R 3269475 rbd29 1 mount
> 18:21:38.242174 O 3269475 rbd29 0 mount
> 18:22:49.646608 O 2364320 rbd29 1 rpc.mountd
> 18:22:58.715634 R 2364320 rbd29 2 rpc.mountd
> 18:23:55.564512 R 3270060 rbd29 1 umount
>
> Or occasionally it looks like this, with "rpc.mountd" disappearing:
>
> 18:35:49.539224 O 3277664 rbd29 0 mapper
> 18:35:50.515777 R 3277664 rbd29 1 mapper
> 18:35:50.516224 O 3277685 rbd29 0 systemd-udevd
> 18:35:50.531978 O 3277694 rbd29 1 blkid
> 18:35:57.361799 R 3277694 rbd29 2 blkid
> 18:35:57.365263 R 3277685 rbd29 1 systemd-udevd
> 18:35:57.384316 O 3277713 rbd29 0 mount
> 18:36:01.234337 R 3277713 rbd29 1 mount
> 18:36:01.234849 O 3277713 rbd29 0 mount
> 18:37:21.304270 R 3289527 rbd29 1 umount
>
> Of the 288 iterations, only 20 didn't include the rpc.mountd lines.
>
> An unsuccessful iteration looks like this:
>
> 18:37:31.885408 O 3294108 rbd29 0 mapper
> 18:37:33.181607 R 3294108 rbd29 1 mapper
> 18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
> 18:37:33.197982 O 3294691 rbd29 1 blkid
> 18:37:42.712870 R 3294691 rbd29 2 blkid
> 18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
> 18:37:42.738469 O 3298073 rbd29 0 mount
> 18:37:49.339012 R 3298073 rbd29 1 mount
> 18:37:49.339352 O 3298073 rbd29 0 mount
> 18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
> 18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
> 18:53:56.054685 R 3313923 rbd29 1 init
>
> According to my script log, the first unmap attempt was at 18:39:42,
> i.e. 42 seconds after rpc.mountd released the device. At that point
> the open_count was (or should have been?) 1 again, allowing the unmap to
> succeed - but it didn't. The unmap was retried every second until it

Hi Chris,

For unmap to go through, open_count must be 0.  rpc.mountd at
18:39:00.989050 just decremented it from 2 to 1, it didn't release
the device.

> eventually succeeded at 18:53:56, the same time as the mysterious "init"
> process ran - but also note there is NO "umount" process in there so I
> don't know if the name of the process recorded by bpftrace is simply
> incorrect (but how would that happen??) or what else could be going on.

I would suggest adding the PID and the kernel stack trace at this
point.
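
Something like this, in the same style as the earlier one-liner
("kstack" is a bpftrace built-in; a reasonably recent bpftrace is
assumed), should capture it:

# bpftrace -e 'kprobe:rbd_release { printf("release pid %d comm %s\n%s\n",
pid, comm, kstack) }'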

>
> All 13 of the failed iterations recorded that weird "init" instead of
> "umount".

Yeah, that seems to be the culprit.

>
> 12 of the 13 failed iterations included rpc.mountd in the trace, but one
> didn't (i.e. it went direct from mount to init/umount, like the 2nd
> successful example above), i.e. around the same proportion as the
> successful iterations.
>
> So it seems there's something other than the open_count causing the unmap
> failures - or something's elevating and decrementing open_count without
> going through rbd_open and rbd_release. Or perhaps there's some situation
> whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
>
>
> The bpftrace script looks like this:
> --------------------------------------------------------------------
> //
> // bunches of defines and structure definitions extracted from
> // drivers/block/rbd.c elided here...
> //

It would be good to attach the entire script, just in case someone runs
into a similar issue in the future and tries to debug the same way.

Thanks,

                Ilya


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-21 10:40               ` Ilya Dryomov
@ 2022-09-23  3:58                 ` Chris Dunlop
  2022-09-23  9:47                   ` Ilya Dryomov
       [not found]                   ` <CANqTTH4dPibtJ_4ayDch5rKVG=ykGAJhWnCyWmG9vvm1zHEg1w@mail.gmail.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Chris Dunlop @ 2022-09-23  3:58 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 6321 bytes --]

Hi Ilya,

On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
> On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@onthe.net.au> wrote:
>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
>>> What can make a "rbd unmap" fail, assuming the device is not
>>> mounted and not (obviously) open by any other processes?

OK, I'm confident I now understand the cause of this problem. The 
particular machine where I'm mounting the rbd snapshots is also running 
some containerised ceph services. The ceph containers are 
(bind-)mounting the entire host filesystem hierarchy on startup, and if 
a ceph container happens to start up whilst a rbd device is mounted, the 
container also has the rbd mounted, preventing the host from unmapping 
the device even after the host has unmounted it. (More below.)

This brings up a couple of issues...

Why is the ceph container getting access to the entire host filesystem 
in the first place?

Even if I mount an rbd device with the "unbindable" mount option, which 
is specifically supposed to prevent bind mounts to that filesystem, the 
ceph containers still get the mount - how / why??

If the ceph containers really do need access to the entire host 
filesystem, perhaps it would be better to do a "slave" mount, so if/when 
the host unmounts a filesystem it's also unmounted in the container[s].  
(Of course this also means any filesystems newly mounted in the host 
would also appear in the containers - but that happens anyway if the 
container is newly started).

>> An unsuccessful iteration looks like this:
>>
>> 18:37:31.885408 O 3294108 rbd29 0 mapper
>> 18:37:33.181607 R 3294108 rbd29 1 mapper
>> 18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
>> 18:37:33.197982 O 3294691 rbd29 1 blkid
>> 18:37:42.712870 R 3294691 rbd29 2 blkid
>> 18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
>> 18:37:42.738469 O 3298073 rbd29 0 mount
>> 18:37:49.339012 R 3298073 rbd29 1 mount
>> 18:37:49.339352 O 3298073 rbd29 0 mount
>> 18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
>> 18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
>> 18:53:56.054685 R 3313923 rbd29 1 init
>>
>> According to my script log, the first unmap attempt was at 18:39:42,
>> i.e. 42 seconds after rpc.mountd released the device. At that point
>> the open_count was (or should have been?) 1 again, allowing the unmap to
>> succeed - but it didn't. The unmap was retried every second until it
>
> For unmap to go through, open_count must be 0.  rpc.mountd at
> 18:39:00.989050 just decremented it from 2 to 1, it didn't release
> the device.

Yes - but my poorly made point was that, per the normal test iteration, 
some time shortly after rpc.mountd decremented open_count to 1, an 
"umount" command was run successfully (the test would have aborted if 
the umount didn't succeed) - but the "umount" didn't show up in the 
bpftrace output. Immediately after the umount a "rbd unmap" was run, 
which failed with "busy" - i.e. the open_count was still incremented.

>> eventually succeeded at 18:53:56, the same time as the mysterious 
>> "init" process ran - but also note there is NO "umount" process in 
>> there so I don't know if the name of the process recorded by bpftrace 
>> is simply incorrect (but how would that happen??) or what else could 
>> be going on.

Using "ps" once the unmap starts failing, then cross checking against 
the process id recorded for the mysterious "init" in the bpftrace 
output, reveals the full command line for the "init" is:

/dev/init -- /usr/sbin/ceph-volume inventory --format=json-pretty --filter-for-batch

I.e. it's the 'init' process of a ceph-volume container that eventually 
releases the open_count.

After doing a lot of learning about ceph and containers (podman in this 
case) and namespaces etc. etc., the problem is now known...

Ceph containers are started with '-v "/:/rootfs"' which bind mounts the 
entire host's filesystem hierarchy into the container. Specifically, if 
the host has mounted filesystems, they're also mounted within the 
container when it starts up. So, if a ceph container starts up whilst 
there is a filesystem mounted from an rbd mapped device, the container 
also has that mount - and it retains the mount even if the filesystem is 
unmounted in the host. So the rbd device can't be unmapped in the host 
until the filesystem is released by the container, either via an explicit 
umount within the container, or a umount from the host targeting the 
container namespace, or the container exits.

This explains the mysterious 51 rbd devices that I haven't been able to 
unmap for a week: they're all mounted within long-running ceph containers 
that happened to start up whilst those 51 devices were all mounted 
somewhere.  I've now been able to unmap those devices after unmounting the 
filesystems within those containers using:

umount --namespace "${pid_of_container}" "${fs}"


------------------------------------------------------------
An example demonstrating the problem
------------------------------------------------------------
#
# Mount a snapshot, with "unbindable"
#
host# {
   rbd=pool/name@snap
   dev=$(rbd device map "${rbd}")
   declare -p dev
   mount -oro,norecovery,nouuid,unbindable "${dev}" "/mnt"
   echo --
   grep "${dev}" /proc/self/mountinfo
   echo --
   ls /mnt
   echo --
}
declare -- dev="/dev/rbd30"
--
1463 22 252:480 / /mnt ro unbindable - xfs /dev/rbd30 ro,nouuid,norecovery
--
file1 file2 file3

#
# The mount is still visible if we start a ceph container
#
host# cephadm shell
root@host:/# ls /mnt
file1 file2 file3

#
# The device is not unmappable from the host...
#
host# umount /mnt
host# rbd device unmap "${dev}"
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

#
# ...until we umount the filesystem within the container
#
#
host# lsns -t mnt
         NS TYPE NPROCS     PID USER             COMMAND
4026533050 mnt       2 3105356 root             /dev/init -- bash
host# umount --namespace 3105356 /mnt
host# rbd device unmap "${dev}"
   ## success
------------------------------------------------------------
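
For the record, here is a sketch of how to find every process whose
mount namespace still has the device mounted (the "umount --namespace"
step above then releases it):

------------------------------------------------------------
grep -l "${dev}" /proc/[0-9]*/mountinfo 2>/dev/null |
while read -r mi; do
   pid=${mi#/proc/}; pid=${pid%%/*}
   printf '%s\t%s\n' "${pid}" "$(tr '\0' ' ' < /proc/${pid}/cmdline)"
   # umount --namespace "${pid}" /mnt   # as above, to release it
done
------------------------------------------------------------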


>> The bpftrace script looks like this:
>
> It would be good to attach the entire script, just in case someone runs
> into a similar issue in the future and tries to debug the same way.

Attached.

Cheers,

Chris

[-- Attachment #2: rbd-open-release.bpf --]
[-- Type: text/plain, Size: 4345 bytes --]

#!/usr/bin/bpftrace
/*
 * log rbd opens and releases
 *
 * run like:
 *
 * bpftrace -I /lib/modules/$(uname -r)/source/drivers/block -I /lib/modules/$(uname -r)/build this_script
 *
 * This assumes you have the appropriate linux source and build
 * artifacts available on the machine where you're running bpftrace.
 *
 * Note:
 *   https://github.com/iovisor/bpftrace/pull/2315
 *   BTF for kernel modules
 *
 * Once that lands in your local bpftrace you hopefully don't need the linux
 * source and build stuff, nor the 'extracted' stuff below, and you should be
 * able to simply run this script like:
 *
 * chmod +x ./this_script
 * ./this_script
 *   
 */

////////////////////////////////////////////////////////////
// extracted from
//   linux/drivers/block/rbd.c
//
#include <linux/ceph/osdmap.h>

#include <linux/kernel.h>
#include <linux/device.h>
#include <linux/blk-mq.h>

#include "rbd_types.h"

/*
 * An RBD device name will be "rbd#", where the "rbd" comes from
 * RBD_DRV_NAME above, and # is a unique integer identifier.
 */
#define DEV_NAME_LEN		32

/*
 * block device image metadata (in-memory version)
 */
struct rbd_image_header {
	/* These six fields never change for a given rbd image */
	char *object_prefix;
	__u8 obj_order;
	u64 stripe_unit;
	u64 stripe_count;
	s64 data_pool_id;
	u64 features;		/* Might be changeable someday? */

	/* The remaining fields need to be updated occasionally */
	u64 image_size;
	struct ceph_snap_context *snapc;
	char *snap_names;	/* format 1 only */
	u64 *snap_sizes;	/* format 1 only */
};

enum rbd_watch_state {
	RBD_WATCH_STATE_UNREGISTERED,
	RBD_WATCH_STATE_REGISTERED,
	RBD_WATCH_STATE_ERROR,
};

enum rbd_lock_state {
	RBD_LOCK_STATE_UNLOCKED,
	RBD_LOCK_STATE_LOCKED,
	RBD_LOCK_STATE_RELEASING,
};

/* WatchNotify::ClientId */
struct rbd_client_id {
	u64 gid;
	u64 handle;
};

struct rbd_mapping {
	u64                     size;
};

/*
 * a single device
 */
struct rbd_device {
	int			dev_id;		/* blkdev unique id */

	int			major;		/* blkdev assigned major */
	int			minor;
	struct gendisk		*disk;		/* blkdev's gendisk and rq */

	u32			image_format;	/* Either 1 or 2 */
	struct rbd_client	*rbd_client;

	char			name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */

	spinlock_t		lock;		/* queue, flags, open_count */

	struct rbd_image_header	header;
	unsigned long		flags;		/* possibly lock protected */
	struct rbd_spec		*spec;
	struct rbd_options	*opts;
	char			*config_info;	/* add{,_single_major} string */

	struct ceph_object_id	header_oid;
	struct ceph_object_locator header_oloc;

	struct ceph_file_layout	layout;		/* used for all rbd requests */

	struct mutex		watch_mutex;
	enum rbd_watch_state	watch_state;
	struct ceph_osd_linger_request *watch_handle;
	u64			watch_cookie;
	struct delayed_work	watch_dwork;

	struct rw_semaphore	lock_rwsem;
	enum rbd_lock_state	lock_state;
	char			lock_cookie[32];
	struct rbd_client_id	owner_cid;
	struct work_struct	acquired_lock_work;
	struct work_struct	released_lock_work;
	struct delayed_work	lock_dwork;
	struct work_struct	unlock_work;
	spinlock_t		lock_lists_lock;
	struct list_head	acquiring_list;
	struct list_head	running_list;
	struct completion	acquire_wait;
	int			acquire_err;
	struct completion	releasing_wait;

	spinlock_t		object_map_lock;
	u8			*object_map;
	u64			object_map_size;	/* in objects */
	u64			object_map_flags;

	struct workqueue_struct	*task_wq;

	struct rbd_spec		*parent_spec;
	u64			parent_overlap;
	atomic_t		parent_ref;
	struct rbd_device	*parent;

	/* Block layer tags. */
	struct blk_mq_tag_set	tag_set;

	/* protects updating the header */
	struct rw_semaphore     header_rwsem;

	struct rbd_mapping	mapping;

	struct list_head	node;

	/* sysfs related */
	struct device		dev;
	unsigned long		open_count;	/* protected by lock */
};

//
// end of extraction
////////////////////////////////////////////////////////////

kprobe:rbd_open {
  $bdev = (struct block_device *)arg0;
  $rbd_dev = (struct rbd_device *)($bdev->bd_disk->private_data);

  printf("%s O %d %s %lu %s\n",
    strftime("%T.%f", nsecs), pid, $rbd_dev->name, $rbd_dev->open_count, comm
  );
}

kprobe:rbd_release {
  $disk = (struct gendisk *)arg0;
  $rbd_dev = (struct rbd_device *)($disk->private_data);

  printf("%s R %d %s %lu %s\n",
    strftime("%T.%f", nsecs), pid, $rbd_dev->name, $rbd_dev->open_count, comm
  );
}


* Re: rbd unmap fails with "Device or resource busy"
  2022-09-23  3:58                 ` Chris Dunlop
@ 2022-09-23  9:47                   ` Ilya Dryomov
  2022-09-28  0:22                     ` Chris Dunlop
       [not found]                   ` <CANqTTH4dPibtJ_4ayDch5rKVG=ykGAJhWnCyWmG9vvm1zHEg1w@mail.gmail.com>
  1 sibling, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-23  9:47 UTC (permalink / raw)
  To: Chris Dunlop, Adam King, Guillaume Abrioux; +Cc: ceph-devel

On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi Ilya,
>
> On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
> > On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>> What can make a "rbd unmap" fail, assuming the device is not
> >>> mounted and not (obviously) open by any other processes?
>
> OK, I'm confident I now understand the cause of this problem. The
> particular machine where I'm mounting the rbd snapshots is also running
> some containerised ceph services. The ceph containers are
> (bind-)mounting the entire host filesystem hierarchy on startup, and if
> a ceph container happens to start up whilst a rbd device is mounted, the
> container also has the rbd mounted, preventing the host from unmapping
> the device even after the host has unmounted it. (More below.)
>
> This brings up a couple of issues...
>
> Why is the ceph container getting access to the entire host filesystem
> in the first place?
>
> Even if I mount an rbd device with the "unbindable" mount option, which
> is specifically supposed to prevent bind mounts to that filesystem, the
> ceph containers still get the mount - how / why??
>
> If the ceph containers really do need access to the entire host
> filesystem, perhaps it would be better to do a "slave" mount, so if/when
> the host unmounts a filesystem it's also unmounted in the container[s].
> (Of course this also means any filesystems newly mounted in the host
> would also appear in the containers - but that happens anyway if the
> container is newly started).
>
> >> An unsuccessful iteration looks like this:
> >>
> >> 18:37:31.885408 O 3294108 rbd29 0 mapper
> >> 18:37:33.181607 R 3294108 rbd29 1 mapper
> >> 18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
> >> 18:37:33.197982 O 3294691 rbd29 1 blkid
> >> 18:37:42.712870 R 3294691 rbd29 2 blkid
> >> 18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
> >> 18:37:42.738469 O 3298073 rbd29 0 mount
> >> 18:37:49.339012 R 3298073 rbd29 1 mount
> >> 18:37:49.339352 O 3298073 rbd29 0 mount
> >> 18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
> >> 18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
> >> 18:53:56.054685 R 3313923 rbd29 1 init
> >>
> >> According to my script log, the first unmap attempt was at 18:39:42,
> >> i.e. 42 seconds after rpc.mountd released the device. At that point
> >> the open_count was (or should have been?) 1 again, allowing the unmap to
> >> succeed - but it didn't. The unmap was retried every second until it
> >
> > For unmap to go through, open_count must be 0.  rpc.mountd at
> > 18:39:00.989050 just decremented it from 2 to 1, it didn't release
> > the device.
>
> Yes - but my poorly made point was that, per the normal test iteration,
> some time shortly after rpc.mountd decremented open_count to 1, an
> "umount" command was run successfully (the test would have aborted if
> the umount didn't succeed) - but the "umount" didn't show up in the
> bpftrace output. Immediately after the umount a "rbd unmap" was run,
> which failed with "busy" - i.e. the open_count was still incremented.
>
> >> eventually succeeded at 18:53:56, the same time as the mysterious
> >> "init" process ran - but also note there is NO "umount" process in
> >> there so I don't know if the name of the process recorded by bpftrace
> >> is simply incorrect (but how would that happen??) or what else could
> >> be going on.
>
> Using "ps" once the unmap starts failing, then cross checking against
> the process id recorded for the mysterious "init" in the bpftrace
> output, reveals the full command line for the "init" is:
>
> /dev/init -- /usr/sbin/ceph-volume inventory --format=json-pretty --filter-for-batch
>
> I.e. it's the 'init' process of a ceph-volume container that eventually
> releases the open_count.
>
> After doing a lot of learning about ceph and containers (podman in this
> case) and namespaces etc. etc., the problem is now known...
>
> Ceph containers are started with '-v "/:/rootfs"' which bind mounts the
> entire host's filesystem hierarchy into the container. Specifically, if
> the host has mounted filesystems, they're also mounted within the
> container when it starts up. So, if a ceph container starts up whilst
> there is a filesystem mounted from an rbd mapped device, the container
> also has that mount - and it retains the mount even if the filesystem is
> unmounted in the host. So the rbd device can't be unmapped in the host
> until the filesystem is released by the container, either via an explicit
> umount within the container, or a umount from the host targeting the
> container namespace, or the container exits.
>
> This explains the mysterious 51 rbd devices that I haven't been able to
> unmap for a week: they're all mounted within long-running ceph containers
> that happened to start up whilst those 51 devices were all mounted
> somewhere.  I've now been able to unmap those devices after unmounting the
> filesystems within those containers using:
>
> umount --namespace "${pid_of_container}" "${fs}"
>
>
> ------------------------------------------------------------
> An example demonstrating the problem
> ------------------------------------------------------------
> #
> # Mount a snapshot, with "unbindable"
> #
> host# {
>    rbd=pool/name@snap
>    dev=$(rbd device map "${rbd}")
>    declare -p dev
>    mount -oro,norecovery,nouuid,unbindable "${dev}" "/mnt"
>    echo --
>    grep "${dev}" /proc/self/mountinfo
>    echo --
>    ls /mnt
>    echo --
> }
> declare -- dev="/dev/rbd30"
> --
> 1463 22 252:480 / /mnt ro unbindable - xfs /dev/rbd30 ro,nouuid,norecovery
> --
> file1 file2 file3
>
> #
> # The mount is still visible if we start a ceph container
> #
> host# cephadm shell
> root@host:/# ls /mnt
> file1 file2 file3
>
> #
> # The device is not unmappable from the host...
> #
> host# umount /mnt
> host# rbd device unmap "${dev}"
> rbd: sysfs write failed
> rbd: unmap failed: (16) Device or resource busy
>
> #
> # ...until we umount the filesystem within the container
> #
> #
> host# lsns -t mnt
>          NS TYPE NPROCS     PID USER             COMMAND
> 4026533050 mnt       2 3105356 root             /dev/init -- bash
> host# umount --namespace 3105356 /mnt
> host# rbd device unmap "${dev}"
>    ## success
> ------------------------------------------------------------

Hi Chris,

Thanks for the great analysis!  I think ceph-volume container does
it because of [1].  I'm not sure about "cephadm shell".  There is also
node-exporter container that needs access to the host for gathering
metrics.

I'm adding Adam (cephadm maintainer) and Guillaume (ceph-volume
maintainer) as this is something that clearly wasn't intended.

[1] https://tracker.ceph.com/issues/52926

                Ilya

>
>
> >> The bpftrace script looks like this:
> >
> > It would be good to attach the entire script, just in case someone runs
> > into a similar issue in the future and tries to debug the same way.
>
> Attached.
>
> Cheers,
>
> Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: rbd unmap fails with "Device or resource busy"
       [not found]                   ` <CANqTTH4dPibtJ_4ayDch5rKVG=ykGAJhWnCyWmG9vvm1zHEg1w@mail.gmail.com>
@ 2022-09-27 10:55                     ` Ilya Dryomov
  0 siblings, 0 replies; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-27 10:55 UTC (permalink / raw)
  To: Guillaume Abrioux; +Cc: Chris Dunlop, ceph-devel

On Fri, Sep 23, 2022 at 3:06 PM Guillaume Abrioux <gabrioux@redhat.com> wrote:
>
> Hi Chris,
>
> On Fri, 23 Sept 2022 at 05:59, Chris Dunlop <chris@onthe.net.au> wrote:
>>
>>
>> If the ceph containers really do need access to the entire host
>> filesystem, perhaps it would be better to do a "slave" mount,
>
>
> Yes, I think a mount with 'slave' propagation should fix your issue.
> I plan to do some tests next week and work on a patch.

Hi Guillaume,

I wanted to share an observation that there seem to be two cases here:
actual containers (e.g. an OSD container) and "cephadm shell" which is
technically also a container but may be regarded by users as a shell
("window") with some binaries and configuration files injected into it.

For the former, a unidirectional propagation such that when something
is unmounted on the host it is also unmounted in the container is all
that is needed.  However, for the latter, a bidirectional propagation
such that when something is mounted in this shell it is also mounted on
the host (and therefore in all other windows) seems desirable.

What do you think about going with MS_SLAVE for the former and MS_SHARED
for the latter?
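
For concreteness (this is only a sketch using a stock alpine image, not
the actual invocations cephadm generates), both propagation modes map
directly onto the bind mount options that podman already accepts:

  # MS_SLAVE flavour: unmounts on the host propagate into the container,
  # but nothing mounted inside the container leaks back to the host
  podman run --rm -it -v /:/rootfs:rslave docker.io/library/alpine sh

  # MS_SHARED flavour: mounts made inside the container also show up on
  # the host (needs a privileged container, and / on the host must have
  # shared propagation, which is the systemd default)
  podman run --rm -it --privileged -v /:/rootfs:rshared \
      docker.io/library/alpine sh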

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: rbd unmap fails with "Device or resource busy"
  2022-09-23  9:47                   ` Ilya Dryomov
@ 2022-09-28  0:22                     ` Chris Dunlop
  2022-09-29 11:14                       ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-09-28  0:22 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Adam King, Guillaume Abrioux, ceph-devel

Hi all,

On Fri, Sep 23, 2022 at 11:47:11AM +0200, Ilya Dryomov wrote:
> On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@onthe.net.au> wrote:
>> On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
>>> On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@onthe.net.au> wrote:
>>>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
>>>>> What can make a "rbd unmap" fail, assuming the device is not 
>>>>> mounted and not (obviously) open by any other processes?
>>
>> OK, I'm confident I now understand the cause of this problem. The 
>> particular machine where I'm mounting the rbd snapshots is also 
>> running some containerised ceph services. The ceph containers are 
>> (bind-)mounting the entire host filesystem hierarchy on startup, and 
>> if a ceph container happens to start up whilst a rbd device is 
>> mounted, the container also has the rbd mounted, preventing the host 
>> from unmapping the device even after the host has unmounted it. (More 
>> below.)
>>
>> This brings up a couple of issues...
>>
>> Why is the ceph container getting access to the entire host 
>> filesystem in the first place?
>>
>> Even if I mount an rbd device with the "unbindable" mount option, 
>> which is specifically supposed to prevent bind mounts to that 
>> filesystem, the ceph containers still get the mount - how / why??
>>
>> If the ceph containers really do need access to the entire host 
>> filesystem, perhaps it would be better to do a "slave" mount, so 
>> if/when the host unmounts a filesystem it's also unmounted in the 
>> container[s].  (Of course this also means any filesystems newly 
>> mounted in the host would also appear in the containers - but that 
>> happens anyway if the container is newly started).
>
> Thanks for the great analysis!  I think ceph-volume container does it 
> because of [1].  I'm not sure about "cephadm shell".  There is also
> node-exporter container that needs access to the host for gathering 
> metrics.
>
> [1] https://tracker.ceph.com/issues/52926

I'm guessing ceph-volume may need to see the host mounts so it can 
detect a disk is being used. Could this also be done in the host (like 
issue 52926 says is being done with pv/vg/lv commands), removing the 
need to have the entire host filesystem hierarchy available in the 
container?

Similarly, I would have thought the node-exporter container only needs 
access to ceph-specific files/directories rather than the whole system.

On Tue, Sep 27, 2022 at 12:55:37PM +0200, Ilya Dryomov wrote:
> On Fri, Sep 23, 2022 at 3:06 PM Guillaume Abrioux <gabrioux@redhat.com> wrote:
>> On Fri, 23 Sept 2022 at 05:59, Chris Dunlop <chris@onthe.net.au> wrote:
>>> If the ceph containers really do need access to the entire host 
>>> filesystem, perhaps it would be better to do a "slave" mount,
>>
>> Yes, I think a mount with 'slave' propagation should fix your issue.  
>> I plan to do some tests next week and work on a patch.

Thanks Guillaume.

> I wanted to share an observation that there seem to be two cases here: 
> actual containers (e.g. an OSD container) and cephadm shell which is 
> technically also a container but may be regarded by users as a shell 
> ("window") with some binaries and configuration files injected into 
> it.

For my part I don't see or use a cephadm shell as a normal shell with 
additional stuff injected. At the very least the host root filesystem 
location has changed to /rootfs so it's obviously not a standard shell.

In fact I was quite surprised that the rootfs and all the other mounts 
unrelated to ceph were available at all. I'm still not convinced it's a 
good idea.

In my conception a cephadm shell is a mini virtual machine specifically 
for inspecting and managing ceph specific areas *only*.

I guess it's really a difference of philosophy. I only use cephadm shell 
when I'm explicitly needing to do something with ceph, and I drop back 
out of the cephadm shell (and its associated privileges!) as soon as I'm 
done with that specific task. For everything else I'll be in my 
(non-privileged) host shell. I can imagine (although I must say I'd be 
surprised) that others may use the cephadm shell as a matter of course, 
for managing the whole machine? Then again, given issue 52926 quoted 
above, it sounds like that would be a bad idea if, for instance, the lvm 
commands should NOT be run in the container "in order to avoid lvm metadata 
corruption" - i.e. it's not safe to assume a cephadm shell is a normal 
shell.

I would argue the goal should be to remove access to the general host 
filesystem(s) from the ceph containers altogether where possible.

I'll also admit that, generally, it's probably a bad idea to be doing 
things unrelated to ceph on a box hosting ceph. But that's the way this 
particular system has grown and unfortunately it will take quite a bit 
of time, effort, and expense to change this now.

> For the former, a unidirectional propagation such that when something 
> is unmounted on the host it is also unmounted in the container is all 
> that is needed.  However, for the latter, a bidirectional propagation 
> such that when something is mounted in this shell it is also mounted 
> on the host (and therefore in all other windows) seems desirable.
>
> What do you think about going with MS_SLAVE for the former and 
> MS_SHARED for the latter?

Personally I would find it surprising and unexpected (i.e. potentially a 
source of trouble) for mount changes done in a container (including a 
"shell" container) to affect the host. But again, that may be that 
difference of philosophy regarding the cephadm shell mentioned above.

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: rbd unmap fails with "Device or resource busy"
  2022-09-28  0:22                     ` Chris Dunlop
@ 2022-09-29 11:14                       ` Ilya Dryomov
  2022-09-30  0:04                         ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-29 11:14 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Adam King, Guillaume Abrioux, ceph-devel

On Wed, Sep 28, 2022 at 2:22 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi all,
>
> On Fri, Sep 23, 2022 at 11:47:11AM +0200, Ilya Dryomov wrote:
> > On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >> On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
> >>> On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >>>>> What can make a "rbd unmap" fail, assuming the device is not
> >>>>> mounted and not (obviously) open by any other processes?
> >>
> >> OK, I'm confident I now understand the cause of this problem. The
> >> particular machine where I'm mounting the rbd snapshots is also
> >> running some containerised ceph services. The ceph containers are
> >> (bind-)mounting the entire host filesystem hierarchy on startup, and
> >> if a ceph container happens to start up whilst a rbd device is
> >> mounted, the container also has the rbd mounted, preventing the host
> >> from unmapping the device even after the host has unmounted it. (More
> >> below.)
> >>
> >> This brings up a couple of issues...
> >>
> >> Why is the ceph container getting access to the entire host
> >> filesystem in the first place?
> >>
> >> Even if I mount an rbd device with the "unbindable" mount option,
> >> which is specifically supposed to prevent bind mounts to that
> >> filesystem, the ceph containers still get the mount - how / why??
> >>
> >> If the ceph containers really do need access to the entire host
> >> filesystem, perhaps it would be better to do a "slave" mount, so
> >> if/when the host unmounts a filesystem it's also unmounted in the
> >> container[s].  (Of course this also means any filesystems newly
> >> mounted in the host would also appear in the containers - but that
> >> happens anyway if the container is newly started).
> >
> > Thanks for the great analysis!  I think ceph-volume container does it
> > because of [1].  I'm not sure about "cephadm shell".  There is also
> > node-exporter container that needs access to the host for gathering
> > metrics.
> >
> > [1] https://tracker.ceph.com/issues/52926
>
> I'm guessing ceph-volume may need to see the host mounts so it can
> detect a disk is being used. Could this also be done in the host (like
> issue 52926 says is being done with pv/vg/lv commands), removing the
> need to have the entire host filesystem hierarchy available in the
> container?
>
> Similarly, I would have thought the node-exporter container only needs
> access to ceph-specific files/directories rather than the whole system.
>
> On Tue, Sep 27, 2022 at 12:55:37PM +0200, Ilya Dryomov wrote:
> > On Fri, Sep 23, 2022 at 3:06 PM Guillaume Abrioux <gabrioux@redhat.com> wrote:
> >> On Fri, 23 Sept 2022 at 05:59, Chris Dunlop <chris@onthe.net.au> wrote:
> >>> If the ceph containers really do need access to the entire host
> >>> filesystem, perhaps it would be better to do a "slave" mount,
> >>
> >> Yes, I think a mount with 'slave' propagation should fix your issue.
> >> I plan to do some tests next week and work on a patch.
>
> Thanks Guillaume.
>
> > I wanted to share an observation that there seem to be two cases here:
> > actual containers (e.g. an OSD container) and cephadm shell which is
> > technically also a container but may be regarded by users as a shell
> > ("window") with some binaries and configuration files injected into
> > it.
>
> For my part I don't see or use a cephadm shell as a normal shell with
> additional stuff injected. At the very least the host root filesystem
> location has changed to /rootfs so it's obviously not a standard shell.
>
> In fact I was quite surprised that the rootfs and all the other mounts
> unrelated to ceph were available at all. I'm still not convinced it's a
> good idea.
>
> In my conception a cephadm shell is a mini virtual machine specifically
> for inspecting and managing ceph specific areas *only*.
>
> I guess it's really a difference of philosophy. I only use cephadm shell
> when I'm explicitly needing to do something with ceph, and I drop back
> out of the cephadm shell (and its associated privileges!) as soon as I'm
> done with that specific task. For everything else I'll be in my
> (non-privileged) host shell. I can imagine (although I must say I'd be
> surprised) that others may use the cephadm shell as a matter of course,
> for managing the whole machine? Then again, given issue 52926 quoted
> above, it sounds like that would be a bad idea if, for instance, the lvm
> commands should NOT be run in the container "in order to avoid lvm metadata
> corruption" - i.e. it's not safe to assume a cephadm shell is a normal
> shell.
>
> I would argue the goal should be to remove access to the general host
> filesystem(s) from the ceph containers altogether where possible.
>
> I'll also admit that, generally, it's probably a bad idea to be doing
> things unrelated to ceph on a box hosting ceph. But that's the way this
> particular system has grown and unfortunately it will take quite a bit
> of time, effort, and expense to change this now.
>
> > For the former, a unidirectional propagation such that when something
> > is unmounted on the host it is also unmounted in the container is all
> > that is needed.  However, for the latter, a bidirectional propagation
> > such that when something is mounted in this shell it is also mounted
> > on the host (and therefore in all other windows) seems desirable.
> >
> > What do you think about going with MS_SLAVE for the former and
> > MS_SHARED for the latter?
>
> Personally I would find it surprising and unexpected (i.e. potentially a
> source of trouble) for mount changes done in a container (including a
> "shell" container) to affect the host. But again, that may be that
> difference of philosophy regarding the cephadm shell mentioned above.

Hi Chris,

Right, I see your point, particularly around /rootfs location making it
obvious that it's not a standard shell.  I don't have a strong opinion
here, ultimately the fix is up to Adam and Guillaume (although I would
definitely prefer a set of targeted mounts over a blanket -v /:/rootfs
mount, whether slave or not).
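
Just to illustrate what I mean (the paths below are purely illustrative;
the actual set each daemon needs is something Adam and Guillaume would
know much better than me), something along these lines:

  # instead of the blanket  -v /:/rootfs  bind mount
  podman run --rm -it \
      -v /var/lib/ceph:/var/lib/ceph \
      -v /var/log/ceph:/var/log/ceph \
      -v /dev:/dev \
      -v /run/udev:/run/udev \
      docker.io/library/alpine sh

That way an unrelated host mount (such as an XFS-on-rbd snapshot) never
ends up pinned inside a ceph container in the first place.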

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: rbd unmap fails with "Device or resource busy"
  2022-09-29 11:14                       ` Ilya Dryomov
@ 2022-09-30  0:04                         ` Chris Dunlop
  2022-09-30 13:26                           ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-09-30  0:04 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Adam King, Guillaume Abrioux, ceph-devel

Hi all,

On Thu, Sep 29, 2022 at 01:14:17PM +0200, Ilya Dryomov wrote:
> On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@onthe.net.au> wrote:
>> Why is the ceph container getting access to the entire host
>> filesystem in the first place?
...
> Right, I see your point, particularly around /rootfs location making it
> obvious that it's not a standard shell.  I don't have a strong opinion
> here, ultimately the fix is up to Adam and Guillaume (although I would
> definitely prefer a set of targeted mounts over a blanket -v /:/rootfs
> mount, whether slave or not).

Perhaps this topic should be raised at a team meeting or however project 
directions are managed - i.e. whether to keep the blanket mount of the 
entire host filesystem, or whether the containers should aim for the 
minimal filesystem access required to run. If such a discussion were to 
take place I think the general safety principles around providing 
minimum-privilege access should be noted.


Cheers,

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: rbd unmap fails with "Device or resource busy"
  2022-09-30  0:04                         ` Chris Dunlop
@ 2022-09-30 13:26                           ` Ilya Dryomov
  0 siblings, 0 replies; 16+ messages in thread
From: Ilya Dryomov @ 2022-09-30 13:26 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Adam King, Guillaume Abrioux, ceph-devel

On Fri, Sep 30, 2022 at 2:04 AM Chris Dunlop <chris@onthe.net.au> wrote:
>
> Hi all,
>
> On Thu, Sep 29, 2022 at 01:14:17PM +0200, Ilya Dryomov wrote:
> > On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@onthe.net.au> wrote:
> >> Why is the ceph container getting access to the entire host
> >> filesystem in the first place?
> ...
> > Right, I see your point, particularly around /rootfs location making it
> > obvious that it's not a standard shell.  I don't have a strong opinion
> > here, ultimately the fix is up to Adam and Guillaume (although I would
> > definitely prefer a set of targeted mounts over a blanket -v /:/rootfs
> > mount, whether slave or not).
>
> Perhaps this topic should be raised at a team meeting or however project
> directions are managed - i.e. whether to keep the blanket mount of the
> entire host filesystem, or whether the containers should aim for the
> minimal filesystem access required to run. If such a discussion were to
> take place I think the general safety principles around providing
> minimum-privilege access should be noted.

Indeed.  I added this as a topic for the upcoming Ceph Developer
Monthly meeting [1].

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/VDV5YVZSLFMUAAUI2NBZMYSKCFRC5AIV/

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-09-30 13:27 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-13  1:20 rbd unmap fails with "Device or resource busy" Chris Dunlop
2022-09-13 11:43 ` Ilya Dryomov
2022-09-14  3:49   ` Chris Dunlop
2022-09-14  8:41     ` Ilya Dryomov
2022-09-15  8:29       ` Chris Dunlop
2022-09-19  7:43         ` Chris Dunlop
2022-09-19 10:14           ` Ilya Dryomov
2022-09-21  1:36             ` Chris Dunlop
2022-09-21 10:40               ` Ilya Dryomov
2022-09-23  3:58                 ` Chris Dunlop
2022-09-23  9:47                   ` Ilya Dryomov
2022-09-28  0:22                     ` Chris Dunlop
2022-09-29 11:14                       ` Ilya Dryomov
2022-09-30  0:04                         ` Chris Dunlop
2022-09-30 13:26                           ` Ilya Dryomov
     [not found]                   ` <CANqTTH4dPibtJ_4ayDch5rKVG=ykGAJhWnCyWmG9vvm1zHEg1w@mail.gmail.com>
2022-09-27 10:55                     ` Ilya Dryomov
