* Race-free block device opening
@ 2022-04-26 18:12 Demi Marie Obenour
  2022-04-26 18:35 ` Greg Kroah-Hartman
  2022-04-27 13:29 ` James Bottomley
  0 siblings, 2 replies; 6+ messages in thread
From: Demi Marie Obenour @ 2022-04-26 18:12 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Kernel Mailing List, Linux Block Mailing List,
	Linux Filesystem Mailing List

Right now, opening block devices in a race-free way is incredibly hard.
The only reasonable approach I know of is sd_device_new_from_path() +
sd_device_open(), and is only available in systemd git main.  It also
requires waiting on systemd-udev to have processed udev rules, which can
be a bottleneck.  There are better approaches in various special cases,
such as using device-mapper ioctls to check that the device one has
opened still has the name and/or UUID one expects.  However, none of
them works for a plain call to open(2).

A much better approach would be for udev to point its symlinks at
"/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
"/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.  A
filesystem would then be mounted at "/dev/disk/by-diskseq" that provides
for race-free opening of these paths.  This could be implemented in
userspace using FUSE, either with difficulty using the current kernel
API, or easily and efficiently using a new kernel API for opening a
block device by diskseq + partition.  However, I think this should be
handled by the Linux kernel itself.

What would be necessary to get this into the kernel?  I would like to
implement this, but I don’t have the time to do so anytime soon.  Is
anyone else interested in taking this on?  I suspect the kernel code
needed to implement this would be quite a bit smaller than the FUSE
implementation.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Race-free block device opening
  2022-04-26 18:12 Race-free block device opening Demi Marie Obenour
@ 2022-04-26 18:35 ` Greg Kroah-Hartman
  2022-04-26 21:31   ` Demi Marie Obenour
  2022-04-27 13:29 ` James Bottomley
  1 sibling, 1 reply; 6+ messages in thread
From: Greg Kroah-Hartman @ 2022-04-26 18:35 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Linux Kernel Mailing List, Linux Block Mailing List,
	Linux Filesystem Mailing List

On Tue, Apr 26, 2022 at 02:12:22PM -0400, Demi Marie Obenour wrote:
> Right now, opening block devices in a race-free way is incredibly hard.
> The only reasonable approach I know of is sd_device_new_from_path() +
> sd_device_open(), and is only available in systemd git main.  It also
> requires waiting on systemd-udev to have processed udev rules, which can
> be a bottleneck.  There are better approaches in various special cases,
> such as using device-mapper ioctls to check that the device one has
> opened still has the name and/or UUID one expects.  However, none of
> them works for a plain call to open(2).

Why do you call open(2) on a block device?

> A much better approach would be for udev to point its symlinks at
> "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.

You can do that today with udev rules, right?

> A
> filesystem would then be mounted at "/dev/disk/by-diskseq" that provides
> for race-free opening of these paths.

How would it be any less race-free than just open("/dev/sda1") is?

> This could be implemented in
> userspace using FUSE, either with difficulty using the current kernel
> API, or easily and efficiently using a new kernel API for opening a
> block device by diskseq + partition.  However, I think this should be
> handled by the Linux kernel itself.
> 
> What would be necessary to get this into the kernel?

Get what exactly?  I don't see anything the kernel needs to do here
specifically.  Normally block devices are accessed using mount(2), not
open(2).  Do you want a new mount(2)-type api?

thanks,

greg k-h


* Re: Race-free block device opening
  2022-04-26 18:35 ` Greg Kroah-Hartman
@ 2022-04-26 21:31   ` Demi Marie Obenour
  2022-04-26 22:07     ` Demi Marie Obenour
  0 siblings, 1 reply; 6+ messages in thread
From: Demi Marie Obenour @ 2022-04-26 21:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Kernel Mailing List, Linux Block Mailing List,
	Linux Filesystem Mailing List

On Tue, Apr 26, 2022 at 08:35:34PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Apr 26, 2022 at 02:12:22PM -0400, Demi Marie Obenour wrote:
> > Right now, opening block devices in a race-free way is incredibly hard.
> > The only reasonable approach I know of is sd_device_new_from_path() +
> > sd_device_open(), and is only available in systemd git main.  It also
> > requires waiting on systemd-udev to have processed udev rules, which can
> > be a bottleneck.  There are better approaches in various special cases,
> > such as using device-mapper ioctls to check that the device one has
> > opened still has the name and/or UUID one expects.  However, none of
> > them works for a plain call to open(2).
> 
> Why do you call open(2) on a block device?

There are many reasons to do so:

- Some programs invoke ioctls on the block device FD.
- Some programs perform I/O using a block device (or a partition)
  directly.  mkfs, fsck, dd, lvm, cryptsetup, and Ceph all fall in this
  category.
- Some programs need to use the block device’s major and minor numbers
  in device-mapper ioctls, and need to make sure that the major and
  minor number won’t be recycled behind their back.
- Some programs need to assign the device to a virtual machine.

> > A much better approach would be for udev to point its symlinks at
> > "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> > "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.
> 
> You can do that today with udev rules, right?

One can make udev create a symlink with that path pointing to the kernel
device name, but not make udev’s other symlinks point to that path.  It
is also still necessary to check (with BLKGETDISKSEQ) that the device
one opened is what one intended to open.
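
For illustration, the post-open check mentioned above could be sketched in
Python roughly as follows.  This is a minimal sketch, not code from this
thread: open_checked() and DiskseqMismatch are hypothetical names, the
expected sequence number is assumed to have been learned earlier (for
example from udev's DISKSEQ property), and the ioctl number is the
x86/arm64 encoding of BLKGETDISKSEQ.

```python
import fcntl
import os
import struct

# Assumption: BLKGETDISKSEQ is _IOR(0x12, 128, __u64), encoded here the way
# x86 and arm64 encode it. Some architectures use different direction bits.
BLKGETDISKSEQ = 0x80081280


class DiskseqMismatch(Exception):
    """The opened device is not the one the caller expected (hypothetical)."""


def get_diskseq(fd):
    """Ask the kernel for the sequence number of the disk open on fd."""
    buf = bytearray(8)                   # space for one __u64
    fcntl.ioctl(fd, BLKGETDISKSEQ, buf)  # the kernel fills buf in place
    return struct.unpack("=Q", buf)[0]


def open_checked(path, expected_diskseq, flags=os.O_RDONLY, read_seq=get_diskseq):
    """Open path; return the fd only if its diskseq matches, else raise."""
    fd = os.open(path, flags | os.O_CLOEXEC)
    try:
        actual = read_seq(fd)
        if actual != expected_diskseq:
            raise DiskseqMismatch(
                f"{path}: diskseq {actual}, expected {expected_diskseq}")
    except BaseException:
        os.close(fd)  # never hand back an fd for the wrong device
        raise
    return fd
```

The caller has to keep using the returned file descriptor for all further
work; reopening the path would reintroduce the race.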

> > A
> > filesystem would then be mounted at "/dev/disk/by-diskseq" that provides
> > for race-free opening of these paths.
> 
> How would it be any less race-free than just open("/dev/sda1") is?

Assuming you meant "more race-free", the answer is that /dev/sda1 is not
guaranteed to always point to the same device.  It can change if the
user unplugs their USB hard drive and plugs in a new one.  The problem
is much more severe for virtual devices, such as /dev/loop* or
/dev/dm-*, which can be created and destroyed quite frequently.
If a diskseqfs is implemented and mounted on /dev/disk/by-diskseq,
opening /dev/disk/by-diskseq/1 will always either return the same device
every time, or return an error if the original device no longer exists.

> > This could be implemented in
> > userspace using FUSE, either with difficulty using the current kernel
> > API, or easily and efficiently using a new kernel API for opening a
> > block device by diskseq + partition.  However, I think this should be
> > handled by the Linux kernel itself.
> > 
> > What would be necessary to get this into the kernel?
> 
> Get what exactly?  I don't see anything the kernel needs to do here
> specifically.  Normally block devices are accessed using mount(2), not
> open(2).  Do you want a new mount(2)-type api?

I would like to have a filesystem, which will typically be mounted on
/dev/disk/by-diskseq, such that:

- Opening /dev/disk/by-diskseq/$DISKSEQ always returns a device with
  sequence number $DISKSEQ or an error.
- Opening /dev/disk/by-diskseq/${DISKSEQ}p${PARTITION} always returns
  partition $PARTITION of the device with diskseq $DISKSEQ or an error.
- If a device with diskseq $DISKSEQ exists, opening
  /dev/disk/by-diskseq/$DISKSEQ will return a file descriptor to the
  device, provided the user has sufficient permissions and no errors
  happen.
- If a device with diskseq $DISKSEQ exists and has a partition
  $PARTITION, opening /dev/disk/by-diskseq/${DISKSEQ}p${PARTITION} will
  return a file descriptor to partition $PARTITION of the device
  $DISKSEQ, provided the user has sufficient permissions and no errors
  happen.
- Listing /dev/disk/by-diskseq will enumerate all path names for which
  an open could succeed.

Obviously /dev/disk/by-diskseq can be replaced with any other path at
which diskseqfs is mounted, but I expect diskseqfs to typically be
mounted at that path.
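
The naming rules in the list above can be made concrete with a small
parser.  This is only a sketch of the proposed (not yet existing)
diskseqfs name grammar: parse_diskseq_name() is a hypothetical helper, and
rejecting zero and leading zeros is an assumption made here so that every
device has exactly one valid name.

```python
import re

# "<DISKSEQ>" names a whole disk, "<DISKSEQ>p<PARTITION>" names a partition.
# Assumption: both numbers are positive decimals without leading zeros.
_NAME_RE = re.compile(r"([1-9][0-9]*)(?:p([1-9][0-9]*))?")

# diskseq is a 64-bit counter in the kernel, so larger values cannot exist.
_DISKSEQ_MAX = 2**64 - 1


def parse_diskseq_name(name):
    """Split a directory entry name into (diskseq, partition or None)."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a diskseq name: {name!r}")
    diskseq = int(match.group(1))
    if diskseq > _DISKSEQ_MAX:
        raise ValueError(f"diskseq out of range: {name!r}")
    partition = int(match.group(2)) if match.group(2) else None
    return diskseq, partition
```

With this grammar, "12" refers to the disk with diskseq 12 and "12p3" to
its third partition, while names such as "p3" or "012" are rejected.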

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Race-free block device opening
  2022-04-26 21:31   ` Demi Marie Obenour
@ 2022-04-26 22:07     ` Demi Marie Obenour
  0 siblings, 0 replies; 6+ messages in thread
From: Demi Marie Obenour @ 2022-04-26 22:07 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Lennart Poettering
  Cc: Linux Kernel Mailing List, Linux Block Mailing List,
	Linux Filesystem Mailing List

Also bringing in Lennart Poettering.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Race-free block device opening
  2022-04-26 18:12 Race-free block device opening Demi Marie Obenour
  2022-04-26 18:35 ` Greg Kroah-Hartman
@ 2022-04-27 13:29 ` James Bottomley
  2022-05-07 11:35   ` Demi Marie Obenour
  1 sibling, 1 reply; 6+ messages in thread
From: James Bottomley @ 2022-04-27 13:29 UTC (permalink / raw)
  To: Demi Marie Obenour, Greg Kroah-Hartman
  Cc: Linux Kernel Mailing List, Linux Block Mailing List,
	Linux Filesystem Mailing List

On Tue, 2022-04-26 at 14:12 -0400, Demi Marie Obenour wrote:
> Right now, opening block devices in a race-free way is incredibly
> hard.

Could you be more specific about what the race you're having problems
with is?  What is racing?

> The only reasonable approach I know of is sd_device_new_from_path() +
> sd_device_open(), and is only available in systemd git main.  It also
> requires waiting on systemd-udev to have processed udev rules, which
> can be a bottleneck.

This doesn't actually seem to be in my copy of systemd.

>   There are better approaches in various special cases, such as using
> device-mapper ioctls to check that the device one has opened still
> has the name and/or UUID one expects.  However, none of them works
> for a plain call to open(2).

Just so we're clear: if you call open on, say /dev/sdb1 and something
happens to hot unplug and then replug a different device under that
node, the file descriptor you got at open does *not* point to the new
node.  It points to a dead device responder that errors everything.

The point being once you open() something, the file descriptor is
guaranteed to point to the same device (or error).

> A much better approach would be for udev to point its symlinks at
> "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.  A
> filesystem would then be mounted at "/dev/disk/by-diskseq" that
> provides for race-free opening of these paths.  This could be
> implemented in userspace using FUSE, either with difficulty using the
> current kernel API, or easily and efficiently using a new kernel API
> for opening a block device by diskseq + partition.  However, I think
> this should be handled by the Linux kernel itself.
> 
> What would be necessary to get this into the kernel?  I would like to
> implement this, but I don’t have the time to do so anytime soon.  Is
> anyone else interested in taking this on?  I suspect the kernel code
> needed to implement this would be quite a bit smaller than the FUSE
> implementation.

So it sounds like the problem is you want to be sure that the device
doesn't change after you've called libblkid to identify it but before
you call open?  If that's so, the way you do this in userspace is to
call libblkid again after the open.  If the before and after id match,
you're as sure as you can be the open was of the right device.

James



* Re: Race-free block device opening
  2022-04-27 13:29 ` James Bottomley
@ 2022-05-07 11:35   ` Demi Marie Obenour
  0 siblings, 0 replies; 6+ messages in thread
From: Demi Marie Obenour @ 2022-05-07 11:35 UTC (permalink / raw)
  To: James Bottomley, Greg Kroah-Hartman
  Cc: Linux Kernel Mailing List, Linux Block Mailing List,
	Linux Filesystem Mailing List

On Wed, Apr 27, 2022 at 09:29:12AM -0400, James Bottomley wrote:
> On Tue, 2022-04-26 at 14:12 -0400, Demi Marie Obenour wrote:
> > Right now, opening block devices in a race-free way is incredibly
> > hard.
> 
> Could you be more specific about what the race you're having problems
> with is?  What is racing?

If I open /dev/mapper/qubes_dom0-vm--sys--net--private, it is possible
that something has destroyed the corresponding device and created a new
one with the same kernel name, *before* udev has managed to unlink the
device node.  As a result, I wind up opening the wrong device.

> > The only reasonable approach I know of is sd_device_new_from_path() +
> > sd_device_open(), and is only available in systemd git main.  It also
> > requires waiting on systemd-udev to have processed udev rules, which
> > can be a bottleneck.
> 
> This doesn't actually seem to be in my copy of systemd.

That’s because it is not in any release yet.

> >   There are better approaches in various special cases, such as using
> > device-mapper ioctls to check that the device one has opened still
> > has the name and/or UUID one expects.  However, none of them works
> > for a plain call to open(2).
> 
> Just so we're clear: if you call open on, say /dev/sdb1 and something
> happens to hot unplug and then replug a different device under that
> node, the file descriptor you got at open does *not* point to the new
> node.  It points to a dead device responder that errors everything.
> 
> The point being once you open() something, the file descriptor is
> guaranteed to point to the same device (or error).

That doesn’t help if the unplug and replug happens between passing the
path and udev having purged the now-stale symlink.

> > A much better approach would be for udev to point its symlinks at
> > "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> > "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.  A
> > filesystem would then be mounted at "/dev/disk/by-diskseq" that
> > provides for race-free opening of these paths.  This could be
> > implemented in userspace using FUSE, either with difficulty using the
> > current kernel API, or easily and efficiently using a new kernel API
> > for opening a block device by diskseq + partition.  However, I think
> > this should be handled by the Linux kernel itself.
> > 
> > What would be necessary to get this into the kernel?  I would like to
> > implement this, but I don’t have the time to do so anytime soon.  Is
> > anyone else interested in taking this on?  I suspect the kernel code
> > needed to implement this would be quite a bit smaller than the FUSE
> > implementation.
> 
> So it sounds like the problem is you want to be sure that the device
> doesn't change after you've called libblkid to identify it but before
> you call open?  If that's so, the way you do this in userspace is to
> call libblkid again after the open.  If the before and after id match,
> you're as sure as you can be the open was of the right device.

The devices I am working with are raw-format VM disks that contain
untrusted data.  They are identified not by their content, which the VM
has complete control over, but by various sysfs attributes such as
dm/name and dm/uuid.  And they need to be passed to interfaces, such as
libvirt and cryptsetup, that only accept device paths.
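
As a rough illustration of identifying a device by those sysfs attributes
rather than by its content, a program that can hold its own file
descriptor might check the dm name after opening.  This is a sketch under
assumptions: read_dm_identity() and open_dm_checked() are hypothetical
helpers, the attributes are assumed to be reachable under
/sys/dev/block/MAJOR:MINOR/dm/, and the open file descriptor is assumed to
keep the device number from being recycled while the check runs.

```python
import os
import stat


def read_dm_identity(major, minor, sysfs_root="/sys"):
    """Return (name, uuid) from the dm/ attributes of device major:minor."""
    base = os.path.join(sysfs_root, "dev", "block", f"{major}:{minor}", "dm")
    values = []
    for attr in ("name", "uuid"):
        with open(os.path.join(base, attr)) as f:
            values.append(f.read().rstrip("\n"))  # sysfs appends a newline
    return tuple(values)


def open_dm_checked(path, expected_name, sysfs_root="/sys"):
    """Open path and keep the fd only if it is the expected dm device."""
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        st = os.fstat(fd)  # inspect what was actually opened, not the path
        if not stat.S_ISBLK(st.st_mode):
            raise ValueError(f"{path} is not a block device")
        name, _uuid = read_dm_identity(
            os.major(st.st_rdev), os.minor(st.st_rdev), sysfs_root)
        if name != expected_name:
            raise ValueError(f"{path} is {name!r}, expected {expected_name!r}")
    except BaseException:
        os.close(fd)  # do not leak an fd for the wrong device
        raise
    return fd
```

This only helps programs that perform the open themselves, which is
exactly the limitation with path-only interfaces.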

I can work around this in the case of cryptsetup by using the
libcryptsetup library and/or holding a file descriptor open, but neither
of those will work for libvirt since libvirtd is a separate process and
I cannot pass a file descriptor to it.  Furthermore, there is no way to
make libvirtd do any post-open() checking on the file descriptor it has
obtained.  While I plan to add a workaround in libxl and blkback for
loop and device-mapper devices, it is not reasonable to expect every
userspace tool to do the same.

The approach I am suggesting avoids this problem entirely, because
/dev/mapper/qubes_dom0-vm--sys--net--private is now a symlink to a
device node under /dev/disk/by-diskseq/$DISKSEQ.  Those are never, ever
reused.  When the device goes away, the device node goes away too, and
so any attempt to open the symlink (without O_PATH|O_NOFOLLOW) gets
-ENOENT as it should.
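
The failure mode described here relies on ordinary symlink semantics,
which can be demonstrated without any block devices.  In this sketch the
file named "7" merely stands in for a by-diskseq device node and
"vm-private" for the /dev/mapper symlink; both names and the helper
function are hypothetical.

```python
import errno
import os
import tempfile


def stale_symlink_errno():
    """Open a symlink whose target has gone away and report the errno."""
    with tempfile.TemporaryDirectory() as root:
        node = os.path.join(root, "7")           # stand-in for by-diskseq/7
        link = os.path.join(root, "vm-private")  # stand-in for /dev/mapper/...
        open(node, "w").close()
        os.symlink(node, link)

        fd = os.open(link, os.O_RDONLY)  # target exists: the open succeeds
        os.close(fd)

        os.unlink(node)  # the "device" goes away and is never recreated
        try:
            fd = os.open(link, os.O_RDONLY)
        except OSError as exc:
            return exc.errno  # the dangling symlink cannot be followed
        os.close(fd)
        return None


if __name__ == "__main__":
    print(errno.errorcode[stale_symlink_errno()])
```

Because the target name is never reused, a stale symlink can only fail; it
cannot silently resolve to a different device.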
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

