* [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO @ 2012-07-20 16:30 Paolo Bonzini 2012-08-01 15:53 ` Paolo Bonzini 2012-09-11 16:59 ` Tejun Heo 0 siblings, 2 replies; 24+ messages in thread From: Paolo Bonzini @ 2012-07-20 16:30 UTC (permalink / raw) To: linux-kernel; +Cc: axboe, linux-scsi These commands cannot be issued right now without giving CAP_SYS_RAWIO to the process who wishes to send them. These commands can be useful also to non-privileged programs who have access to the block devices. For example a virtual machine monitor needs them to forward trim/discard to host disks. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- block/scsi_ioctl.c | 3 ++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 260fa80..dd71f18 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) /* Basic writing commands */ __set_bit(WRITE_6, filter->write_ok); __set_bit(WRITE_10, filter->write_ok); + __set_bit(WRITE_SAME, filter->write_ok); __set_bit(WRITE_VERIFY, filter->write_ok); __set_bit(WRITE_12, filter->write_ok); __set_bit(WRITE_VERIFY_12, filter->write_ok); __set_bit(WRITE_16, filter->write_ok); + __set_bit(WRITE_SAME_16, filter->write_ok); __set_bit(WRITE_LONG, filter->write_ok); __set_bit(WRITE_LONG_2, filter->write_ok); __set_bit(ERASE, filter->write_ok); + __set_bit(UNMAP, filter->write_ok); __set_bit(GPCMD_MODE_SELECT_10, filter->write_ok); __set_bit(MODE_SELECT, filter->write_ok); __set_bit(LOG_SELECT, filter->write_ok); ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-07-20 16:30 [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO Paolo Bonzini @ 2012-08-01 15:53 ` Paolo Bonzini 2012-08-28 11:04 ` Paolo Bonzini 2012-09-11 16:59 ` Tejun Heo 1 sibling, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-08-01 15:53 UTC (permalink / raw) To: axboe; +Cc: linux-kernel, linux-scsi Il 20/07/2012 18:30, Paolo Bonzini ha scritto: > These commands cannot be issued right now without giving CAP_SYS_RAWIO to > the process who wishes to send them. These commands can be useful also to > non-privileged programs who have access to the block devices. For example > a virtual machine monitor needs them to forward trim/discard to host disks. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > block/scsi_ioctl.c | 3 ++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c > index 260fa80..dd71f18 100644 > --- a/block/scsi_ioctl.c > +++ b/block/scsi_ioctl.c > @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) > /* Basic writing commands */ > __set_bit(WRITE_6, filter->write_ok); > __set_bit(WRITE_10, filter->write_ok); > + __set_bit(WRITE_SAME, filter->write_ok); > __set_bit(WRITE_VERIFY, filter->write_ok); > __set_bit(WRITE_12, filter->write_ok); > __set_bit(WRITE_VERIFY_12, filter->write_ok); > __set_bit(WRITE_16, filter->write_ok); > + __set_bit(WRITE_SAME_16, filter->write_ok); > __set_bit(WRITE_LONG, filter->write_ok); > __set_bit(WRITE_LONG_2, filter->write_ok); > __set_bit(ERASE, filter->write_ok); > + __set_bit(UNMAP, filter->write_ok); > __set_bit(GPCMD_MODE_SELECT_10, filter->write_ok); > __set_bit(MODE_SELECT, filter->write_ok); > __set_bit(LOG_SELECT, filter->write_ok); > Jens, can this go in 3.6 as well? Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-08-01 15:53 ` Paolo Bonzini @ 2012-08-28 11:04 ` Paolo Bonzini 2012-09-05 14:41 ` [Ping^3] " Paolo Bonzini 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-08-28 11:04 UTC (permalink / raw) To: axboe; +Cc: linux-kernel, linux-scsi Il 01/08/2012 17:53, Paolo Bonzini ha scritto: > Il 20/07/2012 18:30, Paolo Bonzini ha scritto: >> These commands cannot be issued right now without giving CAP_SYS_RAWIO to >> the process who wishes to send them. These commands can be useful also to >> non-privileged programs who have access to the block devices. For example >> a virtual machine monitor needs them to forward trim/discard to host disks. >> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> block/scsi_ioctl.c | 3 ++ >> 1 files changed, 3 insertions(+), 0 deletions(-) >> >> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c >> index 260fa80..dd71f18 100644 >> --- a/block/scsi_ioctl.c >> +++ b/block/scsi_ioctl.c >> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) >> /* Basic writing commands */ >> __set_bit(WRITE_6, filter->write_ok); >> __set_bit(WRITE_10, filter->write_ok); >> + __set_bit(WRITE_SAME, filter->write_ok); >> __set_bit(WRITE_VERIFY, filter->write_ok); >> __set_bit(WRITE_12, filter->write_ok); >> __set_bit(WRITE_VERIFY_12, filter->write_ok); >> __set_bit(WRITE_16, filter->write_ok); >> + __set_bit(WRITE_SAME_16, filter->write_ok); >> __set_bit(WRITE_LONG, filter->write_ok); >> __set_bit(WRITE_LONG_2, filter->write_ok); >> __set_bit(ERASE, filter->write_ok); >> + __set_bit(UNMAP, filter->write_ok); >> __set_bit(GPCMD_MODE_SELECT_10, filter->write_ok); >> __set_bit(MODE_SELECT, filter->write_ok); >> __set_bit(LOG_SELECT, filter->write_ok); >> > > Jens, > > can this go in 3.6 as well? Another ping... Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-08-28 11:04 ` Paolo Bonzini @ 2012-09-05 14:41 ` Paolo Bonzini 2012-09-05 20:18 ` Ric Wheeler 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-05 14:41 UTC (permalink / raw) To: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen Cc: linux-kernel, linux-scsi Il 28/08/2012 13:04, Paolo Bonzini ha scritto: > Il 01/08/2012 17:53, Paolo Bonzini ha scritto: >> Il 20/07/2012 18:30, Paolo Bonzini ha scritto: >>> These commands cannot be issued right now without giving CAP_SYS_RAWIO to >>> the process who wishes to send them. These commands can be useful also to >>> non-privileged programs who have access to the block devices. For example >>> a virtual machine monitor needs them to forward trim/discard to host disks. >>> >>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >>> --- >>> block/scsi_ioctl.c | 3 ++ >>> 1 files changed, 3 insertions(+), 0 deletions(-) >>> >>> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c >>> index 260fa80..dd71f18 100644 >>> --- a/block/scsi_ioctl.c >>> +++ b/block/scsi_ioctl.c >>> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) >>> /* Basic writing commands */ >>> __set_bit(WRITE_6, filter->write_ok); >>> __set_bit(WRITE_10, filter->write_ok); >>> + __set_bit(WRITE_SAME, filter->write_ok); >>> __set_bit(WRITE_VERIFY, filter->write_ok); >>> __set_bit(WRITE_12, filter->write_ok); >>> __set_bit(WRITE_VERIFY_12, filter->write_ok); >>> __set_bit(WRITE_16, filter->write_ok); >>> + __set_bit(WRITE_SAME_16, filter->write_ok); >>> __set_bit(WRITE_LONG, filter->write_ok); >>> __set_bit(WRITE_LONG_2, filter->write_ok); >>> __set_bit(ERASE, filter->write_ok); >>> + __set_bit(UNMAP, filter->write_ok); >>> __set_bit(GPCMD_MODE_SELECT_10, filter->write_ok); >>> __set_bit(MODE_SELECT, filter->write_ok); >>> __set_bit(LOG_SELECT, filter->write_ok); >>> >> >> Jens, >> >> can this go in 3.6 as well? > > Another ping... Ping & adding some more folks hoping to get a Reviewed-by or to be screamed at. Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-05 14:41 ` [Ping^3] " Paolo Bonzini @ 2012-09-05 20:18 ` Ric Wheeler 2012-09-06 6:31 ` Paolo Bonzini 0 siblings, 1 reply; 24+ messages in thread From: Ric Wheeler @ 2012-09-05 20:18 UTC (permalink / raw) To: Paolo Bonzini Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel, linux-scsi On 09/05/2012 10:41 AM, Paolo Bonzini wrote: > Il 28/08/2012 13:04, Paolo Bonzini ha scritto: >> Il 01/08/2012 17:53, Paolo Bonzini ha scritto: >>> Il 20/07/2012 18:30, Paolo Bonzini ha scritto: >>>> These commands cannot be issued right now without giving CAP_SYS_RAWIO to >>>> the process who wishes to send them. These commands can be useful also to >>>> non-privileged programs who have access to the block devices. For example >>>> a virtual machine monitor needs them to forward trim/discard to host disks. >>>> >>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >>>> --- >>>> block/scsi_ioctl.c | 3 ++ >>>> 1 files changed, 3 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c >>>> index 260fa80..dd71f18 100644 >>>> --- a/block/scsi_ioctl.c >>>> +++ b/block/scsi_ioctl.c >>>> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) >>>> /* Basic writing commands */ >>>> __set_bit(WRITE_6, filter->write_ok); >>>> __set_bit(WRITE_10, filter->write_ok); >>>> + __set_bit(WRITE_SAME, filter->write_ok); >>>> __set_bit(WRITE_VERIFY, filter->write_ok); >>>> __set_bit(WRITE_12, filter->write_ok); >>>> __set_bit(WRITE_VERIFY_12, filter->write_ok); >>>> __set_bit(WRITE_16, filter->write_ok); >>>> + __set_bit(WRITE_SAME_16, filter->write_ok); >>>> __set_bit(WRITE_LONG, filter->write_ok); >>>> __set_bit(WRITE_LONG_2, filter->write_ok); >>>> __set_bit(ERASE, filter->write_ok); >>>> + __set_bit(UNMAP, filter->write_ok); >>>> __set_bit(GPCMD_MODE_SELECT_10, filter->write_ok); >>>> __set_bit(MODE_SELECT, filter->write_ok); >>>> __set_bit(LOG_SELECT, filter->write_ok); >>>> >>> Jens, >>> >>> can this go in 3.6 as well? >> Another ping... > Ping & adding some more folks hoping to get a Reviewed-by or to be > screamed at. > > Paolo Hi Paolo, Both of these commands are destructive. WRITE_SAME (if done without the discard bits set) can also take a very long time to be destructive and tie up the storage. I think that restricting them to CAP_SYS_RAWIO seems reasonable - better to vet and give the appropriate apps the needed capability than to widely open up the safety check? thanks! Ric ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-05 20:18 ` Ric Wheeler @ 2012-09-06 6:31 ` Paolo Bonzini 2012-09-06 11:31 ` Ric Wheeler 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-06 6:31 UTC (permalink / raw) To: Ric Wheeler Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel, linux-scsi Il 05/09/2012 22:18, Ric Wheeler ha scritto: >> > > Hi Paolo, > > Both of these commands are destructive. WRITE_SAME (if done without the > discard bits set) can also take a very long time to be destructive and > tie up the storage. FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I don't think WRITE SAME slowness is limited to the case where a real write is requested; discarding can be just as slow). Also, the two new commands are anyway restricted to programs that have write access to the disk. If you have read-only access, you won't be able to issue any destructive command (there is one exception, START STOP UNIT is allowed even with read-only capability and is somewhat destructive). Honestly, the only reason why these two commands weren't included, is that the current whitelist is heavily tailored towards CD/DVD burning. > I think that restricting them to CAP_SYS_RAWIO seems reasonable - better > to vet and give the appropriate apps the needed capability than to > widely open up the safety check? CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is insecure. Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-06 6:31 ` Paolo Bonzini @ 2012-09-06 11:31 ` Ric Wheeler 2012-09-06 11:49 ` Paolo Bonzini 0 siblings, 1 reply; 24+ messages in thread From: Ric Wheeler @ 2012-09-06 11:31 UTC (permalink / raw) To: Paolo Bonzini Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel, linux-scsi On 09/06/2012 02:31 AM, Paolo Bonzini wrote: > Il 05/09/2012 22:18, Ric Wheeler ha scritto: >> Hi Paolo, >> >> Both of these commands are destructive. WRITE_SAME (if done without the >> discard bits set) can also take a very long time to be destructive and >> tie up the storage. > FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I > don't think WRITE SAME slowness is limited to the case where a real > write is requested; discarding can be just as slow). > > Also, the two new commands are anyway restricted to programs that have > write access to the disk. If you have read-only access, you won't be > able to issue any destructive command (there is one exception, START > STOP UNIT is allowed even with read-only capability and is somewhat > destructive). > > Honestly, the only reason why these two commands weren't included, is > that the current whitelist is heavily tailored towards CD/DVD burning. Hi Paolo, I assume that FORMAT_UNIT is for CD/DVD needs - not sure what a S-ATA disk would do with that. If it is destructive, we should probably think about how to make it more secure and see how many applications we would break. > >> I think that restricting them to CAP_SYS_RAWIO seems reasonable - better >> to vet and give the appropriate apps the needed capability than to >> widely open up the safety check? > CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is > insecure. > > Paolo I don't see allowing anyone who can open the device to zero the data as better though :) Ric ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-06 11:31 ` Ric Wheeler @ 2012-09-06 11:49 ` Paolo Bonzini 2012-09-06 12:08 ` Ric Wheeler 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-06 11:49 UTC (permalink / raw) To: Ric Wheeler Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel, linux-scsi Il 06/09/2012 13:31, Ric Wheeler ha scritto: >>> Both of these commands are destructive. WRITE_SAME (if done without the >>> discard bits set) can also take a very long time to be destructive and >>> tie up the storage. >> >> FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I >> don't think WRITE SAME slowness is limited to the case where a real >> write is requested; discarding can be just as slow). >> >> Also, the two new commands are anyway restricted to programs that have >> write access to the disk. If you have read-only access, you won't be >> able to issue any destructive command (there is one exception, START >> STOP UNIT is allowed even with read-only capability and is somewhat >> destructive). >> >> Honestly, the only reason why these two commands weren't included, is >> that the current whitelist is heavily tailored towards CD/DVD burning. > > I assume that FORMAT_UNIT is for CD/DVD needs - not sure what a S-ATA > disk would do with that. According to the standard, the translation layer can write a user-provided pattern to every sector in the disk. It's an optional feature and libata doesn't do that, but it is still possible. > If it is destructive, we should probably think > about how to make it more secure and see how many applications we would > break. We have filesystem permissions to make it secure. They already do. >>> I think that restricting them to CAP_SYS_RAWIO seems reasonable - better >>> to vet and give the appropriate apps the needed capability than to >>> widely open up the safety check? >> CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is >> insecure. > > I don't see allowing anyone who can open the device to zero the data as > better though :) Note: anyone who can open it for writing! And they can just as well issue WRITE, it just takes a little more effort than with WRITE SAME. :) If you only have read access, you cannot issue WRITE or FORMAT UNIT, and with this patch you will not be able to issue WRITE SAME. I'm all for providing more versatile filters---which can be both stricter and looser depending on the configuration than the default. For example http://www.redhat.com/archives/libvir-list/2012-June/msg00505.html is a possible spec for BPF-based filtering of CDBs. However, the default whitelist (which is all we have for now) should provide a reasonable default for a user that already has been granted access to the device by the normal access control mechanisms. I believe WRITE SAME and UNMAP fit the definition of a reasonable default. Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-06 11:49 ` Paolo Bonzini @ 2012-09-06 12:08 ` Ric Wheeler 2012-09-06 12:36 ` Paolo Bonzini 0 siblings, 1 reply; 24+ messages in thread From: Ric Wheeler @ 2012-09-06 12:08 UTC (permalink / raw) To: Paolo Bonzini Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel, linux-scsi On 09/06/2012 07:49 AM, Paolo Bonzini wrote: > Il 06/09/2012 13:31, Ric Wheeler ha scritto: >>>> Both of these commands are destructive. WRITE_SAME (if done without the >>>> discard bits set) can also take a very long time to be destructive and >>>> tie up the storage. >>> FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I >>> don't think WRITE SAME slowness is limited to the case where a real >>> write is requested; discarding can be just as slow). >>> >>> Also, the two new commands are anyway restricted to programs that have >>> write access to the disk. If you have read-only access, you won't be >>> able to issue any destructive command (there is one exception, START >>> STOP UNIT is allowed even with read-only capability and is somewhat >>> destructive). >>> >>> Honestly, the only reason why these two commands weren't included, is >>> that the current whitelist is heavily tailored towards CD/DVD burning. >> I assume that FORMAT_UNIT is for CD/DVD needs - not sure what a S-ATA >> disk would do with that. > According to the standard, the translation layer can write a > user-provided pattern to every sector in the disk. It's an optional > feature and libata doesn't do that, but it is still possible. It is not possible today with our stack though, any patch that would change that would also need to be vetted. > >> If it is destructive, we should probably think >> about how to make it more secure and see how many applications we would >> break. > We have filesystem permissions to make it secure. They already do. > >>>> I think that restricting them to CAP_SYS_RAWIO seems reasonable - better >>>> to vet and give the appropriate apps the needed capability than to >>>> widely open up the safety check? >>> CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is >>> insecure. >> I don't see allowing anyone who can open the device to zero the data as >> better though :) > Note: anyone who can open it for writing! And they can just as well > issue WRITE, it just takes a little more effort than with WRITE SAME. :) > If you only have read access, you cannot issue WRITE or FORMAT UNIT, > and with this patch you will not be able to issue WRITE SAME. > > I'm all for providing more versatile filters---which can be both > stricter and looser depending on the configuration than the default. > For example > http://www.redhat.com/archives/libvir-list/2012-June/msg00505.html is a > possible spec for BPF-based filtering of CDBs. > > However, the default whitelist (which is all we have for now) should > provide a reasonable default for a user that already has been granted > access to the device by the normal access control mechanisms. I believe > WRITE SAME and UNMAP fit the definition of a reasonable default. > > Paolo This just seems like an argument over whether or not capabilities make sense. In general, anything as destructive as a single CDB that can kill all of your data should be tightly controlled. Pushing more code in the data path is not where we are going - we routinely need to disable IO scheduling for example when driving IO to high speed/low latency devices and are actively looking at how to tackle other performance bottlenecks in the stack. I don't see a strong reason that our existing scheme (root or CAP_SYS_RAWIO access) prevents you from doing what you need to do. Ric ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-06 12:08 ` Ric Wheeler @ 2012-09-06 12:36 ` Paolo Bonzini 2012-09-06 14:20 ` Lukáš Czerner 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-06 12:36 UTC (permalink / raw) To: Ric Wheeler Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel, linux-scsi Il 06/09/2012 14:08, Ric Wheeler ha scritto: >> According to the standard, the translation layer can write a >> user-provided pattern to every sector in the disk. It's an optional >> feature and libata doesn't do that, but it is still possible. > > It is not possible today with our stack though, any patch that would > change that would also need to be vetted. It is not possible with SATA disks, but native SCSI disks might well interpret FORMAT UNIT destructively. >>> I don't see allowing anyone who can open the device to zero the data as >>> better though :) >> Note: anyone who can open it for writing! And they can just as well >> issue WRITE, it just takes a little more effort than with WRITE SAME. :) >> If you only have read access, you cannot issue WRITE or FORMAT UNIT, >> and with this patch you will not be able to issue WRITE SAME. > > This just seems like an argument over whether or not capabilities make > sense. In general, anything as destructive as a single CDB that can kill > all of your data should be tightly controlled. In practice, a single write to the first MB of the disk is just as destructive. For that you do not even need a SCSI command. > Pushing more code in the data path is not where we are going - we > routinely need to disable IO scheduling for example when driving IO to > high speed/low latency devices and are actively looking at how to tackle > other performance bottlenecks in the stack. I am not talking about the regular data path, only of SG_IO. > I don't see a strong reason that our existing scheme (root or > CAP_SYS_RAWIO access) prevents you from doing what you need to do. Here are three: - CAP_SYS_RAWIO partly bypasses DAC; you can issue destructive commands even if you only opened the disk for reading. CAP_SYS_RAWIO also gives access to _really_ destructive commands (WRITE BUFFER and PERSISTENT RESERVE OUT for example). - CAP_SYS_RAWIO lets you send SCSI commands to partitions, and they will gladly read/write the disk going outside the boundaries of the partition. Changing this behavior was rejected upstream already. - CAP_SYS_RAWIO also gives access to I/O ports, mmap at address 0, and too many other insecure things. All the above mean that: - any application using CAP_SYS_RAWIO would have to implement its own whitelisting, even if just to duplicate what is done in the kernel; - exploiting a CAP_SYS_RAWIO process leads to root too easily, and it is not possible to give the capability to anything that will run in a hostile environment (in my case QEMU). Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-06 12:36 ` Paolo Bonzini @ 2012-09-06 14:20 ` Lukáš Czerner 0 siblings, 0 replies; 24+ messages in thread From: Lukáš Czerner @ 2012-09-06 14:20 UTC (permalink / raw) To: Paolo Bonzini Cc: Ric Wheeler, axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel, linux-scsi On Thu, 6 Sep 2012, Paolo Bonzini wrote: > Date: Thu, 06 Sep 2012 14:36:53 +0200 > From: Paolo Bonzini <pbonzini@redhat.com> > To: Ric Wheeler <ricwheeler@gmail.com> > Cc: axboe@kernel.dk, Mike Snitzer <snitzer@redhat.com>, > Alan Cox <alan@lxorguk.ukuu.org.uk>, > Martin K. Petersen <martin.petersen@oracle.com>, > linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org > Subject: Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without > CAP_SYS_RAWIO > > Il 06/09/2012 14:08, Ric Wheeler ha scritto: > >> According to the standard, the translation layer can write a > >> user-provided pattern to every sector in the disk. It's an optional > >> feature and libata doesn't do that, but it is still possible. > > > > It is not possible today with our stack though, any patch that would > > change that would also need to be vetted. > > It is not possible with SATA disks, but native SCSI disks might well > interpret FORMAT UNIT destructively. > > >>> I don't see allowing anyone who can open the device to zero the data as > >>> better though :) > >> Note: anyone who can open it for writing! And they can just as well > >> issue WRITE, it just takes a little more effort than with WRITE SAME. :) > >> If you only have read access, you cannot issue WRITE or FORMAT UNIT, > >> and with this patch you will not be able to issue WRITE SAME. > > > > This just seems like an argument over whether or not capabilities make > > sense. In general, anything as destructive as a single CDB that can kill > > all of your data should be tightly controlled. > > In practice, a single write to the first MB of the disk is just as > destructive. For that you do not even need a SCSI command. > > > Pushing more code in the data path is not where we are going - we > > routinely need to disable IO scheduling for example when driving IO to > > high speed/low latency devices and are actively looking at how to tackle > > other performance bottlenecks in the stack. > > I am not talking about the regular data path, only of SG_IO. > > > I don't see a strong reason that our existing scheme (root or > > CAP_SYS_RAWIO access) prevents you from doing what you need to do. > > Here are three: > > - CAP_SYS_RAWIO partly bypasses DAC; you can issue destructive commands > even if you only opened the disk for reading. CAP_SYS_RAWIO also gives > access to _really_ destructive commands (WRITE BUFFER and PERSISTENT > RESERVE OUT for example). > > - CAP_SYS_RAWIO lets you send SCSI commands to partitions, and they will > gladly read/write the disk going outside the boundaries of the > partition. Changing this behavior was rejected upstream already. > > - CAP_SYS_RAWIO also gives access to I/O ports, mmap at address 0, and > too many other insecure things. > > All the above mean that: > > - any application using CAP_SYS_RAWIO would have to implement its own > whitelisting, even if just to duplicate what is done in the kernel; > > - exploiting a CAP_SYS_RAWIO process leads to root too easily, and it is > not possible to give the capability to anything that will run in a > hostile environment (in my case QEMU). So at fist I did not think this is such a good idea however there are several good points you've mentioned. CAP_SYS_RAWIO is indeed too big hammer for this and it is not secure to allow such application to possess such capability. Moreover WRITE_SAME indeed is almost the SAME as WRITE :), only easier and delegated to the storage itself. UNMAP or WRITE_WAME w/ unamp bit is a little bit trickier but thinking about it some more I do not see any real reason why the user with write permission should not be able to use this. Yes, it is not technically write and it has other consequences as well, but none of it seems to be exploitable more than simple write. Moreover looking at BLKDISCARD ioctl there is no such restrictions (obviously) but neither is with TRIM ata command. So with sata SSD you're allowed to use TRIM command if you have write permission to that device. So I guess having this consistent is good idea and considering the points above I think it is ok to allow WRITE_SAME and UNMAP without CAP_SYS_RAWIO. Thanks! -Lukas > > Paolo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-07-20 16:30 [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO Paolo Bonzini 2012-08-01 15:53 ` Paolo Bonzini @ 2012-09-11 16:59 ` Tejun Heo 2012-09-11 17:56 ` Paolo Bonzini 1 sibling, 1 reply; 24+ messages in thread From: Tejun Heo @ 2012-09-11 16:59 UTC (permalink / raw) To: Paolo Bonzini; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley Hello, On Fri, Jul 20, 2012 at 06:30:01PM +0200, Paolo Bonzini wrote: > These commands cannot be issued right now without giving CAP_SYS_RAWIO to > the process who wishes to send them. These commands can be useful also to > non-privileged programs who have access to the block devices. For example > a virtual machine monitor needs them to forward trim/discard to host disks. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > block/scsi_ioctl.c | 3 ++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c > index 260fa80..dd71f18 100644 > --- a/block/scsi_ioctl.c > +++ b/block/scsi_ioctl.c > @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) > /* Basic writing commands */ > __set_bit(WRITE_6, filter->write_ok); > __set_bit(WRITE_10, filter->write_ok); > + __set_bit(WRITE_SAME, filter->write_ok); > __set_bit(WRITE_VERIFY, filter->write_ok); > __set_bit(WRITE_12, filter->write_ok); > __set_bit(WRITE_VERIFY_12, filter->write_ok); > __set_bit(WRITE_16, filter->write_ok); > + __set_bit(WRITE_SAME_16, filter->write_ok); > __set_bit(WRITE_LONG, filter->write_ok); > __set_bit(WRITE_LONG_2, filter->write_ok); > __set_bit(ERASE, filter->write_ok); > + __set_bit(UNMAP, filter->write_ok); FWIW, I don't think this is the right way to expose functionality which needs management in terms of access control, interpretation (stacking drivers) and serving concurrent users. SG_IO filtering was mostly for cd/dvd burning and other removeable devices. Especially for the former, the possibility of having a properly encapsulated interface was long lost and I think significant part of that is how the underlying technology and its hardware interface developed. We basically just gave up. Fortunately, they're going the way of floppies. So, it wouldn't be a good idea to abuse SG_IO filtering for exposing trim/discard. It's something which should be retired or at least severely restricted in time. I don't think we want to be developing new uses of it. I think trim/discards are fairly easy to abstract and common enough to justify having properly abstracted interface. In fact, we already have block layer interface for it - BLKDISCARD. If it's lacking, let's improve that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 16:59 ` Tejun Heo @ 2012-09-11 17:56 ` Paolo Bonzini 2012-09-11 18:29 ` Tejun Heo 2012-09-12 8:05 ` James Bottomley 0 siblings, 2 replies; 24+ messages in thread From: Paolo Bonzini @ 2012-09-11 17:56 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley Il 11/09/2012 18:59, Tejun Heo ha scritto: > FWIW, I don't think this is the right way to expose functionality > which needs management in terms of access control, interpretation > (stacking drivers) and serving concurrent users. SG_IO filtering was > mostly for cd/dvd burning and other removeable devices. Understood; unfortunately, there is another major user of it (virtualization). If you are passing "raw" LUNs down to a virtual machine, there's no possibility at all to use a properly encapsulated interface and still be able to pass sense data etc. back to the virtual machine. And it's only going to grow now that people are starting to use virtio-scsi. The set of use cases is so variable that no single filter can accomodate all of them: high availability people want persistent reservations, NAS people want trim/discard, but these are just two groups. Someone is using a Windows VM to run vendor tools and wants to have access to vendor-specific commands. You can tell this last group to use root, but not everyone else who is already relying on Unix permissions, SELinux and/or device cgroups to confine their virtual machines. A generic filter (see http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal) would be satisfactory for everyone, but it's also a major undertaking and so far I've not received a single comment about it. > So, it wouldn't be a good idea to abuse SG_IO filtering for exposing > trim/discard. It's something which should be retired or at least > severely restricted in time. I don't think we want to be developing > new uses of it. > > I think trim/discards are fairly easy to abstract and common enough to > justify having properly abstracted interface. In fact, we already > have block layer interface for it - BLKDISCARD. If it's lacking, > let's improve that. I do want to improve the block layer interfaces to avoid that people use SG_IO. But unfortunately this is for a completely different use case. Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 17:56 ` Paolo Bonzini @ 2012-09-11 18:29 ` Tejun Heo 2012-09-11 18:54 ` Paolo Bonzini 2012-09-12 8:05 ` James Bottomley 1 sibling, 1 reply; 24+ messages in thread From: Tejun Heo @ 2012-09-11 18:29 UTC (permalink / raw) To: Paolo Bonzini; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley Hello, Paolo. On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote: > Understood; unfortunately, there is another major user of it > (virtualization). If you are passing "raw" LUNs down to a virtual > machine, there's no possibility at all to use a properly encapsulated Is there still command filtering issue when you're passing "raw" LUNs down? > interface and still be able to pass sense data etc. back to the virtual > machine. And it's only going to grow now that people are starting to > use virtio-scsi. > > The set of use cases is so variable that no single filter can accomodate > all of them: high availability people want persistent reservations, NAS > people want trim/discard, but these are just two groups. Someone is > using a Windows VM to run vendor tools and wants to have access to > vendor-specific commands. > > You can tell this last group to use root, but not everyone else who is > already relying on Unix permissions, SELinux and/or device cgroups to > confine their virtual machines. You listed three - HA w/ persistent reservation, NAS w/ trim/discard and the third which you said that using root would be fine. Dunno much about persistent reservation but I don't see why trim/discard can't use existing block layer facilities whether from userland or virtio-scsi? > A generic filter (see > http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal) > would be satisfactory for everyone, but it's also a major undertaking > and so far I've not received a single comment about it. Maybe I'm just not familiar with the problem space but I really hope things don't come to that. It's not like kernel by itself has to support every single possible use cases. > > So, it wouldn't be a good idea to abuse SG_IO filtering for exposing > > trim/discard. It's something which should be retired or at least > > severely restricted in time. I don't think we want to be developing > > new uses of it. > > > > I think trim/discards are fairly easy to abstract and common enough to > > justify having properly abstracted interface. In fact, we already > > have block layer interface for it - BLKDISCARD. If it's lacking, > > let's improve that. > > I do want to improve the block layer interfaces to avoid that people use > SG_IO. But unfortunately this is for a completely different use case. Hmmm? This was about discard, no? Thanks. -- tejun ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 18:29 ` Tejun Heo @ 2012-09-11 18:54 ` Paolo Bonzini 2012-09-11 19:13 ` Tejun Heo 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-11 18:54 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley Il 11/09/2012 20:29, Tejun Heo ha scritto:> Hello, Paolo. > > On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote: >> Understood; unfortunately, there is another major user of it >> (virtualization). If you are passing "raw" LUNs down to a virtual >> machine, there's no possibility at all to use a properly encapsulated > > Is there still command filtering issue when you're passing "raw" LUNs > down? Yes, the passing down is just a userland program that gets SCSI commands from the guest, sends them via SG_IO, and passes back the result. If the userland program is unprivileged (it usually is), then you go through the filter. >> The set of use cases is so variable that no single filter can accomodate >> all of them: high availability people want persistent reservations, NAS >> people want trim/discard, but these are just two groups. Someone is >> using a Windows VM to run vendor tools and wants to have access to >> vendor-specific commands. >> >> You can tell this last group to use root, but not everyone else who is >> already relying on Unix permissions, SELinux and/or device cgroups to >> confine their virtual machines. > > You listed three - HA w/ persistent reservation, NAS w/ trim/discard > and the third which you said that using root would be fine. Dunno > much about persistent reservation but I don't see why trim/discard > can't use existing block layer facilities whether from userland or > virtio-scsi? This is the userland for virtio-scsi (the kernel part of virtio-scsi is just a driver running in the guest). It can run in two mode: it can do its own SCSI emulation, or it can just relay CDBs and their results. It can (and does) use higher-level services if SCSI emulation is done in userland. In that case, trim/discard can become a BLKDISCARD or a fallocate for example. However, in this case userland doesn't do any emulation and in fact doesn't even need to know that this CDB is a discard. Also, if it fails, there's no way to reconstruct the NAS's sense data to pass it back to the guest. We do a limited amount of "making up" sense data (for example if a command is filtered, all we get is an errno value; and we say it was not recognized), but it should really be as simple and limited as possible. >> A generic filter (see >> http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal) >> would be satisfactory for everyone, but it's also a major undertaking >> and so far I've not received a single comment about it. > > Maybe I'm just not familiar with the problem space but I really hope > things don't come to that. Why not? :) (BTW it was suggested by Alan Cox, that's just my proposal for how to do it). I think that it's a good idea, but it's a big bazooka for the smaller issue of supporting trim/discard. >>> So, it wouldn't be a good idea to abuse SG_IO filtering for exposing >>> trim/discard. It's something which should be retired or at least >>> severely restricted in time. I don't think we want to be developing >>> new uses of it. >>> >>> I think trim/discards are fairly easy to abstract and common enough to >>> justify having properly abstracted interface. In fact, we already >>> have block layer interface for it - BLKDISCARD. If it's lacking, >>> let's improve that. >> >> I do want to improve the block layer interfaces to avoid that people use >> SG_IO. But unfortunately this is for a completely different use case. > > Hmmm? This was about discard, no? One example of block layer interfaces that I want to add is BLKPING, so that you can see if the NAS is reachable. Then SCSI emulation can map the "test unit ready" command to BLKPING. There's a handful of such ioctls that would be useful, such as BLKDISCARD itself. But this is for the other direction, where ioctls are not enough accurate. Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 18:54 ` Paolo Bonzini @ 2012-09-11 19:13 ` Tejun Heo 2012-09-11 19:24 ` Paolo Bonzini 0 siblings, 1 reply; 24+ messages in thread From: Tejun Heo @ 2012-09-11 19:13 UTC (permalink / raw) To: Paolo Bonzini; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley Hello, Paolo. On Tue, Sep 11, 2012 at 08:54:03PM +0200, Paolo Bonzini wrote: > > On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote: > >> Understood; unfortunately, there is another major user of it > >> (virtualization). If you are passing "raw" LUNs down to a virtual > >> machine, there's no possibility at all to use a properly encapsulated > > > > Is there still command filtering issue when you're passing "raw" LUNs > > down? > > Yes, the passing down is just a userland program that gets SCSI > commands from the guest, sends them via SG_IO, and passes back the > result. If the userland program is unprivileged (it usually is), then > you go through the filter. Could being able to bypass the filters for this "you own this LUN" be a solution? Or is it that we still need command filtering for whatever reason? > This is the userland for virtio-scsi (the kernel part of virtio-scsi is just > a driver running in the guest). It can run in two mode: it can do its own > SCSI emulation, or it can just relay CDBs and their results. > > It can (and does) use higher-level services if SCSI emulation is done in > userland. In that case, trim/discard can become a BLKDISCARD or a fallocate > for example. However, in this case userland doesn't do any emulation and in > fact doesn't even need to know that this CDB is a discard. Couldn't it intercept some of them - e.g. RWs and discards? What's the benifit / use case of doing pure bypass? Would the benefits be strong enough to justify whole bpf cdb filtering? > Also, if it fails, there's no way to reconstruct the NAS's sense data to > pass it back to the guest. We do a limited amount of "making up" sense > data (for example if a command is filtered, all we get is an errno value; > and we say it was not recognized), but it should really be as simple > and limited as possible. Yeah, I agree losing sense data could suck but that alone doesn't seem to be a very strong justification for the whole deal and there could be different / smaller ways to solve the sense data problem. > >> A generic filter (see > >> http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal) > >> would be satisfactory for everyone, but it's also a major undertaking > >> and so far I've not received a single comment about it. > > > > Maybe I'm just not familiar with the problem space but I really hope > > things don't come to that. > > Why not? :) (BTW it was suggested by Alan Cox, that's just my proposal for > how to do it). I think that it's a good idea, but it's a big bazooka for > the smaller issue of supporting trim/discard. I guess I mostly wanna know for sure that there's big / strong enough targets for the big bazooka. :) > > Hmmm? This was about discard, no? > > One example of block layer interfaces that I want to add is BLKPING, so > that you can see if the NAS is reachable. Then SCSI emulation can map > the "test unit ready" command to BLKPING. There's a handful of such > ioctls that would be useful, such as BLKDISCARD itself. Can't you make use of the existing disk events mechanism for that? Block layer already knows how to watch readiness of a device and tell the userland about it via uevent. Hooking to that shouldn't be too difficult and I think probably is the right approach given that all hotplug userland hotplug operations go through the same channel. If you absoluately has to test it from userland, READ on the first sector? That essentially is what libata does for START_STOP although it uses VERIFY instead of READ. Given how partition code behaves, any device which fails on READ on block0 isn't gonna work well with linux anyway. Thanks. -- tejun ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 19:13 ` Tejun Heo @ 2012-09-11 19:24 ` Paolo Bonzini 2012-09-11 20:01 ` Tejun Heo 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-11 19:24 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley Il 11/09/2012 21:13, Tejun Heo ha scritto: > Hello, Paolo. > > On Tue, Sep 11, 2012 at 08:54:03PM +0200, Paolo Bonzini wrote: >>> On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote: >>>> Understood; unfortunately, there is another major user of it >>>> (virtualization). If you are passing "raw" LUNs down to a virtual >>>> machine, there's no possibility at all to use a properly encapsulated >>> >>> Is there still command filtering issue when you're passing "raw" LUNs >>> down? >> >> Yes, the passing down is just a userland program that gets SCSI >> commands from the guest, sends them via SG_IO, and passes back the >> result. If the userland program is unprivileged (it usually is), then >> you go through the filter. > > Could being able to bypass the filters for this "you own this LUN" be > a solution? Or is it that we still need command filtering for > whatever reason? Yes, it could be. Enabling/disabling the filters from a privileged program and passing the unfiltered fd via SCM_RIGHTS would be enough. >> This is the userland for virtio-scsi (the kernel part of virtio-scsi is just >> a driver running in the guest). It can run in two mode: it can do its own >> SCSI emulation, or it can just relay CDBs and their results. >> >> It can (and does) use higher-level services if SCSI emulation is done in >> userland. In that case, trim/discard can become a BLKDISCARD or a fallocate >> for example. However, in this case userland doesn't do any emulation and in >> fact doesn't even need to know that this CDB is a discard. > > Couldn't it intercept some of them - e.g. RWs and discards? > What's the benifit / use case of doing pure bypass? Basically, using the same storage technology for bare metal and virtualized systems. IMHO losing sense data is a no-no, but the above solution could be feasible too. > Would the benefits be strong enough to justify whole bpf cdb filtering? If we can get a simpler solution that is okay with kernel maintainers, I'm all for it. >>> Hmmm? This was about discard, no? >> >> One example of block layer interfaces that I want to add is BLKPING, so >> that you can see if the NAS is reachable. Then SCSI emulation can map >> the "test unit ready" command to BLKPING. There's a handful of such >> ioctls that would be useful, such as BLKDISCARD itself. > > Can't you make use of the existing disk events mechanism for that? > Block layer already knows how to watch readiness of a device and tell > the userland about it via uevent. How? But anyway i don't want to divert the discussion from the actual topic... Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 19:24 ` Paolo Bonzini @ 2012-09-11 20:01 ` Tejun Heo 2012-09-11 21:50 ` Paolo Bonzini 0 siblings, 1 reply; 24+ messages in thread From: Tejun Heo @ 2012-09-11 20:01 UTC (permalink / raw) To: Paolo Bonzini Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley, Kay Sievers Hello, Paolo. On Tue, Sep 11, 2012 at 09:24:32PM +0200, Paolo Bonzini wrote: > > Couldn't it intercept some of them - e.g. RWs and discards? > > What's the benifit / use case of doing pure bypass? > > Basically, using the same storage technology for bare metal and > virtualized systems. IMHO losing sense data is a no-no, but the above > solution could be feasible too. Most of my experience with storage devices comes from SATA where error data is more of the deal "take some hints if you really want but there's pretty good chance it's completely bogus", so my perception about the importance of sense data might not match the fancy SCSI land. I don't know. Either way, with or without virtualization, making detailed error information to userland is a valid goal. I *think* we're finally getting there after years of talking via structured printk. I don't know much about the details but heard about exposing sense data via printk. cc'ing Kay. Kay, could that be useful for virtualization use cases too? They want to obtain the sense data after command failure. I suppose there would be some challenge in matching actions and error logs tho and it could be too clunky to use this way. > > Can't you make use of the existing disk events mechanism for that? > > Block layer already knows how to watch readiness of a device and tell > > the userland about it via uevent. > > How? But anyway i don't want to divert the discussion from the actual > topic... Disk events mechanism is there to watch (either via async notification or polling) media change and device readiness and generates the usual uevents when it detects them. For sd devices, it basically issues TUR periodically, so it's already doing pretty much what you need. I guess the repeating question is whether to solve the problem within the framework the underlying OS is providing or having direct access to the raw hardware. I don't know the answer. Accessing the "raw" hardware does have its advantages but managing multiple users so that they don't step on each other's foot is one of the main reasons we have this kernel thing after all, so there's natural tension between "wanting raw" and "multiuser security/whatever concerns". Beyond certain point, I think it essentially becomes wanting the cake and eating it too. I personally hope "raw" to be strictly confined to specific areas where performance impact of having kernel inbetween is prohibitive but that's just me hoping. Thanks. -- tejun ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 20:01 ` Tejun Heo @ 2012-09-11 21:50 ` Paolo Bonzini 2012-09-11 22:02 ` Tejun Heo 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-11 21:50 UTC (permalink / raw) To: Tejun Heo Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley, Kay Sievers, Al Viro [Al: you can jump down to "One problem:"] Il 11/09/2012 22:01, Tejun Heo ha scritto: > Hello, Paolo. > > On Tue, Sep 11, 2012 at 09:24:32PM +0200, Paolo Bonzini wrote: >>> Couldn't it intercept some of them - e.g. RWs and discards? >>> What's the benifit / use case of doing pure bypass? >> >> Basically, using the same storage technology for bare metal and >> virtualized systems. IMHO losing sense data is a no-no, but the above >> solution could be feasible too. > > Either way, with or without virtualization, making detailed error > information to userland is a valid goal. I *think* we're finally > getting there after years of talking via structured printk. I don't > know much about the details but heard about exposing sense data via > printk. Wait wait, there is already a perfectly 1:1 solution for this, and it's SG_IO. I think error processing falls roughly in two categories: "I need each command's precise state" and "I need to know if/when something bad happens". Luckily, I/O also falls roughly in the same two categories: "I need precise control of each commands" and "I just care of getting this to disk". The former can use SG_IO, the latter can use logs. So, let's not complicate the problem further. We have a perfectly sane API that (with different names) is even provided by almost every operating system in existence. There's just this little detail of filtering that is done for unprivileged processes; I hoped to fix 50% of the problem with this 3-line patch but it's not the end of the world if it's rejected constructively. The solution I outlined in my previous email: >> Enabling/disabling the filters from a privileged >> program and passing the unfiltered fd via SCM_RIGHTS would be enough. would entail some userland coding, but nothing paramount at all (and closer to my usual territory :)). And we would have to do it anyway for the reservations case. Basically it would be a ioctl(fd, SG_SET_FILTER_ENABLED, arg) where arg can be: -1 for "enable/disable based on CAP_SYS_RAWIO" (default) 0 for always enable filter 1 for always disable filter And also a dual ioctl(fd, SG_GET_FILTER_ENABLED, arg). One problem: to do this, I need to access some "struct file" member in SG_IO, and thus change the ioctl member from block_device/fmode to block_device/file. This would partially undo the 2007 switch from inode/file by Al Viro. He was already asked about it in https://lkml.org/lkml/2012/6/12/414, let's try again here. >>> Can't you make use of the existing disk events mechanism for that? >>> Block layer already knows how to watch readiness of a device and tell >>> the userland about it via uevent. >> >> How? But anyway i don't want to divert the discussion from the actual >> topic... > > Disk events mechanism is there to watch (either via async notification > or polling) media change and device readiness and generates the usual > uevents when it detects them. For sd devices, it basically issues TUR > periodically, so it's already doing pretty much what you need. Ah, no, we can't do that because the device should be opened with O_EXCL. It is not right now, but it's a bug. It's not very different from burning a CD (in fact, it's absolutely the same if you burn a CD inside a guest :)). > I guess the repeating question is whether to solve the problem within > the framework the underlying OS is providing or having direct access > to the raw hardware. I don't know the answer. > > Accessing the "raw" hardware does have its advantages but managing > multiple users In this case, the constraints pretty much guarantee that you have only one user. To stick with everyday hardware, if you pass your CD drive to a guest you can well expect that the host will not be able to use it. Or, if you have more than one user, that they know what they are doing (reservations, etc.). > I personally hope "raw" to be strictly confined to specific areas > where performance impact of having kernel inbetween is prohibitive but > that's just me hoping. Well, it's not just about performance but also about precision sometimes. Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 21:50 ` Paolo Bonzini @ 2012-09-11 22:02 ` Tejun Heo 2012-09-11 22:10 ` Paolo Bonzini 0 siblings, 1 reply; 24+ messages in thread From: Tejun Heo @ 2012-09-11 22:02 UTC (permalink / raw) To: Paolo Bonzini Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley, Kay Sievers, Al Viro Hello, Paolo. On Tue, Sep 11, 2012 at 11:50:38PM +0200, Paolo Bonzini wrote: > > Either way, with or without virtualization, making detailed error > > information to userland is a valid goal. I *think* we're finally > > getting there after years of talking via structured printk. I don't > > know much about the details but heard about exposing sense data via > > printk. > > Wait wait, there is already a perfectly 1:1 solution for this, and it's > SG_IO. > > I think error processing falls roughly in two categories: "I need each > command's precise state" and "I need to know if/when something bad > happens". Luckily, I/O also falls roughly in the same two categories: > "I need precise control of each commands" and "I just care of getting > this to disk". The former can use SG_IO, the latter can use logs. SG_IO itself is a bypassing interface. It bypasses most of block layer and the kernel doesn't have any idea (apart from the adhoc filtering) about what's going on. The problem can be approached from both directions (make use of OS IO layer improving it as needed or add more intelligence to the bypass thing) and I'm not sure at all adding more capability to the adhoc filtering is the better direction. So, what I wanna say is that if you can get by with the adhoc thing we already have in place for cd/dvd, that's great; otherwise, it's not clear at all whether expanding that adhoc filtering is a good idea. Thanks. -- tejun ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 22:02 ` Tejun Heo @ 2012-09-11 22:10 ` Paolo Bonzini 2012-09-11 22:13 ` Tejun Heo 0 siblings, 1 reply; 24+ messages in thread From: Paolo Bonzini @ 2012-09-11 22:10 UTC (permalink / raw) To: Tejun Heo Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley, Kay Sievers, Al Viro Il 12/09/2012 00:02, Tejun Heo ha scritto: > SG_IO itself is a bypassing interface. It bypasses most of block > layer and the kernel doesn't have any idea (apart from the adhoc > filtering) about what's going on. That's very much the point. The guest must have free reins. You asked "Could being able to bypass the filters for this "you own this LUN" be a solution?", I said yes and outlined how. Do you agree with the proposed solution? > The problem can be approached from > both directions (make use of OS IO layer improving it as needed or add > more intelligence to the bypass thing) Or remove intelligence if it gets in the way. Of course the removal must be done by an appropriately-privileged program. > and I'm not sure at all adding > more capability to the adhoc filtering is the better direction. Sure, I'm fine with leaving the current ad hoc filtering aside. Again, I was hoping to get most of the job done by loosening the filter a bit, but discussion is inversely proportional to patch length sometimes. Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 22:10 ` Paolo Bonzini @ 2012-09-11 22:13 ` Tejun Heo 0 siblings, 0 replies; 24+ messages in thread From: Tejun Heo @ 2012-09-11 22:13 UTC (permalink / raw) To: Paolo Bonzini Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley, Kay Sievers, Al Viro Hello, On Tue, Sep 11, 2012 at 3:10 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: > Il 12/09/2012 00:02, Tejun Heo ha scritto: >> SG_IO itself is a bypassing interface. It bypasses most of block >> layer and the kernel doesn't have any idea (apart from the adhoc >> filtering) about what's going on. > > That's very much the point. The guest must have free reins. > > You asked "Could being able to bypass the filters for this "you own this > LUN" be a solution?", I said yes and outlined how. Do you agree with > the proposed solution? Ooh, yeah, I like that one. I was still thinking about the bpf one. It being based on cgroup made me worried even more. :) > Sure, I'm fine with leaving the current ad hoc filtering aside. Again, > I was hoping to get most of the job done by loosening the filter a bit, > but discussion is inversely proportional to patch length sometimes. Sorry about dragging it on. The cdb filtering one has always been controversial and I think it'll continue to be. It's an unfortunate thing that we had to add. Thanks. -- tejun ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-11 17:56 ` Paolo Bonzini 2012-09-11 18:29 ` Tejun Heo @ 2012-09-12 8:05 ` James Bottomley 2012-09-12 8:18 ` Paolo Bonzini 1 sibling, 1 reply; 24+ messages in thread From: James Bottomley @ 2012-09-12 8:05 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Tejun Heo, linux-kernel, axboe, linux-scsi On Tue, 2012-09-11 at 19:56 +0200, Paolo Bonzini wrote: > The set of use cases is so variable that no single filter can accomodate > all of them: high availability people want persistent reservations, NAS > people want trim/discard, but these are just two groups. Someone is > using a Windows VM to run vendor tools and wants to have access to > vendor-specific commands. > > You can tell this last group to use root, but not everyone else who is > already relying on Unix permissions, SELinux and/or device cgroups to > confine their virtual machines. This is why the whole filter thing was mutable via sysfs. That way the admin could set this up per device. It sounds like this is what you want to fix, rather than opening up more holes in an already leaky security apparatus. The ideal is that we would be much more restrictive by default and give root the ability to override this both globally and per-device to conform to whatever policy it has for the virtual environments. The patch which removed all of the sysfs pieces was: commit 018e0446890661504783f92388ecce7138c1566d Author: Jens Axboe <jens.axboe@oracle.com> Date: Fri Jun 26 16:27:10 2009 +0200 block: get rid of queue-private command filter So that's probably the place to start for putting it back properly. James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO 2012-09-12 8:05 ` James Bottomley @ 2012-09-12 8:18 ` Paolo Bonzini 0 siblings, 0 replies; 24+ messages in thread From: Paolo Bonzini @ 2012-09-12 8:18 UTC (permalink / raw) To: James Bottomley; +Cc: Tejun Heo, linux-kernel, axboe, linux-scsi Il 12/09/2012 10:05, James Bottomley ha scritto: > This is why the whole filter thing was mutable via sysfs. That way the > admin could set this up per device. It sounds like this is what you > want to fix, rather than opening up more holes in an already leaky > security apparatus. It is, thanks for the useful history lesson! Paolo ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2012-09-12 8:18 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-07-20 16:30 [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO Paolo Bonzini 2012-08-01 15:53 ` Paolo Bonzini 2012-08-28 11:04 ` Paolo Bonzini 2012-09-05 14:41 ` [Ping^3] " Paolo Bonzini 2012-09-05 20:18 ` Ric Wheeler 2012-09-06 6:31 ` Paolo Bonzini 2012-09-06 11:31 ` Ric Wheeler 2012-09-06 11:49 ` Paolo Bonzini 2012-09-06 12:08 ` Ric Wheeler 2012-09-06 12:36 ` Paolo Bonzini 2012-09-06 14:20 ` Lukáš Czerner 2012-09-11 16:59 ` Tejun Heo 2012-09-11 17:56 ` Paolo Bonzini 2012-09-11 18:29 ` Tejun Heo 2012-09-11 18:54 ` Paolo Bonzini 2012-09-11 19:13 ` Tejun Heo 2012-09-11 19:24 ` Paolo Bonzini 2012-09-11 20:01 ` Tejun Heo 2012-09-11 21:50 ` Paolo Bonzini 2012-09-11 22:02 ` Tejun Heo 2012-09-11 22:10 ` Paolo Bonzini 2012-09-11 22:13 ` Tejun Heo 2012-09-12 8:05 ` James Bottomley 2012-09-12 8:18 ` Paolo Bonzini
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).