linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
@ 2012-07-20 16:30 Paolo Bonzini
  2012-08-01 15:53 ` Paolo Bonzini
  2012-09-11 16:59 ` Tejun Heo
  0 siblings, 2 replies; 24+ messages in thread
From: Paolo Bonzini @ 2012-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, linux-scsi

These commands cannot be issued right now without giving CAP_SYS_RAWIO to
the process who wishes to send them.  These commands can be useful also to
non-privileged programs who have access to the block devices.  For example
a virtual machine monitor needs them to forward trim/discard to host disks.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/scsi_ioctl.c |    3 ++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 260fa80..dd71f18 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter)
 	/* Basic writing commands */
 	__set_bit(WRITE_6, filter->write_ok);
 	__set_bit(WRITE_10, filter->write_ok);
+	__set_bit(WRITE_SAME, filter->write_ok);
 	__set_bit(WRITE_VERIFY, filter->write_ok);
 	__set_bit(WRITE_12, filter->write_ok);
 	__set_bit(WRITE_VERIFY_12, filter->write_ok);
 	__set_bit(WRITE_16, filter->write_ok);
+	__set_bit(WRITE_SAME_16, filter->write_ok);
 	__set_bit(WRITE_LONG, filter->write_ok);
 	__set_bit(WRITE_LONG_2, filter->write_ok);
 	__set_bit(ERASE, filter->write_ok);
+	__set_bit(UNMAP, filter->write_ok);
 	__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
 	__set_bit(MODE_SELECT, filter->write_ok);
 	__set_bit(LOG_SELECT, filter->write_ok);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-07-20 16:30 [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO Paolo Bonzini
@ 2012-08-01 15:53 ` Paolo Bonzini
  2012-08-28 11:04   ` Paolo Bonzini
  2012-09-11 16:59 ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-08-01 15:53 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, linux-scsi

Il 20/07/2012 18:30, Paolo Bonzini ha scritto:
> These commands cannot be issued right now without giving CAP_SYS_RAWIO to
> the process who wishes to send them.  These commands can be useful also to
> non-privileged programs who have access to the block devices.  For example
> a virtual machine monitor needs them to forward trim/discard to host disks.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/scsi_ioctl.c |    3 ++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index 260fa80..dd71f18 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter)
>  	/* Basic writing commands */
>  	__set_bit(WRITE_6, filter->write_ok);
>  	__set_bit(WRITE_10, filter->write_ok);
> +	__set_bit(WRITE_SAME, filter->write_ok);
>  	__set_bit(WRITE_VERIFY, filter->write_ok);
>  	__set_bit(WRITE_12, filter->write_ok);
>  	__set_bit(WRITE_VERIFY_12, filter->write_ok);
>  	__set_bit(WRITE_16, filter->write_ok);
> +	__set_bit(WRITE_SAME_16, filter->write_ok);
>  	__set_bit(WRITE_LONG, filter->write_ok);
>  	__set_bit(WRITE_LONG_2, filter->write_ok);
>  	__set_bit(ERASE, filter->write_ok);
> +	__set_bit(UNMAP, filter->write_ok);
>  	__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
>  	__set_bit(MODE_SELECT, filter->write_ok);
>  	__set_bit(LOG_SELECT, filter->write_ok);
> 

Jens,

can this go in 3.6 as well?

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-08-01 15:53 ` Paolo Bonzini
@ 2012-08-28 11:04   ` Paolo Bonzini
  2012-09-05 14:41     ` [Ping^3] " Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-08-28 11:04 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, linux-scsi

Il 01/08/2012 17:53, Paolo Bonzini ha scritto:
> Il 20/07/2012 18:30, Paolo Bonzini ha scritto:
>> These commands cannot be issued right now without giving CAP_SYS_RAWIO to
>> the process who wishes to send them.  These commands can be useful also to
>> non-privileged programs who have access to the block devices.  For example
>> a virtual machine monitor needs them to forward trim/discard to host disks.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  block/scsi_ioctl.c |    3 ++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
>> index 260fa80..dd71f18 100644
>> --- a/block/scsi_ioctl.c
>> +++ b/block/scsi_ioctl.c
>> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter)
>>  	/* Basic writing commands */
>>  	__set_bit(WRITE_6, filter->write_ok);
>>  	__set_bit(WRITE_10, filter->write_ok);
>> +	__set_bit(WRITE_SAME, filter->write_ok);
>>  	__set_bit(WRITE_VERIFY, filter->write_ok);
>>  	__set_bit(WRITE_12, filter->write_ok);
>>  	__set_bit(WRITE_VERIFY_12, filter->write_ok);
>>  	__set_bit(WRITE_16, filter->write_ok);
>> +	__set_bit(WRITE_SAME_16, filter->write_ok);
>>  	__set_bit(WRITE_LONG, filter->write_ok);
>>  	__set_bit(WRITE_LONG_2, filter->write_ok);
>>  	__set_bit(ERASE, filter->write_ok);
>> +	__set_bit(UNMAP, filter->write_ok);
>>  	__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
>>  	__set_bit(MODE_SELECT, filter->write_ok);
>>  	__set_bit(LOG_SELECT, filter->write_ok);
>>
> 
> Jens,
> 
> can this go in 3.6 as well?

Another ping...

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-08-28 11:04   ` Paolo Bonzini
@ 2012-09-05 14:41     ` Paolo Bonzini
  2012-09-05 20:18       ` Ric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-05 14:41 UTC (permalink / raw)
  To: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen
  Cc: linux-kernel, linux-scsi

Il 28/08/2012 13:04, Paolo Bonzini ha scritto:
> Il 01/08/2012 17:53, Paolo Bonzini ha scritto:
>> Il 20/07/2012 18:30, Paolo Bonzini ha scritto:
>>> These commands cannot be issued right now without giving CAP_SYS_RAWIO to
>>> the process who wishes to send them.  These commands can be useful also to
>>> non-privileged programs who have access to the block devices.  For example
>>> a virtual machine monitor needs them to forward trim/discard to host disks.
>>>
>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>> ---
>>>  block/scsi_ioctl.c |    3 ++
>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
>>> index 260fa80..dd71f18 100644
>>> --- a/block/scsi_ioctl.c
>>> +++ b/block/scsi_ioctl.c
>>> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter)
>>>  	/* Basic writing commands */
>>>  	__set_bit(WRITE_6, filter->write_ok);
>>>  	__set_bit(WRITE_10, filter->write_ok);
>>> +	__set_bit(WRITE_SAME, filter->write_ok);
>>>  	__set_bit(WRITE_VERIFY, filter->write_ok);
>>>  	__set_bit(WRITE_12, filter->write_ok);
>>>  	__set_bit(WRITE_VERIFY_12, filter->write_ok);
>>>  	__set_bit(WRITE_16, filter->write_ok);
>>> +	__set_bit(WRITE_SAME_16, filter->write_ok);
>>>  	__set_bit(WRITE_LONG, filter->write_ok);
>>>  	__set_bit(WRITE_LONG_2, filter->write_ok);
>>>  	__set_bit(ERASE, filter->write_ok);
>>> +	__set_bit(UNMAP, filter->write_ok);
>>>  	__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
>>>  	__set_bit(MODE_SELECT, filter->write_ok);
>>>  	__set_bit(LOG_SELECT, filter->write_ok);
>>>
>>
>> Jens,
>>
>> can this go in 3.6 as well?
> 
> Another ping...

Ping & adding some more folks hoping to get a Reviewed-by or to be
screamed at.

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-05 14:41     ` [Ping^3] " Paolo Bonzini
@ 2012-09-05 20:18       ` Ric Wheeler
  2012-09-06  6:31         ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2012-09-05 20:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel,
	linux-scsi

On 09/05/2012 10:41 AM, Paolo Bonzini wrote:
> Il 28/08/2012 13:04, Paolo Bonzini ha scritto:
>> Il 01/08/2012 17:53, Paolo Bonzini ha scritto:
>>> Il 20/07/2012 18:30, Paolo Bonzini ha scritto:
>>>> These commands cannot be issued right now without giving CAP_SYS_RAWIO to
>>>> the process who wishes to send them.  These commands can be useful also to
>>>> non-privileged programs who have access to the block devices.  For example
>>>> a virtual machine monitor needs them to forward trim/discard to host disks.
>>>>
>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>> ---
>>>>   block/scsi_ioctl.c |    3 ++
>>>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
>>>> index 260fa80..dd71f18 100644
>>>> --- a/block/scsi_ioctl.c
>>>> +++ b/block/scsi_ioctl.c
>>>> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter)
>>>>   	/* Basic writing commands */
>>>>   	__set_bit(WRITE_6, filter->write_ok);
>>>>   	__set_bit(WRITE_10, filter->write_ok);
>>>> +	__set_bit(WRITE_SAME, filter->write_ok);
>>>>   	__set_bit(WRITE_VERIFY, filter->write_ok);
>>>>   	__set_bit(WRITE_12, filter->write_ok);
>>>>   	__set_bit(WRITE_VERIFY_12, filter->write_ok);
>>>>   	__set_bit(WRITE_16, filter->write_ok);
>>>> +	__set_bit(WRITE_SAME_16, filter->write_ok);
>>>>   	__set_bit(WRITE_LONG, filter->write_ok);
>>>>   	__set_bit(WRITE_LONG_2, filter->write_ok);
>>>>   	__set_bit(ERASE, filter->write_ok);
>>>> +	__set_bit(UNMAP, filter->write_ok);
>>>>   	__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
>>>>   	__set_bit(MODE_SELECT, filter->write_ok);
>>>>   	__set_bit(LOG_SELECT, filter->write_ok);
>>>>
>>> Jens,
>>>
>>> can this go in 3.6 as well?
>> Another ping...
> Ping & adding some more folks hoping to get a Reviewed-by or to be
> screamed at.
>
> Paolo

Hi Paolo,

Both of these commands are destructive. WRITE_SAME (if done without the discard 
bits set) can also take a very long time to be destructive and tie up the storage.

I think that restricting them to CAP_SYS_RAWIO seems reasonable - better to vet 
and give the appropriate apps the needed capability than to widely open up the 
safety check?

thanks!

Ric



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-05 20:18       ` Ric Wheeler
@ 2012-09-06  6:31         ` Paolo Bonzini
  2012-09-06 11:31           ` Ric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-06  6:31 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel,
	linux-scsi

Il 05/09/2012 22:18, Ric Wheeler ha scritto:
>>
> 
> Hi Paolo,
> 
> Both of these commands are destructive. WRITE_SAME (if done without the
> discard bits set) can also take a very long time to be destructive and
> tie up the storage.

FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I
don't think WRITE SAME slowness is limited to the case where a real
write is requested; discarding can be just as slow).

Also, the two new commands are anyway restricted to programs that have
write access to the disk.  If you have read-only access, you won't be
able to issue any destructive command (there is one exception, START
STOP UNIT is allowed even with read-only capability and is somewhat
destructive).

Honestly, the only reason why these two commands weren't included, is
that the current whitelist is heavily tailored towards CD/DVD burning.

> I think that restricting them to CAP_SYS_RAWIO seems reasonable - better
> to vet and give the appropriate apps the needed capability than to
> widely open up the safety check?

CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is
insecure.

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-06  6:31         ` Paolo Bonzini
@ 2012-09-06 11:31           ` Ric Wheeler
  2012-09-06 11:49             ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2012-09-06 11:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel,
	linux-scsi

On 09/06/2012 02:31 AM, Paolo Bonzini wrote:
> Il 05/09/2012 22:18, Ric Wheeler ha scritto:
>> Hi Paolo,
>>
>> Both of these commands are destructive. WRITE_SAME (if done without the
>> discard bits set) can also take a very long time to be destructive and
>> tie up the storage.
> FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I
> don't think WRITE SAME slowness is limited to the case where a real
> write is requested; discarding can be just as slow).
>
> Also, the two new commands are anyway restricted to programs that have
> write access to the disk.  If you have read-only access, you won't be
> able to issue any destructive command (there is one exception, START
> STOP UNIT is allowed even with read-only capability and is somewhat
> destructive).
>
> Honestly, the only reason why these two commands weren't included, is
> that the current whitelist is heavily tailored towards CD/DVD burning.

Hi Paolo,

I assume that FORMAT_UNIT is for CD/DVD needs - not sure what a S-ATA disk would 
do with that. If it is destructive, we should probably think about how to make 
it more secure and see how many applications we would break.

>
>> I think that restricting them to CAP_SYS_RAWIO seems reasonable - better
>> to vet and give the appropriate apps the needed capability than to
>> widely open up the safety check?
> CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is
> insecure.
>
> Paolo

I don't see allowing anyone who can open the device to zero the data as better 
though :)

Ric




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-06 11:31           ` Ric Wheeler
@ 2012-09-06 11:49             ` Paolo Bonzini
  2012-09-06 12:08               ` Ric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-06 11:49 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel,
	linux-scsi

Il 06/09/2012 13:31, Ric Wheeler ha scritto:
>>> Both of these commands are destructive. WRITE_SAME (if done without the
>>> discard bits set) can also take a very long time to be destructive and
>>> tie up the storage.
>>
>> FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I
>> don't think WRITE SAME slowness is limited to the case where a real
>> write is requested; discarding can be just as slow).
>>
>> Also, the two new commands are anyway restricted to programs that have
>> write access to the disk.  If you have read-only access, you won't be
>> able to issue any destructive command (there is one exception, START
>> STOP UNIT is allowed even with read-only capability and is somewhat
>> destructive).
>>
>> Honestly, the only reason why these two commands weren't included, is
>> that the current whitelist is heavily tailored towards CD/DVD burning.
> 
> I assume that FORMAT_UNIT is for CD/DVD needs - not sure what a S-ATA
> disk would do with that.

According to the standard, the translation layer can write a
user-provided pattern to every sector in the disk.  It's an optional
feature and libata doesn't do that, but it is still possible.

> If it is destructive, we should probably think
> about how to make it more secure and see how many applications we would
> break.

We have filesystem permissions to make it secure.  They already do.

>>> I think that restricting them to CAP_SYS_RAWIO seems reasonable - better
>>> to vet and give the appropriate apps the needed capability than to
>>> widely open up the safety check?
>> CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is
>> insecure.
> 
> I don't see allowing anyone who can open the device to zero the data as
> better though :)

Note: anyone who can open it for writing!  And they can just as well
issue WRITE, it just takes a little more effort than with WRITE SAME. :)
 If you only have read access, you cannot issue WRITE or FORMAT UNIT,
and with this patch you will not be able to issue WRITE SAME.

I'm all for providing more versatile filters---which can be both
stricter and looser depending on the configuration than the default.
For example
http://www.redhat.com/archives/libvir-list/2012-June/msg00505.html is a
possible spec for BPF-based filtering of CDBs.

However, the default whitelist (which is all we have for now) should
provide a reasonable default for a user that already has been granted
access to the device by the normal access control mechanisms.  I believe
WRITE SAME and UNMAP fit the definition of a reasonable default.

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-06 11:49             ` Paolo Bonzini
@ 2012-09-06 12:08               ` Ric Wheeler
  2012-09-06 12:36                 ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2012-09-06 12:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel,
	linux-scsi

On 09/06/2012 07:49 AM, Paolo Bonzini wrote:
> Il 06/09/2012 13:31, Ric Wheeler ha scritto:
>>>> Both of these commands are destructive. WRITE_SAME (if done without the
>>>> discard bits set) can also take a very long time to be destructive and
>>>> tie up the storage.
>>> FORMAT_UNIT has the same characteristics and yet it is allowed (btw, I
>>> don't think WRITE SAME slowness is limited to the case where a real
>>> write is requested; discarding can be just as slow).
>>>
>>> Also, the two new commands are anyway restricted to programs that have
>>> write access to the disk.  If you have read-only access, you won't be
>>> able to issue any destructive command (there is one exception, START
>>> STOP UNIT is allowed even with read-only capability and is somewhat
>>> destructive).
>>>
>>> Honestly, the only reason why these two commands weren't included, is
>>> that the current whitelist is heavily tailored towards CD/DVD burning.
>> I assume that FORMAT_UNIT is for CD/DVD needs - not sure what a S-ATA
>> disk would do with that.
> According to the standard, the translation layer can write a
> user-provided pattern to every sector in the disk.  It's an optional
> feature and libata doesn't do that, but it is still possible.

It is not possible today with our stack though, any patch that would change that 
would also need to be vetted.

>
>> If it is destructive, we should probably think
>> about how to make it more secure and see how many applications we would
>> break.
> We have filesystem permissions to make it secure.  They already do.
>
>>>> I think that restricting them to CAP_SYS_RAWIO seems reasonable - better
>>>> to vet and give the appropriate apps the needed capability than to
>>>> widely open up the safety check?
>>> CAP_SYS_RAWIO is so wide in its scope, that anything that requires it is
>>> insecure.
>> I don't see allowing anyone who can open the device to zero the data as
>> better though :)
> Note: anyone who can open it for writing!  And they can just as well
> issue WRITE, it just takes a little more effort than with WRITE SAME. :)
>   If you only have read access, you cannot issue WRITE or FORMAT UNIT,
> and with this patch you will not be able to issue WRITE SAME.
>
> I'm all for providing more versatile filters---which can be both
> stricter and looser depending on the configuration than the default.
> For example
> http://www.redhat.com/archives/libvir-list/2012-June/msg00505.html is a
> possible spec for BPF-based filtering of CDBs.
>
> However, the default whitelist (which is all we have for now) should
> provide a reasonable default for a user that already has been granted
> access to the device by the normal access control mechanisms.  I believe
> WRITE SAME and UNMAP fit the definition of a reasonable default.
>
> Paolo

This just seems like an argument over whether or not capabilities make sense. In 
general, anything as destructive as a single CDB that can kill all of your data 
should be tightly controlled.

Pushing more code in the data path is not where we are going - we routinely need 
to disable IO scheduling for example when driving IO to high speed/low latency 
devices and are actively looking at how to tackle other performance bottlenecks 
in the stack.

I don't see a strong reason that our existing scheme (root or CAP_SYS_RAWIO 
access) prevents you from doing what you need to do.

Ric


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-06 12:08               ` Ric Wheeler
@ 2012-09-06 12:36                 ` Paolo Bonzini
  2012-09-06 14:20                   ` Lukáš Czerner
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-06 12:36 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: axboe, Mike Snitzer, Alan Cox, Martin K. Petersen, linux-kernel,
	linux-scsi

Il 06/09/2012 14:08, Ric Wheeler ha scritto:
>> According to the standard, the translation layer can write a
>> user-provided pattern to every sector in the disk.  It's an optional
>> feature and libata doesn't do that, but it is still possible.
> 
> It is not possible today with our stack though, any patch that would
> change that would also need to be vetted.

It is not possible with SATA disks, but native SCSI disks might well
interpret FORMAT UNIT destructively.

>>> I don't see allowing anyone who can open the device to zero the data as
>>> better though :)
>> Note: anyone who can open it for writing!  And they can just as well
>> issue WRITE, it just takes a little more effort than with WRITE SAME. :)
>>   If you only have read access, you cannot issue WRITE or FORMAT UNIT,
>> and with this patch you will not be able to issue WRITE SAME.
> 
> This just seems like an argument over whether or not capabilities make
> sense. In general, anything as destructive as a single CDB that can kill
> all of your data should be tightly controlled.

In practice, a single write to the first MB of the disk is just as
destructive.  For that you do not even need a SCSI command.

> Pushing more code in the data path is not where we are going - we
> routinely need to disable IO scheduling for example when driving IO to
> high speed/low latency devices and are actively looking at how to tackle
> other performance bottlenecks in the stack.

I am not talking about the regular data path, only of SG_IO.

> I don't see a strong reason that our existing scheme (root or
> CAP_SYS_RAWIO access) prevents you from doing what you need to do.

Here are three:

- CAP_SYS_RAWIO partly bypasses DAC; you can issue destructive commands
even if you only opened the disk for reading.  CAP_SYS_RAWIO also gives
access to _really_ destructive commands (WRITE BUFFER and PERSISTENT
RESERVE OUT for example).

- CAP_SYS_RAWIO lets you send SCSI commands to partitions, and they will
gladly read/write the disk going outside the boundaries of the
partition.  Changing this behavior was rejected upstream already.

- CAP_SYS_RAWIO also gives access to I/O ports, mmap at address 0, and
too many other insecure things.

All the above mean that:

- any application using CAP_SYS_RAWIO would have to implement its own
whitelisting, even if just to duplicate what is done in the kernel;

- exploiting a CAP_SYS_RAWIO process leads to root too easily, and it is
not possible to give the capability to anything that will run in a
hostile environment (in my case QEMU).

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-06 12:36                 ` Paolo Bonzini
@ 2012-09-06 14:20                   ` Lukáš Czerner
  0 siblings, 0 replies; 24+ messages in thread
From: Lukáš Czerner @ 2012-09-06 14:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ric Wheeler, axboe, Mike Snitzer, Alan Cox, Martin K. Petersen,
	linux-kernel, linux-scsi

On Thu, 6 Sep 2012, Paolo Bonzini wrote:

> Date: Thu, 06 Sep 2012 14:36:53 +0200
> From: Paolo Bonzini <pbonzini@redhat.com>
> To: Ric Wheeler <ricwheeler@gmail.com>
> Cc: axboe@kernel.dk, Mike Snitzer <snitzer@redhat.com>,
>     Alan Cox <alan@lxorguk.ukuu.org.uk>,
>     Martin K. Petersen <martin.petersen@oracle.com>,
>     linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org
> Subject: Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without
>     CAP_SYS_RAWIO
> 
> Il 06/09/2012 14:08, Ric Wheeler ha scritto:
> >> According to the standard, the translation layer can write a
> >> user-provided pattern to every sector in the disk.  It's an optional
> >> feature and libata doesn't do that, but it is still possible.
> > 
> > It is not possible today with our stack though, any patch that would
> > change that would also need to be vetted.
> 
> It is not possible with SATA disks, but native SCSI disks might well
> interpret FORMAT UNIT destructively.
> 
> >>> I don't see allowing anyone who can open the device to zero the data as
> >>> better though :)
> >> Note: anyone who can open it for writing!  And they can just as well
> >> issue WRITE, it just takes a little more effort than with WRITE SAME. :)
> >>   If you only have read access, you cannot issue WRITE or FORMAT UNIT,
> >> and with this patch you will not be able to issue WRITE SAME.
> > 
> > This just seems like an argument over whether or not capabilities make
> > sense. In general, anything as destructive as a single CDB that can kill
> > all of your data should be tightly controlled.
> 
> In practice, a single write to the first MB of the disk is just as
> destructive.  For that you do not even need a SCSI command.
> 
> > Pushing more code in the data path is not where we are going - we
> > routinely need to disable IO scheduling for example when driving IO to
> > high speed/low latency devices and are actively looking at how to tackle
> > other performance bottlenecks in the stack.
> 
> I am not talking about the regular data path, only of SG_IO.
> 
> > I don't see a strong reason that our existing scheme (root or
> > CAP_SYS_RAWIO access) prevents you from doing what you need to do.
> 
> Here are three:
> 
> - CAP_SYS_RAWIO partly bypasses DAC; you can issue destructive commands
> even if you only opened the disk for reading.  CAP_SYS_RAWIO also gives
> access to _really_ destructive commands (WRITE BUFFER and PERSISTENT
> RESERVE OUT for example).
> 
> - CAP_SYS_RAWIO lets you send SCSI commands to partitions, and they will
> gladly read/write the disk going outside the boundaries of the
> partition.  Changing this behavior was rejected upstream already.
> 
> - CAP_SYS_RAWIO also gives access to I/O ports, mmap at address 0, and
> too many other insecure things.
> 
> All the above mean that:
> 
> - any application using CAP_SYS_RAWIO would have to implement its own
> whitelisting, even if just to duplicate what is done in the kernel;
> 
> - exploiting a CAP_SYS_RAWIO process leads to root too easily, and it is
> not possible to give the capability to anything that will run in a
> hostile environment (in my case QEMU).

So at fist I did not think this is such a good idea however there
are several good points you've mentioned.

CAP_SYS_RAWIO is indeed too big hammer for this and it is not secure
to allow such application to possess such capability. Moreover
WRITE_SAME indeed is almost the SAME as WRITE :), only easier and
delegated to the storage itself.

UNMAP or WRITE_WAME w/ unamp bit is a little bit trickier but
thinking about it some more I do not see any real reason why the
user with write permission should not be able to use this. Yes, it
is not technically write and it has other consequences as well, but
none of it seems to be exploitable more than simple write.

Moreover looking at BLKDISCARD ioctl there is no such restrictions
(obviously) but neither is with TRIM ata command. So with sata SSD
you're allowed to use TRIM command if you have write permission to
that device. So I guess having this consistent is good idea and
considering the points above I think it is ok to allow WRITE_SAME
and UNMAP without CAP_SYS_RAWIO.

Thanks!
-Lukas

> 
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-07-20 16:30 [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO Paolo Bonzini
  2012-08-01 15:53 ` Paolo Bonzini
@ 2012-09-11 16:59 ` Tejun Heo
  2012-09-11 17:56   ` Paolo Bonzini
  1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2012-09-11 16:59 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley

Hello,

On Fri, Jul 20, 2012 at 06:30:01PM +0200, Paolo Bonzini wrote:
> These commands cannot be issued right now without giving CAP_SYS_RAWIO to
> the process who wishes to send them.  These commands can be useful also to
> non-privileged programs who have access to the block devices.  For example
> a virtual machine monitor needs them to forward trim/discard to host disks.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/scsi_ioctl.c |    3 ++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index 260fa80..dd71f18 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -168,13 +168,16 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter)
>  	/* Basic writing commands */
>  	__set_bit(WRITE_6, filter->write_ok);
>  	__set_bit(WRITE_10, filter->write_ok);
> +	__set_bit(WRITE_SAME, filter->write_ok);
>  	__set_bit(WRITE_VERIFY, filter->write_ok);
>  	__set_bit(WRITE_12, filter->write_ok);
>  	__set_bit(WRITE_VERIFY_12, filter->write_ok);
>  	__set_bit(WRITE_16, filter->write_ok);
> +	__set_bit(WRITE_SAME_16, filter->write_ok);
>  	__set_bit(WRITE_LONG, filter->write_ok);
>  	__set_bit(WRITE_LONG_2, filter->write_ok);
>  	__set_bit(ERASE, filter->write_ok);
> +	__set_bit(UNMAP, filter->write_ok);

FWIW, I don't think this is the right way to expose functionality
which needs management in terms of access control, interpretation
(stacking drivers) and serving concurrent users.  SG_IO filtering was
mostly for cd/dvd burning and other removeable devices.  Especially
for the former, the possibility of having a properly encapsulated
interface was long lost and I think significant part of that is how
the underlying technology and its hardware interface developed.  We
basically just gave up.  Fortunately, they're going the way of
floppies.

So, it wouldn't be a good idea to abuse SG_IO filtering for exposing
trim/discard.  It's something which should be retired or at least
severely restricted in time.  I don't think we want to be developing
new uses of it.

I think trim/discards are fairly easy to abstract and common enough to
justify having properly abstracted interface.  In fact, we already
have block layer interface for it - BLKDISCARD.  If it's lacking,
let's improve that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 16:59 ` Tejun Heo
@ 2012-09-11 17:56   ` Paolo Bonzini
  2012-09-11 18:29     ` Tejun Heo
  2012-09-12  8:05     ` James Bottomley
  0 siblings, 2 replies; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-11 17:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley

Il 11/09/2012 18:59, Tejun Heo ha scritto:
> FWIW, I don't think this is the right way to expose functionality
> which needs management in terms of access control, interpretation
> (stacking drivers) and serving concurrent users.  SG_IO filtering was
> mostly for cd/dvd burning and other removeable devices.

Understood; unfortunately, there is another major user of it
(virtualization).  If you are passing "raw" LUNs down to a virtual
machine, there's no possibility at all to use a properly encapsulated
interface and still be able to pass sense data etc. back to the virtual
machine.  And it's only going to grow now that people are starting to
use virtio-scsi.

The set of use cases is so variable that no single filter can accomodate
all of them: high availability people want persistent reservations, NAS
people want trim/discard, but these are just two groups.  Someone is
using a Windows VM to run vendor tools and wants to have access to
vendor-specific commands.

You can tell this last group to use root, but not everyone else who is
already relying on Unix permissions, SELinux and/or device cgroups to
confine their virtual machines.

A generic filter (see
http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal)
would be satisfactory for everyone, but it's also a major undertaking
and so far I've not received a single comment about it.

> So, it wouldn't be a good idea to abuse SG_IO filtering for exposing
> trim/discard.  It's something which should be retired or at least
> severely restricted in time.  I don't think we want to be developing
> new uses of it.
> 
> I think trim/discards are fairly easy to abstract and common enough to
> justify having properly abstracted interface.  In fact, we already
> have block layer interface for it - BLKDISCARD.  If it's lacking,
> let's improve that.

I do want to improve the block layer interfaces to avoid that people use
SG_IO.  But unfortunately this is for a completely different use case.

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 17:56   ` Paolo Bonzini
@ 2012-09-11 18:29     ` Tejun Heo
  2012-09-11 18:54       ` Paolo Bonzini
  2012-09-12  8:05     ` James Bottomley
  1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2012-09-11 18:29 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley

Hello, Paolo.

On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote:
> Understood; unfortunately, there is another major user of it
> (virtualization).  If you are passing "raw" LUNs down to a virtual
> machine, there's no possibility at all to use a properly encapsulated

Is there still command filtering issue when you're passing "raw" LUNs
down?

> interface and still be able to pass sense data etc. back to the virtual
> machine.  And it's only going to grow now that people are starting to
> use virtio-scsi.
> 
> The set of use cases is so variable that no single filter can accomodate
> all of them: high availability people want persistent reservations, NAS
> people want trim/discard, but these are just two groups.  Someone is
> using a Windows VM to run vendor tools and wants to have access to
> vendor-specific commands.
> 
> You can tell this last group to use root, but not everyone else who is
> already relying on Unix permissions, SELinux and/or device cgroups to
> confine their virtual machines.

You listed three - HA w/ persistent reservation, NAS w/ trim/discard
and the third which you said that using root would be fine.  Dunno
much about persistent reservation but I don't see why trim/discard
can't use existing block layer facilities whether from userland or
virtio-scsi?

> A generic filter (see
> http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal)
> would be satisfactory for everyone, but it's also a major undertaking
> and so far I've not received a single comment about it.

Maybe I'm just not familiar with the problem space but I really hope
things don't come to that.  It's not like kernel by itself has to
support every single possible use cases.

> > So, it wouldn't be a good idea to abuse SG_IO filtering for exposing
> > trim/discard.  It's something which should be retired or at least
> > severely restricted in time.  I don't think we want to be developing
> > new uses of it.
> > 
> > I think trim/discards are fairly easy to abstract and common enough to
> > justify having properly abstracted interface.  In fact, we already
> > have block layer interface for it - BLKDISCARD.  If it's lacking,
> > let's improve that.
> 
> I do want to improve the block layer interfaces to avoid that people use
> SG_IO.  But unfortunately this is for a completely different use case.

Hmmm?  This was about discard, no?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 18:29     ` Tejun Heo
@ 2012-09-11 18:54       ` Paolo Bonzini
  2012-09-11 19:13         ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-11 18:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley

Il 11/09/2012 20:29, Tejun Heo ha scritto:> Hello, Paolo.
> 
> On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote:
>> Understood; unfortunately, there is another major user of it
>> (virtualization).  If you are passing "raw" LUNs down to a virtual
>> machine, there's no possibility at all to use a properly encapsulated
> 
> Is there still command filtering issue when you're passing "raw" LUNs
> down?

Yes, the passing down is just a userland program that gets SCSI
commands from the guest, sends them via SG_IO, and passes back the
result.  If the userland program is unprivileged (it usually is), then
you go through the filter.

>> The set of use cases is so variable that no single filter can accomodate
>> all of them: high availability people want persistent reservations, NAS
>> people want trim/discard, but these are just two groups.  Someone is
>> using a Windows VM to run vendor tools and wants to have access to
>> vendor-specific commands.
>>
>> You can tell this last group to use root, but not everyone else who is
>> already relying on Unix permissions, SELinux and/or device cgroups to
>> confine their virtual machines.
> 
> You listed three - HA w/ persistent reservation, NAS w/ trim/discard
> and the third which you said that using root would be fine.  Dunno
> much about persistent reservation but I don't see why trim/discard
> can't use existing block layer facilities whether from userland or
> virtio-scsi?

This is the userland for virtio-scsi (the kernel part of virtio-scsi is just
a driver running in the guest).  It can run in two mode: it can do its own
SCSI emulation, or it can just relay CDBs and their results.

It can (and does) use higher-level services if SCSI emulation is done in
userland.  In that case, trim/discard can become a BLKDISCARD or a fallocate
for example.  However, in this case userland doesn't do any emulation and in
fact doesn't even need to know that this CDB is a discard.

Also, if it fails, there's no way to reconstruct the NAS's sense data to
pass it back to the guest.  We do a limited amount of "making up" sense
data (for example if a command is filtered, all we get is an errno value;
and we say it was not recognized), but it should really be as simple
and limited as possible.

>> A generic filter (see
>> http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal)
>> would be satisfactory for everyone, but it's also a major undertaking
>> and so far I've not received a single comment about it.
> 
> Maybe I'm just not familiar with the problem space but I really hope
> things don't come to that.

Why not? :)  (BTW it was suggested by Alan Cox, that's just my proposal for
how to do it).  I think that it's a good idea, but it's a big bazooka for
the smaller issue of supporting trim/discard.

>>> So, it wouldn't be a good idea to abuse SG_IO filtering for exposing
>>> trim/discard.  It's something which should be retired or at least
>>> severely restricted in time.  I don't think we want to be developing
>>> new uses of it.
>>>
>>> I think trim/discards are fairly easy to abstract and common enough to
>>> justify having properly abstracted interface.  In fact, we already
>>> have block layer interface for it - BLKDISCARD.  If it's lacking,
>>> let's improve that.
>>
>> I do want to improve the block layer interfaces to avoid that people use
>> SG_IO.  But unfortunately this is for a completely different use case.
> 
> Hmmm?  This was about discard, no?

One example of block layer interfaces that I want to add is BLKPING, so
that you can see if the NAS is reachable.  Then SCSI emulation can map
the "test unit ready" command to BLKPING.  There's a handful of such
ioctls that would be useful, such as BLKDISCARD itself.

But this is for the other direction, where ioctls are not enough accurate.

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 18:54       ` Paolo Bonzini
@ 2012-09-11 19:13         ` Tejun Heo
  2012-09-11 19:24           ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2012-09-11 19:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley

Hello, Paolo.

On Tue, Sep 11, 2012 at 08:54:03PM +0200, Paolo Bonzini wrote:
> > On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote:
> >> Understood; unfortunately, there is another major user of it
> >> (virtualization).  If you are passing "raw" LUNs down to a virtual
> >> machine, there's no possibility at all to use a properly encapsulated
> > 
> > Is there still command filtering issue when you're passing "raw" LUNs
> > down?
> 
> Yes, the passing down is just a userland program that gets SCSI
> commands from the guest, sends them via SG_IO, and passes back the
> result.  If the userland program is unprivileged (it usually is), then
> you go through the filter.

Could being able to bypass the filters for this "you own this LUN" be
a solution?  Or is it that we still need command filtering for
whatever reason?

> This is the userland for virtio-scsi (the kernel part of virtio-scsi is just
> a driver running in the guest).  It can run in two mode: it can do its own
> SCSI emulation, or it can just relay CDBs and their results.
> 
> It can (and does) use higher-level services if SCSI emulation is done in
> userland.  In that case, trim/discard can become a BLKDISCARD or a fallocate
> for example.  However, in this case userland doesn't do any emulation and in
> fact doesn't even need to know that this CDB is a discard.

Couldn't it intercept some of them - e.g. RWs and discards?  What's
the benifit / use case of doing pure bypass?  Would the benefits be
strong enough to justify whole bpf cdb filtering?

> Also, if it fails, there's no way to reconstruct the NAS's sense data to
> pass it back to the guest.  We do a limited amount of "making up" sense
> data (for example if a command is filtered, all we get is an errno value;
> and we say it was not recognized), but it should really be as simple
> and limited as possible.

Yeah, I agree losing sense data could suck but that alone doesn't seem
to be a very strong justification for the whole deal and there could
be different / smaller ways to solve the sense data problem.

> >> A generic filter (see
> >> http://article.gmane.org/gmane.linux.kernel/1312326 for a proposal)
> >> would be satisfactory for everyone, but it's also a major undertaking
> >> and so far I've not received a single comment about it.
> > 
> > Maybe I'm just not familiar with the problem space but I really hope
> > things don't come to that.
> 
> Why not? :)  (BTW it was suggested by Alan Cox, that's just my proposal for
> how to do it).  I think that it's a good idea, but it's a big bazooka for
> the smaller issue of supporting trim/discard.

I guess I mostly wanna know for sure that there's big / strong enough
targets for the big bazooka.  :)

> > Hmmm?  This was about discard, no?
> 
> One example of block layer interfaces that I want to add is BLKPING, so
> that you can see if the NAS is reachable.  Then SCSI emulation can map
> the "test unit ready" command to BLKPING.  There's a handful of such
> ioctls that would be useful, such as BLKDISCARD itself.

Can't you make use of the existing disk events mechanism for that?
Block layer already knows how to watch readiness of a device and tell
the userland about it via uevent.  Hooking to that shouldn't be too
difficult and I think probably is the right approach given that all
hotplug userland hotplug operations go through the same channel.

If you absoluately has to test it from userland, READ on the first
sector?  That essentially is what libata does for START_STOP although
it uses VERIFY instead of READ.  Given how partition code behaves, any
device which fails on READ on block0 isn't gonna work well with linux
anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 19:13         ` Tejun Heo
@ 2012-09-11 19:24           ` Paolo Bonzini
  2012-09-11 20:01             ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-11 19:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley

Il 11/09/2012 21:13, Tejun Heo ha scritto:
> Hello, Paolo.
> 
> On Tue, Sep 11, 2012 at 08:54:03PM +0200, Paolo Bonzini wrote:
>>> On Tue, Sep 11, 2012 at 07:56:53PM +0200, Paolo Bonzini wrote:
>>>> Understood; unfortunately, there is another major user of it
>>>> (virtualization).  If you are passing "raw" LUNs down to a virtual
>>>> machine, there's no possibility at all to use a properly encapsulated
>>>
>>> Is there still command filtering issue when you're passing "raw" LUNs
>>> down?
>>
>> Yes, the passing down is just a userland program that gets SCSI
>> commands from the guest, sends them via SG_IO, and passes back the
>> result.  If the userland program is unprivileged (it usually is), then
>> you go through the filter.
> 
> Could being able to bypass the filters for this "you own this LUN" be
> a solution?  Or is it that we still need command filtering for
> whatever reason?

Yes, it could be.  Enabling/disabling the filters from a privileged
program and passing the unfiltered fd via SCM_RIGHTS would be enough.

>> This is the userland for virtio-scsi (the kernel part of virtio-scsi is just
>> a driver running in the guest).  It can run in two mode: it can do its own
>> SCSI emulation, or it can just relay CDBs and their results.
>>
>> It can (and does) use higher-level services if SCSI emulation is done in
>> userland.  In that case, trim/discard can become a BLKDISCARD or a fallocate
>> for example.  However, in this case userland doesn't do any emulation and in
>> fact doesn't even need to know that this CDB is a discard.
> 
> Couldn't it intercept some of them - e.g. RWs and discards?
> What's the benifit / use case of doing pure bypass?

Basically, using the same storage technology for bare metal and
virtualized systems.  IMHO losing sense data is a no-no, but the above
solution could be feasible too.

> Would the benefits be strong enough to justify whole bpf cdb filtering?

If we can get a simpler solution that is okay with kernel maintainers,
I'm all for it.

>>> Hmmm?  This was about discard, no?
>>
>> One example of block layer interfaces that I want to add is BLKPING, so
>> that you can see if the NAS is reachable.  Then SCSI emulation can map
>> the "test unit ready" command to BLKPING.  There's a handful of such
>> ioctls that would be useful, such as BLKDISCARD itself.
> 
> Can't you make use of the existing disk events mechanism for that?
> Block layer already knows how to watch readiness of a device and tell
> the userland about it via uevent.

How?  But anyway i don't want to divert the discussion from the actual
topic...

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 19:24           ` Paolo Bonzini
@ 2012-09-11 20:01             ` Tejun Heo
  2012-09-11 21:50               ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2012-09-11 20:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley, Kay Sievers

Hello, Paolo.

On Tue, Sep 11, 2012 at 09:24:32PM +0200, Paolo Bonzini wrote:
> > Couldn't it intercept some of them - e.g. RWs and discards?
> > What's the benifit / use case of doing pure bypass?
> 
> Basically, using the same storage technology for bare metal and
> virtualized systems.  IMHO losing sense data is a no-no, but the above
> solution could be feasible too.

Most of my experience with storage devices comes from SATA where error
data is more of the deal "take some hints if you really want but
there's pretty good chance it's completely bogus", so my perception
about the importance of sense data might not match the fancy SCSI
land.  I don't know.

Either way, with or without virtualization, making detailed error
information to userland is a valid goal.  I *think* we're finally
getting there after years of talking via structured printk.  I don't
know much about the details but heard about exposing sense data via
printk.  cc'ing Kay.  Kay, could that be useful for virtualization use
cases too?  They want to obtain the sense data after command failure.
I suppose there would be some challenge in matching actions and error
logs tho and it could be too clunky to use this way.

> > Can't you make use of the existing disk events mechanism for that?
> > Block layer already knows how to watch readiness of a device and tell
> > the userland about it via uevent.
> 
> How?  But anyway i don't want to divert the discussion from the actual
> topic...

Disk events mechanism is there to watch (either via async notification
or polling) media change and device readiness and generates the usual
uevents when it detects them.  For sd devices, it basically issues TUR
periodically, so it's already doing pretty much what you need.

I guess the repeating question is whether to solve the problem within
the framework the underlying OS is providing or having direct access
to the raw hardware.  I don't know the answer.

Accessing the "raw" hardware does have its advantages but managing
multiple users so that they don't step on each other's foot is one of
the main reasons we have this kernel thing after all, so there's
natural tension between "wanting raw" and "multiuser security/whatever
concerns".  Beyond certain point, I think it essentially becomes
wanting the cake and eating it too.

I personally hope "raw" to be strictly confined to specific areas
where performance impact of having kernel inbetween is prohibitive but
that's just me hoping.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 20:01             ` Tejun Heo
@ 2012-09-11 21:50               ` Paolo Bonzini
  2012-09-11 22:02                 ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-11 21:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley,
	Kay Sievers, Al Viro

[Al: you can jump down to "One problem:"]

Il 11/09/2012 22:01, Tejun Heo ha scritto:
> Hello, Paolo.
> 
> On Tue, Sep 11, 2012 at 09:24:32PM +0200, Paolo Bonzini wrote:
>>> Couldn't it intercept some of them - e.g. RWs and discards?
>>> What's the benifit / use case of doing pure bypass?
>>
>> Basically, using the same storage technology for bare metal and
>> virtualized systems.  IMHO losing sense data is a no-no, but the above
>> solution could be feasible too.
> 
> Either way, with or without virtualization, making detailed error
> information to userland is a valid goal.  I *think* we're finally
> getting there after years of talking via structured printk.  I don't
> know much about the details but heard about exposing sense data via
> printk.

Wait wait, there is already a perfectly 1:1 solution for this, and it's
SG_IO.

I think error processing falls roughly in two categories: "I need each
command's precise state" and "I need to know if/when something bad
happens".  Luckily, I/O also falls roughly in the same two categories:
"I need precise control of each commands" and "I just care of getting
this to disk".  The former can use SG_IO, the latter can use logs.

So, let's not complicate the problem further.  We have a perfectly sane
API that (with different names) is even provided by almost every
operating system in existence.  There's just this little detail of
filtering that is done for unprivileged processes; I hoped to fix 50% of
the problem with this 3-line patch but it's not the end of the world if
it's rejected constructively.

The solution I outlined in my previous email:

>> Enabling/disabling the filters from a privileged
>> program and passing the unfiltered fd via SCM_RIGHTS would be enough.

would entail some userland coding, but nothing paramount at all (and
closer to my usual territory :)).  And we would have to do it anyway for
the reservations case.

Basically it would be a ioctl(fd, SG_SET_FILTER_ENABLED, arg) where arg
can be:

-1 for "enable/disable based on CAP_SYS_RAWIO" (default)
0 for always enable filter
1 for always disable filter

And also a dual ioctl(fd, SG_GET_FILTER_ENABLED, arg).

One problem: to do this, I need to access some "struct file" member in
SG_IO, and thus change the ioctl member from block_device/fmode to
block_device/file.   This would partially undo the 2007 switch from
inode/file by Al Viro.  He was already asked about it in
https://lkml.org/lkml/2012/6/12/414, let's try again here.

>>> Can't you make use of the existing disk events mechanism for that?
>>> Block layer already knows how to watch readiness of a device and tell
>>> the userland about it via uevent.
>>
>> How?  But anyway i don't want to divert the discussion from the actual
>> topic...
> 
> Disk events mechanism is there to watch (either via async notification
> or polling) media change and device readiness and generates the usual
> uevents when it detects them.  For sd devices, it basically issues TUR
> periodically, so it's already doing pretty much what you need.

Ah, no, we can't do that because the device should be opened with
O_EXCL.  It is not right now, but it's a bug.  It's not very different
from burning a CD (in fact, it's absolutely the same if you burn a CD
inside a guest :)).

> I guess the repeating question is whether to solve the problem within
> the framework the underlying OS is providing or having direct access
> to the raw hardware.  I don't know the answer.
> 
> Accessing the "raw" hardware does have its advantages but managing
> multiple users

In this case, the constraints pretty much guarantee that you have only
one user.  To stick with everyday hardware, if you pass your CD drive to
a guest you can well expect that the host will not be able to use it.

Or, if you have more than one user, that they know what they are doing
(reservations, etc.).

> I personally hope "raw" to be strictly confined to specific areas
> where performance impact of having kernel inbetween is prohibitive but
> that's just me hoping.

Well, it's not just about performance but also about precision sometimes.

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 21:50               ` Paolo Bonzini
@ 2012-09-11 22:02                 ` Tejun Heo
  2012-09-11 22:10                   ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2012-09-11 22:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley,
	Kay Sievers, Al Viro

Hello, Paolo.

On Tue, Sep 11, 2012 at 11:50:38PM +0200, Paolo Bonzini wrote:
> > Either way, with or without virtualization, making detailed error
> > information to userland is a valid goal.  I *think* we're finally
> > getting there after years of talking via structured printk.  I don't
> > know much about the details but heard about exposing sense data via
> > printk.
> 
> Wait wait, there is already a perfectly 1:1 solution for this, and it's
> SG_IO.
> 
> I think error processing falls roughly in two categories: "I need each
> command's precise state" and "I need to know if/when something bad
> happens".  Luckily, I/O also falls roughly in the same two categories:
> "I need precise control of each commands" and "I just care of getting
> this to disk".  The former can use SG_IO, the latter can use logs.

SG_IO itself is a bypassing interface.  It bypasses most of block
layer and the kernel doesn't have any idea (apart from the adhoc
filtering) about what's going on.  The problem can be approached from
both directions (make use of OS IO layer improving it as needed or add
more intelligence to the bypass thing) and I'm not sure at all adding
more capability to the adhoc filtering is the better direction.

So, what I wanna say is that if you can get by with the adhoc thing we
already have in place for cd/dvd, that's great; otherwise, it's not
clear at all whether expanding that adhoc filtering is a good idea.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 22:02                 ` Tejun Heo
@ 2012-09-11 22:10                   ` Paolo Bonzini
  2012-09-11 22:13                     ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-11 22:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley,
	Kay Sievers, Al Viro

Il 12/09/2012 00:02, Tejun Heo ha scritto:
> SG_IO itself is a bypassing interface.  It bypasses most of block
> layer and the kernel doesn't have any idea (apart from the adhoc
> filtering) about what's going on.

That's very much the point.  The guest must have free reins.

You asked "Could being able to bypass the filters for this "you own this
LUN" be a solution?", I said yes and outlined how.  Do you agree with
the proposed solution?

> The problem can be approached from
> both directions (make use of OS IO layer improving it as needed or add
> more intelligence to the bypass thing)

Or remove intelligence if it gets in the way.  Of course the removal
must be done by an appropriately-privileged program.

> and I'm not sure at all adding
> more capability to the adhoc filtering is the better direction.

Sure, I'm fine with leaving the current ad hoc filtering aside.  Again,
I was hoping to get most of the job done by loosening the filter a bit,
but discussion is inversely proportional to patch length sometimes.

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 22:10                   ` Paolo Bonzini
@ 2012-09-11 22:13                     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2012-09-11 22:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, axboe, linux-scsi, James E.J. Bottomley,
	Kay Sievers, Al Viro

Hello,

On Tue, Sep 11, 2012 at 3:10 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 12/09/2012 00:02, Tejun Heo ha scritto:
>> SG_IO itself is a bypassing interface.  It bypasses most of block
>> layer and the kernel doesn't have any idea (apart from the adhoc
>> filtering) about what's going on.
>
> That's very much the point.  The guest must have free reins.
>
> You asked "Could being able to bypass the filters for this "you own this
> LUN" be a solution?", I said yes and outlined how.  Do you agree with
> the proposed solution?

Ooh, yeah, I like that one. I was still thinking about the bpf one. It
being based on cgroup made me worried even more. :)

> Sure, I'm fine with leaving the current ad hoc filtering aside.  Again,
> I was hoping to get most of the job done by loosening the filter a bit,
> but discussion is inversely proportional to patch length sometimes.

Sorry about dragging it on. The cdb filtering one has always been
controversial and I think it'll continue to be. It's an unfortunate
thing that we had to add.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-11 17:56   ` Paolo Bonzini
  2012-09-11 18:29     ` Tejun Heo
@ 2012-09-12  8:05     ` James Bottomley
  2012-09-12  8:18       ` Paolo Bonzini
  1 sibling, 1 reply; 24+ messages in thread
From: James Bottomley @ 2012-09-12  8:05 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Tejun Heo, linux-kernel, axboe, linux-scsi

On Tue, 2012-09-11 at 19:56 +0200, Paolo Bonzini wrote:
> The set of use cases is so variable that no single filter can accomodate
> all of them: high availability people want persistent reservations, NAS
> people want trim/discard, but these are just two groups.  Someone is
> using a Windows VM to run vendor tools and wants to have access to
> vendor-specific commands.
> 
> You can tell this last group to use root, but not everyone else who is
> already relying on Unix permissions, SELinux and/or device cgroups to
> confine their virtual machines.

This is why the whole filter thing was mutable via sysfs.  That way the
admin could set this up per device.  It sounds like this is what you
want to fix, rather than opening up more holes in an already leaky
security apparatus.  The ideal is that we would be much more restrictive
by default and give root the ability to override this both globally and
per-device to conform to whatever policy it has for the virtual
environments.

The patch which removed all of the sysfs pieces was:

commit 018e0446890661504783f92388ecce7138c1566d
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Fri Jun 26 16:27:10 2009 +0200

    block: get rid of queue-private command filter

So that's probably the place to start for putting it back properly.

James



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO
  2012-09-12  8:05     ` James Bottomley
@ 2012-09-12  8:18       ` Paolo Bonzini
  0 siblings, 0 replies; 24+ messages in thread
From: Paolo Bonzini @ 2012-09-12  8:18 UTC (permalink / raw)
  To: James Bottomley; +Cc: Tejun Heo, linux-kernel, axboe, linux-scsi

Il 12/09/2012 10:05, James Bottomley ha scritto:
> This is why the whole filter thing was mutable via sysfs.  That way the
> admin could set this up per device.  It sounds like this is what you
> want to fix, rather than opening up more holes in an already leaky
> security apparatus.

It is, thanks for the useful history lesson!

Paolo

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2012-09-12  8:18 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-20 16:30 [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO Paolo Bonzini
2012-08-01 15:53 ` Paolo Bonzini
2012-08-28 11:04   ` Paolo Bonzini
2012-09-05 14:41     ` [Ping^3] " Paolo Bonzini
2012-09-05 20:18       ` Ric Wheeler
2012-09-06  6:31         ` Paolo Bonzini
2012-09-06 11:31           ` Ric Wheeler
2012-09-06 11:49             ` Paolo Bonzini
2012-09-06 12:08               ` Ric Wheeler
2012-09-06 12:36                 ` Paolo Bonzini
2012-09-06 14:20                   ` Lukáš Czerner
2012-09-11 16:59 ` Tejun Heo
2012-09-11 17:56   ` Paolo Bonzini
2012-09-11 18:29     ` Tejun Heo
2012-09-11 18:54       ` Paolo Bonzini
2012-09-11 19:13         ` Tejun Heo
2012-09-11 19:24           ` Paolo Bonzini
2012-09-11 20:01             ` Tejun Heo
2012-09-11 21:50               ` Paolo Bonzini
2012-09-11 22:02                 ` Tejun Heo
2012-09-11 22:10                   ` Paolo Bonzini
2012-09-11 22:13                     ` Tejun Heo
2012-09-12  8:05     ` James Bottomley
2012-09-12  8:18       ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).