* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
@ 2013-02-07 12:08 ` Boaz Harrosh
0 siblings, 0 replies; 28+ messages in thread
From: Boaz Harrosh @ 2013-02-07 12:08 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc, linux-fsdevel,
linux-scsi, martin.petersen, FUJITA Tomonori
On 02/07/2013 01:27 PM, Hannes Reinecke wrote:
> On 02/07/2013 11:01 AM, Darrick J. Wong wrote:
>> On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote:
>>> On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote:
>>>>
>>>> On Feb 6, 2013, at 3:24 PM, "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
>>>>
>>>>> On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm interested in discussing how to pass protection information to and from
>>>>>> userspace. Maybe Martin could be enlisted for the discussion.
>>>>>>
>>>>>> I read that some work has already been done in this area but have not been able
>>>>>> to locate it. It looks like the bio-integrity code already makes it possible
>>>>>> to generate the t10-dif crc in the filesystem. It would be good to be able to
>>>>>> get the guard and application tags back out to backup applications such as
>>>>>> xfsdump. Enabling other applications to generate their own tags in userspace
>>>>>> is also interesting.
>>>>>
>>>>> This one's been on my list for a couple of years (and companies) too. A few
>>>>> years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
>>>>> gone anywhere), and more recently I've theorized that we could add a magic
>>>>> fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT
>>>>> *{read,write}v call as the PI buffer, which I think is similar to how DIX gets
>>>>> PI data to a disk. But it's not like I have any code to show for it.
>>>>>
>>>>> I /think/ it's fairly straightforward to change the directio submit code to
>>>>> find the userspace PI buffer and amend the block integrity code to attach our
>>>>> own PI buffer. You'd still have to let the block layer set the sector # field,
>>>>> but afaik that won't affect the crc or the app tag.
>>>>>
>>>>> I hear that the NFS guys want to propose some sort of protocol for transmitting
>>>>> PI data (across NFS), but I haven't seen anything concrete yet.
>>>>
>>>> I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space.
>>>>
>>>> Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage.
>>>
>>> I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio()
>>> coding hasn't happened. I do think we're better off with some kind of
>>> explicit API than some magic state on the file. I mean, even something
>>> like:
>>>
>>> ssize_t write_with_pi(int fd, const void *buf, size_t count,
>>> const void *pi, size_t pi_count);
>>>
>>> It's not as nice as a non-historical API (eg sys_dio), but it also
>>> probably plays nicer with buffered I/O.
>>
>> I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio
>> and all the other plumbing necessary to make that happen...
>>
>> void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
>> int iovcnt, long long offset, const void *pi,
>> size_t pi_count);
>>
> This is also what I've envisioned.
> Updating io_prep / async I/O is reasonably easy as its been using a
> separate structure for passing in the I/O details.
>
> Normal read/write calls don't really map as you simply don't have
> enough parameter to feed PI information into the kernel.
> So for that you'd need to invent a new interface / syscall.
>
> For aio we just need to add additional fields to an existing structure.
>
> So yeah, I'd be interested in that discussion as well.
>
Me too, on multiple fronts. It's part of my general concern about
"things we would like for user-mode servers".

I think that the current aio and libaio interface has been broken for a long
time, for a multitude of reasons. For instance, the nested structure definitions
are COMPAT broken, and there are lots of missing pieces. (For example, search the
archives for why bsg does not support sg-lists.)

And there are all these additions that everyone wants on top, which call for
a new interface anyway.

So I would like to see a deep fixup of this interface, with an aio version 2
that takes into consideration all future needs, including the ones
above. Kernel code will be very happy to be implemented against the new interface,
and a COMPAT layer could be put in place for the old interface.

All interested parties should bring to the table the extensions/changes
they need, and we can try to combine all of them.

(My addition is support for sg_lists in bsg, in a way that makes Tomo happy.
I know that qemu has wanted this for a while, as have the multitude of
user-mode servers.)
Thanks
Boaz
> Cheers,
>
> Hannes
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 12:08 ` Boaz Harrosh
@ 2013-02-07 12:16 ` Boaz Harrosh
-1 siblings, 0 replies; 28+ messages in thread
From: Boaz Harrosh @ 2013-02-07 12:16 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc, linux-fsdevel,
linux-scsi, martin.petersen, FUJITA Tomonori
On 02/07/2013 02:08 PM, Boaz Harrosh wrote:
> On 02/07/2013 01:27 PM, Hannes Reinecke wrote:
>> On 02/07/2013 11:01 AM, Darrick J. Wong wrote:
>>> On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote:
>>>> On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote:
>>>>>
>>>>> On Feb 6, 2013, at 3:24 PM, "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
>>>>>
>>>>>> On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm interested in discussing how to pass protection information to and from
>>>>>>> userspace. Maybe Martin could be enlisted for the discussion.
>>>>>>>
>>>>>>> I read that some work has already been done in this area but have not been able
>>>>>>> to locate it. It looks like the bio-integrity code already makes it possible
>>>>>>> to generate the t10-dif crc in the filesystem. It would be good to be able to
>>>>>>> get the guard and application tags back out to backup applications such as
>>>>>>> xfsdump. Enabling other applications to generate their own tags in userspace
>>>>>>> is also interesting.
>>>>>>
>>>>>> This one's been on my list for a couple of years (and companies) too. A few
>>>>>> years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
>>>>>> gone anywhere), and more recently I've theorized that we could add a magic
>>>>>> fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT
>>>>>> *{read,write}v call as the PI buffer, which I think is similar to how DIX gets
>>>>>> PI data to a disk. But it's not like I have any code to show for it.
>>>>>>
>>>>>> I /think/ it's fairly straightforward to change the directio submit code to
>>>>>> find the userspace PI buffer and amend the block integrity code to attach our
>>>>>> own PI buffer. You'd still have to let the block layer set the sector # field,
>>>>>> but afaik that won't affect the crc or the app tag.
>>>>>>
>>>>>> I hear that the NFS guys want to propose some sort of protocol for transmitting
>>>>>> PI data (across NFS), but I haven't seen anything concrete yet.
>>>>>
>>>>> I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space.
>>>>>
>>>>> Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage.
>>>>
>>>> I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio()
>>>> coding hasn't happened. I do think we're better off with some kind of
>>>> explicit API than some magic state on the file. I mean, even something
>>>> like:
>>>>
>>>> ssize_t write_with_pi(int fd, const void *buf, size_t count,
>>>> const void *pi, size_t pi_count);
>>>>
>>>> It's not as nice as a non-historical API (eg sys_dio), but it also
>>>> probably plays nicer with buffered I/O.
>>>
>>> I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio
>>> and all the other plumbing necessary to make that happen...
>>>
>>> void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
>>> int iovcnt, long long offset, const void *pi,
>>> size_t pi_count);
>>>
>> This is also what I've envisioned.
>> Updating io_prep / async I/O is reasonably easy as its been using a
>> separate structure for passing in the I/O details.
>>
>> Normal read/write calls don't really map as you simply don't have
>> enough parameter to feed PI information into the kernel.
>> So for that you'd need to invent a new interface / syscall.
>>
>> For aio we just need to add additional fields to an existing structure.
>>
>> So yeah, I'd be interested in that discussion as well.
>>
>
> Me too, in multiple fronts. It's part of my general concern about
> "things we would like for user-mode servers"
>
> I think that the current aio and libaio Interface is broken for a long
> time, for multitude of reasons. For instance the nested structure definitions
> are COMPAT broken, and lots of missing pieces. (For example search in archives
> for why bsg does not support sg-lists.)
>
> And there are all these additions that everyone wants on top, that call for
> a new interface anyway.
>
> So I would like to see a deep fixup of this interface, with an aio version2
> that can take into considerations, all of future needs including these
> above. Kernel code will be very happy to be implemented with the new, interface
> and a COMPAT layer could be put in place for the old interface.
>
> All interested parties should bring to the table what is the extension/changes
> they need. And we can try and union all of them together.
>
> (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy
> I know that qemu was wanting this for a while as well as the multitude of
> user-mode servers)
>
I wanted to add that there is another LSF/MM thread going on:
"[LSF TOPIC] What to do about O_DIRECT?"

All these guys should be participating here, so that core structures
and behavior change to a better model that helps us here and does not work against us.

(Again, libaio should be changed in concert with the kernel's new API, and we
can sacrifice old user-mode performance with a COMPAT layer. Distro
maintainers should consider replacing libaio together with the new
kernel, so that only those who do their own mix-and-match are left to
fix that mismatch themselves.)
> Thanks
> Boaz
>
>> Cheers,
>>
>> Hannes
>>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 12:16 ` Boaz Harrosh
(?)
@ 2013-02-07 12:33 ` Hannes Reinecke
2013-02-07 12:54 ` Boaz Harrosh
-1 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2013-02-07 12:33 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc, linux-fsdevel,
linux-scsi, martin.petersen, FUJITA Tomonori
On 02/07/2013 01:16 PM, Boaz Harrosh wrote:
> On 02/07/2013 02:08 PM, Boaz Harrosh wrote:
>> On 02/07/2013 01:27 PM, Hannes Reinecke wrote:
>>> On 02/07/2013 11:01 AM, Darrick J. Wong wrote:
>>>> On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote:
>>>>> On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote:
>>>>>>
>>>>>> On Feb 6, 2013, at 3:24 PM, "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
>>>>>>
>>>>>>> On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm interested in discussing how to pass protection information to and from
>>>>>>>> userspace. Maybe Martin could be enlisted for the discussion.
>>>>>>>>
>>>>>>>> I read that some work has already been done in this area but have not been able
>>>>>>>> to locate it. It looks like the bio-integrity code already makes it possible
>>>>>>>> to generate the t10-dif crc in the filesystem. It would be good to be able to
>>>>>>>> get the guard and application tags back out to backup applications such as
>>>>>>>> xfsdump. Enabling other applications to generate their own tags in userspace
>>>>>>>> is also interesting.
>>>>>>>
>>>>>>> This one's been on my list for a couple of years (and companies) too. A few
>>>>>>> years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
>>>>>>> gone anywhere), and more recently I've theorized that we could add a magic
>>>>>>> fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT
>>>>>>> *{read,write}v call as the PI buffer, which I think is similar to how DIX gets
>>>>>>> PI data to a disk. But it's not like I have any code to show for it.
>>>>>>>
>>>>>>> I /think/ it's fairly straightforward to change the directio submit code to
>>>>>>> find the userspace PI buffer and amend the block integrity code to attach our
>>>>>>> own PI buffer. You'd still have to let the block layer set the sector # field,
>>>>>>> but afaik that won't affect the crc or the app tag.
>>>>>>>
>>>>>>> I hear that the NFS guys want to propose some sort of protocol for transmitting
>>>>>>> PI data (across NFS), but I haven't seen anything concrete yet.
>>>>>>
>>>>>> I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space.
>>>>>>
>>>>>> Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage.
>>>>>
>>>>> I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio()
>>>>> coding hasn't happened. I do think we're better off with some kind of
>>>>> explicit API than some magic state on the file. I mean, even something
>>>>> like:
>>>>>
>>>>> ssize_t write_with_pi(int fd, const void *buf, size_t count,
>>>>> const void *pi, size_t pi_count);
>>>>>
>>>>> It's not as nice as a non-historical API (eg sys_dio), but it also
>>>>> probably plays nicer with buffered I/O.
>>>>
>>>> I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio
>>>> and all the other plumbing necessary to make that happen...
>>>>
>>>> void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
>>>> int iovcnt, long long offset, const void *pi,
>>>> size_t pi_count);
>>>>
>>> This is also what I've envisioned.
>>> Updating io_prep / async I/O is reasonably easy as its been using a
>>> separate structure for passing in the I/O details.
>>>
>>> Normal read/write calls don't really map as you simply don't have
>>> enough parameter to feed PI information into the kernel.
>>> So for that you'd need to invent a new interface / syscall.
>>>
>>> For aio we just need to add additional fields to an existing structure.
>>>
>>> So yeah, I'd be interested in that discussion as well.
>>>
>>
>> Me too, in multiple fronts. It's part of my general concern about
>> "things we would like for user-mode servers"
>>
>> I think that the current aio and libaio Interface is broken for a long
>> time, for multitude of reasons. For instance the nested structure definitions
>> are COMPAT broken, and lots of missing pieces. (For example search in archives
>> for why bsg does not support sg-lists.)
>>
>> And there are all these additions that everyone wants on top, that call for
>> a new interface anyway.
>>
>> So I would like to see a deep fixup of this interface, with an aio version2
>> that can take into considerations, all of future needs including these
>> above. Kernel code will be very happy to be implemented with the new, interface
>> and a COMPAT layer could be put in place for the old interface.
>>
>> All interested parties should bring to the table what is the extension/changes
>> they need. And we can try and union all of them together.
>>
>> (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy
>> I know that qemu was wanting this for a while as well as the multitude of
>> user-mode servers)
>>
>
> I wanted to add that there is another LSF/MM thread going on about:
> "[LSF TOPIC] What to do about O_DIRECT?"
>
> All these guys should be participating here, so to change core structures
> and behavior to a better model, that helps us here, and not against us.
>
> (Again libaio should be changed in concert with Kernel's new API, and we
> can sacrifice old user-mode performance, with a COMPAT layer. Distro
> maintainers should consider replacing libaio, together with the new
> Kernel, so it is only those that do their own mix-and-match, who can
> fix that mismatch too)
>
And while we're at it, I still would _love_ to connect aio_cancel()
and blk_abort_request().
That way we could sensibly abort an I/O and get out of the darn 'D'
state.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 12:33 ` Hannes Reinecke
@ 2013-02-07 12:54 ` Boaz Harrosh
0 siblings, 0 replies; 28+ messages in thread
From: Boaz Harrosh @ 2013-02-07 12:54 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc, linux-fsdevel,
linux-scsi, martin.petersen, FUJITA Tomonori
On 02/07/2013 02:33 PM, Hannes Reinecke wrote:
> On 02/07/2013 01:16 PM, Boaz Harrosh wrote:
>> (Again libaio should be changed in concert with Kernel's new API, and we
>> can sacrifice old user-mode performance, with a COMPAT layer. Distro
>> maintainers should consider replacing libaio, together with the new
>> Kernel, so it is only those that do their own mix-and-match, who can
>> fix that mismatch too)
>>
> And while we're at it, I still would _love_ to connect aio_cancel()
> and blk_abort_request().
>
> That way we could sensibly abort an I/O and get out of the darn 'D'
> state.
>
Yes!! Thanks. It is very interesting how the socket side of the world
has had it right for ages, while the same "fd" object on disks is a second-class
citizen in UNIX land. (Anybody voting for epoll on async disk IO?)

Thanks Hannes, yes, that too. And interruptible waits too; a couple of
places will need a serious error-handling audit for that.
> Cheers,
>
> Hannes
>
Boaz
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 12:08 ` Boaz Harrosh
(?)
(?)
@ 2013-02-07 12:29 ` Bart Van Assche
2013-02-07 12:47 ` Boaz Harrosh
-1 siblings, 1 reply; 28+ messages in thread
From: Bart Van Assche @ 2013-02-07 12:29 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Hannes Reinecke, Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc,
linux-fsdevel, linux-scsi, martin.petersen, FUJITA Tomonori
On 02/07/13 13:08, Boaz Harrosh wrote:
> (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy
> I know that qemu was wanting this for a while as well as the multitude of
> user-mode servers)
Do you think it would help / make sense if sg_alloc_table() were
modified such that it allocates the entire scatterlist table via one
vmalloc() call instead of chaining several page-sized scatterlist
tables? Note: such a change is not possible without modifying
scsi_alloc_sgtable().
Bart.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 12:29 ` Bart Van Assche
@ 2013-02-07 12:47 ` Boaz Harrosh
0 siblings, 0 replies; 28+ messages in thread
From: Boaz Harrosh @ 2013-02-07 12:47 UTC (permalink / raw)
To: Bart Van Assche
Cc: Hannes Reinecke, Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc,
linux-fsdevel, linux-scsi, martin.petersen, FUJITA Tomonori
On 02/07/2013 02:29 PM, Bart Van Assche wrote:
> On 02/07/13 13:08, Boaz Harrosh wrote:
>> (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy
>> I know that qemu was wanting this for a while as well as the multitude of
>> user-mode servers)
>
> Do you think it would help / make sense if sg_alloc_table() would be
> modified such that it allocates the entire scatterlist table via one
> vmalloc() call instead of chaining several page-sized scatterlist tables
> ? Note: such a change is not possible without modifying
> scsi_alloc_sgtable().
>
I don't think so, no. sg_alloc_table() is used not only for direct IO
but also for buffered IO. vmalloc() is terribly slow and would be a bottleneck
at today's SSD performance levels.

I love it that the Linux kernel avoids vmalloc in its I/O paths, and only ever
chains together objects of up to PAGE_SIZE. Coming from all these
other OSs that don't, believe me, it is a great, great performance pain.
(TLBs are a bitch)
> Bart.
>
Thanks
Boaz
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 12:08 ` Boaz Harrosh
@ 2013-02-07 16:19 ` Jeff Moyer
-1 siblings, 0 replies; 28+ messages in thread
From: Jeff Moyer @ 2013-02-07 16:19 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Hannes Reinecke, Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc,
linux-fsdevel, linux-scsi, martin.petersen, FUJITA Tomonori,
Zach Brown
Boaz Harrosh <bharrosh@panasas.com> writes:
>>> I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio
>>> and all the other plumbing necessary to make that happen...
>>>
>>> void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
>>> int iovcnt, long long offset, const void *pi,
>>> size_t pi_count);
>>>
>> This is also what I've envisioned.
>> Updating io_prep / async I/O is reasonably easy as its been using a
>> separate structure for passing in the I/O details.
>>
>> Normal read/write calls don't really map as you simply don't have
>> enough parameter to feed PI information into the kernel.
>> So for that you'd need to invent a new interface / syscall.
>>
>> For aio we just need to add additional fields to an existing structure.
>>
>> So yeah, I'd be interested in that discussion as well.
Sure, it's easy to start there, but then you eventually end up having to
add a non-aio interface as well. Let's not take the latter off the
table.
> Me too, in multiple fronts. It's part of my general concern about
> "things we would like for user-mode servers"
>
> I think that the current aio and libaio Interface is broken for a long
> time, for multitude of reasons. For instance the nested structure definitions
> are COMPAT broken
News to me. I run the libaio test harness built with -m32 on 64 bit
regularly. What, exactly, is broken?
> , and lots of missing pieces. (For example search in archives
> for why bsg does not support sg-lists.)
> And there are all these additions that everyone wants on top, that call for
> a new interface anyway.
What was proposed above does not require a new interface. It's just an
additional IO_CMD_*. I'm not saying there aren't reasons for a new
interface, it's just I didn't see any in this thread.
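For concreteness, a minimal sketch of what "just an additional IO_CMD_*"
could look like, following the io_prep_preadv_pi() prototype quoted above.
Everything prefixed hyp_/HYP_ is hypothetical: the struct is a stand-in, not
the real libaio struct iocb layout, the opcode value is made up, and the
thread does not say whether pi_count is in bytes or tuples (bytes are
assumed here). The size helper assumes the usual T10 DIF layout of one
8-byte tuple per protected sector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical values; neither exists in libaio or the kernel ABI. */
enum { HYP_IO_CMD_PREADV_PI = 1000 };
enum { HYP_PI_TUPLE_BYTES = 8 };  /* T10 DIF: guard + app tag + ref tag */

/* Hypothetical stand-in for struct iocb with two extra PI fields. */
struct hyp_iocb {
    int fd;
    int opcode;
    const struct iovec *iov;
    int iovcnt;
    long long offset;
    const void *pi;   /* userspace protection information buffer */
    size_t pi_count;  /* assumed: size of the PI buffer in bytes */
};

/* Fill in a request the way the io_prep_* helpers do: no I/O happens here. */
static void hyp_io_prep_preadv_pi(struct hyp_iocb *iocb, int fd,
                                  const struct iovec *iov, int iovcnt,
                                  long long offset, const void *pi,
                                  size_t pi_count)
{
    assert(iocb != NULL);
    memset(iocb, 0, sizeof(*iocb));
    iocb->fd = fd;
    iocb->opcode = HYP_IO_CMD_PREADV_PI;
    iocb->iov = iov;
    iocb->iovcnt = iovcnt;
    iocb->offset = offset;
    iocb->pi = pi;
    iocb->pi_count = pi_count;
}

/* PI bytes needed for data_len bytes of sector-aligned data. */
static size_t hyp_pi_bytes(size_t data_len, size_t sector_size)
{
    assert(sector_size != 0 && data_len % sector_size == 0);
    return (data_len / sector_size) * HYP_PI_TUPLE_BYTES;
}
```

The point of the sketch is only that the existing submit path is unchanged:
the extra buffer rides along in the request structure, which is why no new
syscall is needed for the aio case.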
> So I would like to see a deep fixup of this interface, with an aio version2
> that can take into consideration all future needs, including these
> above. Kernel code will be very happy to be implemented with the new interface
> and a COMPAT layer could be put in place for the old interface.
>
> All interested parties should bring to the table what is the extension/changes
> they need. And we can try and union all of them together.
>
> (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy
> I know that qemu was wanting this for a while as well as the multitude of
> user-mode servers)
I'm not sure how that's directly related to aio, but ok. If we're going
to rewrite the aio code, I think Zach's acall would be a good start, at
least on the API front:
http://lwn.net/Articles/316806/
Cheers,
Jeff
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 16:19 ` Jeff Moyer
(?)
@ 2013-02-07 17:27 ` Zach Brown
2013-02-07 17:36 ` Joel Becker
-1 siblings, 1 reply; 28+ messages in thread
From: Zach Brown @ 2013-02-07 17:27 UTC (permalink / raw)
To: Jeff Moyer
Cc: Boaz Harrosh, Hannes Reinecke, Darrick J. Wong, Chuck Lever,
Ben Myers, lsf-pc, linux-fsdevel, linux-scsi, martin.petersen,
FUJITA Tomonori
On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
> Boaz Harrosh <bharrosh@panasas.com> writes:
> >>
> >> For aio we just need to add additional fields to an existing structure.
> >>
> >> So yeah, I'd be interested in that discussion as well.
>
> Sure, it's easy to start there, but then you eventually end up having to
> add a non-aio interface as well. Let's not take the latter off the
> table.
I agree that a sync variant shouldn't be ignored, but needing a sync
interface with PI arguments also shouldn't get in the way of adding
support to the aio+dio path. Simply because it's what people use :/.
> I'm not sure how that's directly related to aio, but ok. If we're going
> to rewrite the aio code, I think Zach's acall would be a good start, at
> least on the API front:
> http://lwn.net/Articles/316806/
Yeah, I'm happy to chat about this stuff if people are interested. I
think I'd do things differently today than what was done in that aged
acall prototype.
- z
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 17:27 ` Zach Brown
@ 2013-02-07 17:36 ` Joel Becker
2013-02-07 21:04 ` J. Bruce Fields
0 siblings, 1 reply; 28+ messages in thread
From: Joel Becker @ 2013-02-07 17:36 UTC (permalink / raw)
To: Zach Brown
Cc: Jeff Moyer, Boaz Harrosh, Hannes Reinecke, Darrick J. Wong,
Chuck Lever, Ben Myers, lsf-pc, linux-fsdevel, linux-scsi,
martin.petersen, FUJITA Tomonori
Dear LSF committee,
I'd like to explicitly request attendance for this discussion
:-)
Joel
On Thu, Feb 07, 2013 at 09:27:35AM -0800, Zach Brown wrote:
> On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
> > Boaz Harrosh <bharrosh@panasas.com> writes:
> > >>
> > >> For aio we just need to add additional fields to an existing structure.
> > >>
> > >> So yeah, I'd be interested in that discussion as well.
> >
> > Sure, it's easy to start there, but then you eventually end up having to
> > add a non-aio interface as well. Let's not take the latter off the
> > table.
>
> I agree that a sync variant shouldn't be ignored, but needing a sync
> interface with PI arguments also shouldn't get in the way of adding
> support to the aio+dio path. Simply because it's what people use :/.
>
> > I'm not sure how that's directly related to aio, but ok. If we're going
> > to rewrite the aio code, I think Zach's acall would be a good start, at
> > least on the API front:
> > http://lwn.net/Articles/316806/
>
> Yeah, I'm happy to chat about this stuff if people are interested. I
> think I'd do things differently today than what was done in that aged
> acall prototype.
>
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
"You can get more with a kind word and a gun than you can with
a kind word alone."
- Al Capone
http://www.jlbec.org/
jlbec@evilplan.org
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 17:36 ` Joel Becker
@ 2013-02-07 21:04 ` J. Bruce Fields
2013-02-08 9:38 ` Joel Becker
0 siblings, 1 reply; 28+ messages in thread
From: J. Bruce Fields @ 2013-02-07 21:04 UTC (permalink / raw)
To: Zach Brown, Jeff Moyer, Boaz Harrosh, Hannes Reinecke,
Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc, linux-fsdevel,
linux-scsi, martin.petersen, FUJITA Tomonori
On Thu, Feb 07, 2013 at 09:36:39AM -0800, Joel Becker wrote:
> Dear LSF committee,
> I'd like to explicitly request attendance for this discussion
> :-)
http://marc.info/?l=linux-fsdevel&m=135894412908342&w=2
"Also, the way I compile the list of requests is from thread
heads ... that means don't send your attendee request as a
reply to something else either otherwise it might get missed."
--b.
>
> Joel
>
> On Thu, Feb 07, 2013 at 09:27:35AM -0800, Zach Brown wrote:
> > On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
> > > Boaz Harrosh <bharrosh@panasas.com> writes:
> > > >>
> > > >> For aio we just need to add additional fields to an existing structure.
> > > >>
> > > >> So yeah, I'd be interested in that discussion as well.
> > >
> > > Sure, it's easy to start there, but then you eventually end up having to
> > > add a non-aio interface as well. Let's not take the latter off the
> > > table.
> >
> > I agree that a sync variant shouldn't be ignored, but needing a sync
> > interface with PI arguments also shouldn't get in the way of adding
> > support to the aio+dio path. Simply because it's what people use :/.
> >
> > > I'm not sure how that's directly related to aio, but ok. If we're going
> > > to rewrite the aio code, I think Zach's acall would be a good start, at
> > > least on the API front:
> > > http://lwn.net/Articles/316806/
> >
> > Yeah, I'm happy to chat about this stuff if people are interested. I
> > think I'd do things differently today than what was done in that aged
> > acall prototype.
> >
> > - z
>
> --
>
> "You can get more with a kind word and a gun than you can with
> a kind word alone."
> - Al Capone
>
> http://www.jlbec.org/
> jlbec@evilplan.org
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
2013-02-07 21:04 ` J. Bruce Fields
@ 2013-02-08 9:38 ` Joel Becker
0 siblings, 0 replies; 28+ messages in thread
From: Joel Becker @ 2013-02-08 9:38 UTC (permalink / raw)
To: J. Bruce Fields
Cc: Zach Brown, Jeff Moyer, Boaz Harrosh, Hannes Reinecke,
Darrick J. Wong, Chuck Lever, Ben Myers, lsf-pc, linux-fsdevel,
linux-scsi, martin.petersen, FUJITA Tomonori
On Thu, Feb 07, 2013 at 04:04:36PM -0500, J. Bruce Fields wrote:
> On Thu, Feb 07, 2013 at 09:36:39AM -0800, Joel Becker wrote:
> > Dear LSF committee,
> > I'd like to explicitly request attendance for this discussion
> > :-)
>
> http://marc.info/?l=linux-fsdevel&m=135894412908342&w=2
>
> "Also, the way I compile the list of requests is from thread
> heads ... that means don't send your attendee request as a
> reply to something else either otherwise it might get missed."
Ack. Sent as such.
Thanks,
Joel
>
> --b.
>
> >
> > Joel
> >
> > On Thu, Feb 07, 2013 at 09:27:35AM -0800, Zach Brown wrote:
> > > On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
> > > > Boaz Harrosh <bharrosh@panasas.com> writes:
> > > > >>
> > > > >> For aio we just need to add additional fields to an existing structure.
> > > > >>
> > > > >> So yeah, I'd be interested in that discussion as well.
> > > >
> > > > Sure, it's easy to start there, but then you eventually end up having to
> > > > add a non-aio interface as well. Let's not take the latter off the
> > > > table.
> > >
> > > I agree that a sync variant shouldn't be ignored, but needing a sync
> > > interface with PI arguments also shouldn't get in the way of adding
> > > support to the aio+dio path. Simply because it's what people use :/.
> > >
> > > > I'm not sure how that's directly related to aio, but ok. If we're going
> > > > to rewrite the aio code, I think Zach's acall would be a good start, at
> > > > least on the API front:
> > > > http://lwn.net/Articles/316806/
> > >
> > > Yeah, I'm happy to chat about this stuff if people are interested. I
> > > think I'd do things differently today than what was done in that aged
> > > acall prototype.
> > >
> > > - z
> >
> > --
> >
> > "You can get more with a kind word and a gun than you can with
> > a kind word alone."
> > - Al Capone
> >
> > http://www.jlbec.org/
> > jlbec@evilplan.org
--
"You look in her eyes, the music begins to play.
Hopeless romantics, here we go again."
http://www.jlbec.org/
jlbec@evilplan.org
^ permalink raw reply [flat|nested] 28+ messages in thread