* [lustre-devel] lustre and loopback device
@ 2018-03-23  3:26 James Simmons
  2018-03-26  0:16 ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: James Simmons @ 2018-03-23  3:26 UTC (permalink / raw)
  To: lustre-devel


Hi Neil

      So once, long ago, Lustre had its own loopback device because the
upstream loopback device did not support Direct I/O. Once it did, we
dropped support for our custom driver. Recently there has been interest
in using the loopback driver, and Jinshan discussed with me reviving
our custom driver, which I'm not thrilled about. He was seeing problems
with Direct I/O above 64K. Do you know the details of why that
limitation exists? Perhaps it can be resolved, or maybe we are missing
something?
Thanks for your help.


* [lustre-devel] lustre and loopback device
  2018-03-23  3:26 [lustre-devel] lustre and loopback device James Simmons
@ 2018-03-26  0:16 ` NeilBrown
  2018-03-30 19:12   ` James Simmons
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2018-03-26  0:16 UTC (permalink / raw)
  To: lustre-devel

On Fri, Mar 23 2018, James Simmons wrote:

> Hi Neil
>
>       So once long ago lustre had its own loopback device due to the 
> upstream loopback device not supporting Direct I/O. Once it did we
> dropped support for our custom driver. Recently their has been interest
> in using the loopback driver and Jinshan discussed with me about reviving
> our custom driver which I'm not thrilled about. He was seeing problems
> with Direct I/O above 64K. Do you know the details why that limitation
> exist. Perhaps it can be resolved or maybe we are missing something?
> Thanks for your help.

Hi James, and Jinshan,
 What sort of problems do you see with 64K DIO requests?
 Is it a throughput problem or are you seeing IO errors?
 Would it be easy to demonstrate the problem in a cluster
 comprising a few VMs, or is real hardware needed?  If VMs are OK,
 can you tell me exactly how to duplicate the problem?

 If loop gets a multi-bio request, it will allocate a bvec array
 to hold all the bio_vecs.  If there are more than 256 pages (1Meg)
 in a request, this could easily fail. 5 consecutive 64K requests on a
 machine without much free memory could hit problems here.
 If that is the problem, it should be easy to fix (request the number
 given to blk_queue_max_hw_sectors).
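
 For reference, the two pieces involved look roughly like this (quoted
 from memory of drivers/block/loop.c, so treat it as a sketch rather
 than exact code):

	/* lo_rw_aio(): one bio_vec per page of the request, allocated
	 * up front; a large multi-bio request means a large allocation
	 * that can fail under memory pressure */
	bvec = kmalloc(sizeof(struct bio_vec) * nr_bvec, GFP_NOIO);
	if (!bvec)
		return -EIO;

	/* and the sort of fix I mean: cap the request size when the
	 * queue is set up, so the array above stays bounded
	 * (256 pages here is an illustrative value) */
	blk_queue_max_hw_sectors(lo->lo_queue, 256 << (PAGE_SHIFT - 9));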

Thanks,
NeilBrown

* [lustre-devel] lustre and loopback device
  2018-03-26  0:16 ` NeilBrown
@ 2018-03-30 19:12   ` James Simmons
  2018-03-30 20:16     ` Jinshan Xiong
  0 siblings, 1 reply; 12+ messages in thread
From: James Simmons @ 2018-03-30 19:12 UTC (permalink / raw)
  To: lustre-devel


> On Fri, Mar 23 2018, James Simmons wrote:
> 
> > Hi Neil
> >
> >       So once long ago lustre had its own loopback device due to the 
> > upstream loopback device not supporting Direct I/O. Once it did we
> > dropped support for our custom driver. Recently their has been interest
> > in using the loopback driver and Jinshan discussed with me about reviving
> > our custom driver which I'm not thrilled about. He was seeing problems
> > with Direct I/O above 64K. Do you know the details why that limitation
> > exist. Perhaps it can be resolved or maybe we are missing something?
> > Thanks for your help.
> 
> Hi James, and Jinshan,
>  What sort of problems do you see with 64K DIO requests?
>  Is it a throughput problem or are you seeing IO errors?
>  Would it be easy to demonstrate the problem in a cluster
>  comprising a few VMs, or is real hardware needed?  If VMs are OK,
>  can you tell me exactly how to duplicate the problem?
> 
>  If loop gets a multi-bio request, it will allocate a bvec array
>  to hold all the bio_vecs.  If there are more than 256 pages (1Meg)
>  in a request, this could easily fail. 5 consecutive 64K requests on a
>  machine without much free memory could hit problems here.
>  If that is the problem, it should be easy to fix (request the number
>  given to blk_queue_max_hw_sectors).

Jinshan, can you post a reproducer so we can see the problem?


* [lustre-devel] lustre and loopback device
  2018-03-30 19:12   ` James Simmons
@ 2018-03-30 20:16     ` Jinshan Xiong
  2018-04-02 19:43       ` Dilger, Andreas
  2018-05-22 22:55       ` NeilBrown
  0 siblings, 2 replies; 12+ messages in thread
From: Jinshan Xiong @ 2018-03-30 20:16 UTC (permalink / raw)
  To: lustre-devel

+ Andreas.

A few problems:
1. The Linux loop device won't work on top of Lustre in direct I/O mode,
because Lustre direct I/O has to be pagesize aligned and there seems to
be no way to change the sector size of a Linux loop device to the
pagesize;
2. 64KB is not an optimal RPC size for Lustre, so yes, eventually we are
going to see throughput issues if the RPC size is limited to 64KB;
3. It's hard to do further I/O optimization with the Linux loop device.
With direct I/O, by default it has to wait for the current I/O to
complete before it can send the next one. This is not good. I have
revised the llite_lloop driver so that it can do async direct I/O, and
performance improves significantly.

I tried to increase the sector size of the Linux loop device, and also
max_{hw_}sectors_kb, but it didn't work. Please let me know if there is
a way to do that.
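
(For concreteness, the knobs I mean are the ones below -- the device
name and values are only illustrative:

  echo 1024 > /sys/block/loop0/queue/max_sectors_kb
  losetup --direct-io=on /dev/loop0 /mnt/lustre/backing_file

but neither changed the 64KB behaviour in my testing.)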

Thanks,
Jinshan

On Fri, Mar 30, 2018 at 12:12 PM, James Simmons <jsimmons@infradead.org>
wrote:

>
> > On Fri, Mar 23 2018, James Simmons wrote:
> >
> > > Hi Neil
> > >
> > >       So once long ago lustre had its own loopback device due to the
> > > upstream loopback device not supporting Direct I/O. Once it did we
> > > dropped support for our custom driver. Recently their has been interest
> > > in using the loopback driver and Jinshan discussed with me about
> reviving
> > > our custom driver which I'm not thrilled about. He was seeing problems
> > > with Direct I/O above 64K. Do you know the details why that limitation
> > > exist. Perhaps it can be resolved or maybe we are missing something?
> > > Thanks for your help.
> >
> > Hi James, and Jinshan,
> >  What sort of problems do you see with 64K DIO requests?
> >  Is it a throughput problem or are you seeing IO errors?
> >  Would it be easy to demonstrate the problem in a cluster
> >  comprising a few VMs, or is real hardware needed?  If VMs are OK,
> >  can you tell me exactly how to duplicate the problem?
> >
> >  If loop gets a multi-bio request, it will allocate a bvec array
> >  to hold all the bio_vecs.  If there are more than 256 pages (1Meg)
> >  in a request, this could easily fail. 5 consecutive 64K requests on a
> >  machine without much free memory could hit problems here.
> >  If that is the problem, it should be easy to fix (request the number
> >  given to blk_queue_max_hw_sectors).
>
> Jinshan can you post a reproducer so we can see the problem.
>

* [lustre-devel] lustre and loopback device
  2018-03-30 20:16     ` Jinshan Xiong
@ 2018-04-02 19:43       ` Dilger, Andreas
  2018-04-02 20:25         ` Jinshan Xiong
  2018-05-22 22:55       ` NeilBrown
  1 sibling, 1 reply; 12+ messages in thread
From: Dilger, Andreas @ 2018-04-02 19:43 UTC (permalink / raw)
  To: lustre-devel

On Mar 30, 2018, at 14:16, Jinshan Xiong <jinshan.xiong@gmail.com> wrote:
> 
> + Andreas.
> 
> A few problems:
> 1. Linux loop device won't work upon Lustre with direct IO mode because Lustre direct IO has to be pagesize aligned, and there seems no way of changing sector size to pagesize for Linux loop device;
> 2. 64KB is not an optimal RPC size for Lustre, so yes eventually we are going to see throughput issue if the RPC size is limited to 64KB;
> 3. It's hard to do I/O optimization more with Linux loop device. With direct I/O by default, it has to wait for the current I/O to complete before it can send the next one. This is not good. I have revised llite_lloop driver so that it can do async direct I/O. The performance boosts significantly by doing so.

Jinshan,
if you have a patch to implement an improved llite_lloop driver, I think it would be useful to share it.  Originally I'd hoped that the kernel loop driver would allow pluggable backends so that they could be replaced as needed, but that wasn't implemented.  I think pluggable backends would be an approach more acceptable upstream than copying the loop driver from the kernel and only changing the I/O interface.

Cheers, Andreas

> I tried to increase the sector size of Linux loop device and also max_{hw_}sectors_kb but it didn't work. Please let me know if there exists ways of doing that.
> 
> Thanks,
> Jinshan
> 
> On Fri, Mar 30, 2018 at 12:12 PM, James Simmons <jsimmons@infradead.org> wrote:
> 
> > On Fri, Mar 23 2018, James Simmons wrote:
> >
> > > Hi Neil
> > >
> > >       So once long ago lustre had its own loopback device due to the
> > > upstream loopback device not supporting Direct I/O. Once it did we
> > > dropped support for our custom driver. Recently their has been interest
> > > in using the loopback driver and Jinshan discussed with me about reviving
> > > our custom driver which I'm not thrilled about. He was seeing problems
> > > with Direct I/O above 64K. Do you know the details why that limitation
> > > exist. Perhaps it can be resolved or maybe we are missing something?
> > > Thanks for your help.
> >
> > Hi James, and Jinshan,
> >  What sort of problems do you see with 64K DIO requests?
> >  Is it a throughput problem or are you seeing IO errors?
> >  Would it be easy to demonstrate the problem in a cluster
> >  comprising a few VMs, or is real hardware needed?  If VMs are OK,
> >  can you tell me exactly how to duplicate the problem?
> >
> >  If loop gets a multi-bio request, it will allocate a bvec array
> >  to hold all the bio_vecs.  If there are more than 256 pages (1Meg)
> >  in a request, this could easily fail. 5 consecutive 64K requests on a
> >  machine without much free memory could hit problems here.
> >  If that is the problem, it should be easy to fix (request the number
> >  given to blk_queue_max_hw_sectors).
> 
> Jinshan can you post a reproducer so we can see the problem.
> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation


* [lustre-devel] lustre and loopback device
  2018-04-02 19:43       ` Dilger, Andreas
@ 2018-04-02 20:25         ` Jinshan Xiong
  2018-04-02 22:37           ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Jinshan Xiong @ 2018-04-02 20:25 UTC (permalink / raw)
  To: lustre-devel

Hi Andreas,

There is still some work going on, such as applying the changes in
LU-4198.
The other improvement is to make 'lctl blockdev' attach a loop device
read-only by default; otherwise the virtual block device would be
corrupted if there are multiple writers.

After that's done, I will be happy to push a patch for review.

Thanks,
Jinshan

On Mon, Apr 2, 2018 at 12:43 PM, Dilger, Andreas <andreas.dilger@intel.com>
wrote:

> On Mar 30, 2018, at 14:16, Jinshan Xiong <jinshan.xiong@gmail.com> wrote:
> >
> > + Andreas.
> >
> > A few problems:
> > 1. Linux loop device won't work upon Lustre with direct IO mode because
> Lustre direct IO has to be pagesize aligned, and there seems no way of
> changing sector size to pagesize for Linux loop device;
> > 2. 64KB is not an optimal RPC size for Lustre, so yes eventually we are
> going to see throughput issue if the RPC size is limited to 64KB;
> > 3. It's hard to do I/O optimization more with Linux loop device. With
> direct I/O by default, it has to wait for the current I/O to complete
> before it can send the next one. This is not good. I have revised
> llite_lloop driver so that it can do async direct I/O. The performance
> boosts significantly by doing so.
>
> Jinshan,
> if you have a patch to implement an improved llite_lloop driver, I think
> it would be useful to share it.  Originally I'd hoped that the kernel loop
> driver would allow pluggable backends so that they could be replaced as
> needed, but that wasn't implemented.  I'd think that this would be an
> approach that might be more acceptable upstream, rather than copying the
> loop driver from the kernel and only changing the IO interface.
>
> Cheers, Andreas
>
> > I tried to increase the sector size of Linux loop device and also
> max_{hw_}sectors_kb but it didn't work. Please let me know if there exists
> ways of doing that.
> >
> > Thanks,
> > Jinshan
> >
> > On Fri, Mar 30, 2018 at 12:12 PM, James Simmons <jsimmons@infradead.org>
> wrote:
> >
> > > On Fri, Mar 23 2018, James Simmons wrote:
> > >
> > > > Hi Neil
> > > >
> > > >       So once long ago lustre had its own loopback device due to the
> > > > upstream loopback device not supporting Direct I/O. Once it did we
> > > > dropped support for our custom driver. Recently their has been
> interest
> > > > in using the loopback driver and Jinshan discussed with me about
> reviving
> > > > our custom driver which I'm not thrilled about. He was seeing
> problems
> > > > with Direct I/O above 64K. Do you know the details why that
> limitation
> > > > exist. Perhaps it can be resolved or maybe we are missing something?
> > > > Thanks for your help.
> > >
> > > Hi James, and Jinshan,
> > >  What sort of problems do you see with 64K DIO requests?
> > >  Is it a throughput problem or are you seeing IO errors?
> > >  Would it be easy to demonstrate the problem in a cluster
> > >  comprising a few VMs, or is real hardware needed?  If VMs are OK,
> > >  can you tell me exactly how to duplicate the problem?
> > >
> > >  If loop gets a multi-bio request, it will allocate a bvec array
> > >  to hold all the bio_vecs.  If there are more than 256 pages (1Meg)
> > >  in a request, this could easily fail. 5 consecutive 64K requests on a
> > >  machine without much free memory could hit problems here.
> > >  If that is the problem, it should be easy to fix (request the number
> > >  given to blk_queue_max_hw_sectors).
> >
> > Jinshan can you post a reproducer so we can see the problem.
> >
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation

* [lustre-devel] lustre and loopback device
  2018-04-02 20:25         ` Jinshan Xiong
@ 2018-04-02 22:37           ` NeilBrown
  2018-04-03  0:03             ` Jinshan Xiong
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2018-04-02 22:37 UTC (permalink / raw)
  To: lustre-devel

On Mon, Apr 02 2018, Jinshan Xiong wrote:

> Hi Andreas,
>
> There are still some more work going on, like to apply the changes in
> LU-4198.
> The other improvement is to make 'lctl blockdev' attach a loop device to
> readonly by default, otherwise the virtual block device would be corrupted
> if there exist multiple writers.
>
> After that's done, I will be happy to push a patch for review.

If you just posted it now - even though it isn't perfect yet - I could
read it, understand what the problem is that it is trying to fix, and start
looking at how to improve drivers/block/loop.c so that your patch isn't
necessary.

NeilBrown

* [lustre-devel] lustre and loopback device
  2018-04-02 22:37           ` NeilBrown
@ 2018-04-03  0:03             ` Jinshan Xiong
  2018-05-22 23:31               ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Jinshan Xiong @ 2018-04-03  0:03 UTC (permalink / raw)
  To: lustre-devel

Hi Neil,

Sure. Patches are attached for your reference.

The first patch brings the llite_lloop driver back; the second fixes
some bugs, and the third adds async I/O. The patches are based on
2.7.21, but I don't think it would be difficult to port them to master.
Anyway, they are just for your reference.

This is work in progress; please don't use it in production.

Thanks,
Jinshan





On Mon, Apr 2, 2018 at 3:37 PM, NeilBrown <neilb@suse.com> wrote:

> On Mon, Apr 02 2018, Jinshan Xiong wrote:
>
> > Hi Andreas,
> >
> > There are still some more work going on, like to apply the changes in
> > LU-4198.
> > The other improvement is to make 'lctl blockdev' attach a loop device to
> > readonly by default, otherwise the virtual block device would be
> corrupted
> > if there exist multiple writers.
> >
> > After that's done, I will be happy to push a patch for review.
>
> If you just posted it now - even though it isn't perfect yet - I could
> read it, understand what the problem is that it is trying to fix, and start
> looking at how to improve drivers/block/loop.c so that your patch isn't
> necessary.
>
> NeilBrown
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Revert-LU-8844-llite-delete-lloop.patch
Type: application/octet-stream
Size: 43945 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180402/6f56ab16/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-cleanup-and-bugfix.patch
Type: application/octet-stream
Size: 21050 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180402/6f56ab16/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-async-IO.patch
Type: application/octet-stream
Size: 14128 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180402/6f56ab16/attachment-0005.obj>


* [lustre-devel] lustre and loopback device
  2018-03-30 20:16     ` Jinshan Xiong
  2018-04-02 19:43       ` Dilger, Andreas
@ 2018-05-22 22:55       ` NeilBrown
  2018-05-23 21:03         ` Jinshan Xiong
  1 sibling, 1 reply; 12+ messages in thread
From: NeilBrown @ 2018-05-22 22:55 UTC (permalink / raw)
  To: lustre-devel

On Fri, Mar 30 2018, Jinshan Xiong wrote:

> + Andreas.
>
> A few problems:

Sorry that it has been 7 weeks, but I've finally scheduled time to
have a proper look at this.

> 1. Linux loop device won't work upon Lustre with direct IO mode because
> Lustre direct IO has to be pagesize aligned, and there seems no way of
> changing sector size to pagesize for Linux loop device;

The sector size for a loop device can be set with the --sector-size
argument to losetup (or the LOOP_SET_BLOCK_SIZE ioctl).  This is done
from user-space, not from within the Lustre module, of course.
open(O_DIRECT) is documented as having size/alignment restrictions,
so I think a good case could be made to change the handling of
"losetup --raw" to imply a "--sector-size" setting, if we could
determine an appropriate size automatically.
The XFS_IOC_DIOINFO ioctl (see man xfsctl) can be used to ask a
filesystem about alignment requirements, but is currently only
supported for XFS.  If we added support to lustre, and asked util-linux
to use it to help configure a loop device, I suspect we could get
success.

There would probably be a request to hoist the ioctl out of xfs and add
it to VFS. I cannot predict how that would go, but I think it would be
good to pursue this approach.

You could try it in your own testing by using
  losetup -r --sector-size=4096 /dev/loopX filename
to create a loop device.
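
 The same thing can be done programmatically; a minimal sketch, assuming
 a kernel new enough to have LOOP_SET_BLOCK_SIZE, with error handling
 omitted:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	int fd = open("/dev/loop0", O_RDWR);
	/* same effect as "losetup --sector-size=4096" on an attached device */
	ioctl(fd, LOOP_SET_BLOCK_SIZE, 4096);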

> 2. 64KB is not an optimal RPC size for Lustre, so yes eventually we are
> going to see throughput issue if the RPC size is limited to 64KB;

So let's find out where the 64KB limit is imposed, and raise it.
Maybe it comes from
	lo->tag_set.queue_depth = 128;
combined with the default sector size of 512 (128 x 512 bytes = 64KB).
If so, then increasing the sector size to 4K should raise the RPC
size to 512K (128 x 4KB).

> 3. It's hard to do I/O optimization more with Linux loop device. With
> direct I/O by default, it has to wait for the current I/O to complete
> before it can send the next one. This is not good. I have revised
> llite_lloop driver so that it can do async direct I/O. The performance
> boosts significantly by doing so.

This surprises me.  Looking at the code in loop.c, I see a field
->use_aio which is set when direct I/O is used (->use_dio), except
for FLUSH, DISCARD and WRITE_ZEROES.
->use_dio is disabled if the filesystem has a block device
(->i_sb->s_bdev != NULL) and the alignment doesn't match, but that
wouldn't apply to Lustre.
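
 For reference, the logic is roughly the following (paraphrased from
 memory of loop_queue_rq(), so only a sketch):

	switch (req_op(rq)) {
	case REQ_OP_FLUSH:
	case REQ_OP_DISCARD:
	case REQ_OP_WRITE_ZEROES:
		cmd->use_aio = false;		/* handled synchronously */
		break;
	default:
		cmd->use_aio = lo->use_dio;	/* DIO requests go async */
		break;
	}
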
The loop driver gained AIO support in Linux 4.4.  What kernel version
were you looking at?

>
> I tried to increase the sector size of Linux loop device and also
> max_{hw_}sectors_kb but it didn't work. Please let me know if there exists
> ways of doing that.

If the --sector-size option to losetup doesn't work, we will have to
make it work.

Thanks,
NeilBrown

* [lustre-devel] lustre and loopback device
  2018-04-03  0:03             ` Jinshan Xiong
@ 2018-05-22 23:31               ` NeilBrown
  2018-05-23 21:08                 ` Jinshan Xiong
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2018-05-22 23:31 UTC (permalink / raw)
  To: lustre-devel

On Mon, Apr 02 2018, Jinshan Xiong wrote:

> Hi Neil,
>
> Sure. Patches are attached for your reference.
>
> The first patch is to bring llite_lloop driver back; the 2nd fixes some
> bugs and the 3rd one adds async I/O. The patches are based on 2.7.21, but I
> don't think it would be difficult to port them to master. Anyway, it's just
> for your reference.
>
> This is a piece of work in progress, please don't use it for production.

Thanks,
just one quick comment at this stage:


>  .PP
> +.SS Virtual Block Device Operation
> +Lustre is able to emulate a virtual block device upon regular file. It is necessary to be used when you are trying to setup a swap space via file.

We should fix this properly.  Creating a loop device just to provide
swap is not the best approach.
The preferred approach for swapping to a networked filesystem can be
seen by examining the swap_activate address_space_operation in nfs.
If a file passed to swapon has a swap_activate operation, it will be
called, and then ->readpage will be used to read from swap and
->direct_IO will be used to write.

swap_activate needs to ensure that the direct_IO calls will never block
waiting for memory allocation.
For NFS, all it does is call sk_set_memalloc() on all the network
sockets that might be used.  This allows TCP etc. to use the reserve
memory pools.
Lustre might need to pre-allocate other things, or make use of
PF_MEMALLOC in other contexts; I don't know.
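
 To make the shape concrete: nfs does roughly the following, and a
 Lustre equivalent would be a hypothetical ll_swap_activate wired into
 the existing address_space_operations (a sketch of the idea only, not
 working code):

	static int ll_swap_activate(struct swap_info_struct *sis,
				    struct file *file, sector_t *span)
	{
		*span = sis->pages;
		/* nfs marks its transport sockets with sk_set_memalloc()
		 * here so writeback can dip into the reserve pools;
		 * lustre would need to do the ptlrpc/LNet equivalent,
		 * whatever that turns out to be. */
		return 0;
	}

	/* alongside the existing ->readpage and ->direct_IO */
	.swap_activate	= ll_swap_activate,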

Thanks,
NeilBrown

* [lustre-devel] lustre and loopback device
  2018-05-22 22:55       ` NeilBrown
@ 2018-05-23 21:03         ` Jinshan Xiong
  0 siblings, 0 replies; 12+ messages in thread
From: Jinshan Xiong @ 2018-05-23 21:03 UTC (permalink / raw)
  To: lustre-devel

It turned out that I was looking at a 4.11.x kernel, which only has
'--direct-io'; '--sector-size' was not supported there.

With the latest kernel we can do AIO+DIO, so no kernel changes are
necessary. Patch https://review.whamcloud.com/32416 attempts to
accomplish that. Since AIO is used, the I/O size from the loop device is
no longer important, because the I/Os will be merged at the OSC layer.

If everything works out eventually, we can set up a loop device over
Lustre like this:
  losetup --direct-io -f --sector-size=4096 <lustre_reg_file>

Thanks,
Jinshan

On Tue, May 22, 2018 at 3:55 PM, NeilBrown <neilb@suse.com> wrote:

> On Fri, Mar 30 2018, Jinshan Xiong wrote:
>
> > + Andreas.
> >
> > A few problems:
>
> Sorry that it has been 7 weeks, but I've finally scheduled time to
> have a proper look at this.
>
> > 1. Linux loop device won't work upon Lustre with direct IO mode because
> > Lustre direct IO has to be pagesize aligned, and there seems no way of
> > changing sector size to pagesize for Linux loop device;
>
> The sector size for a loop device can be set with the --sector-size
> argument to losetup (or the LOOP_SET_BLOCK_SIZE ioctl).  This is done
> from user-space, not from in the  lustre module of course.
> open(O_DIRECT) is documented as having size/alignment restrictions,
> so I think a good case could be made to change the handling of
> "losetup --raw" to imply a "--sector-size" setting, if we could
> determine an appropriate size automatically.
> The XFS_IOC_DIOINFO ioctl (see man xfsctl) can be used to ask a
> filesystem about alignment requirements, but is currently only
> supported for XFS.  If we added support to lustre, and asked util-linux
> to use it to help configure a loop device, I suspect we could get
> success.
>
> There would probably be a request to hoist the ioctl out of xfs and add
> it to VFS. I cannot predict how that would go, but I think it would be
> good to pursue this approach.
>
> You could try it in you own testing by using
>   losetup -r --sector=size=4096 /dev/loopX  filename
> to create a loop device.
>
> > 2. 64KB is not an optimal RPC size for Lustre, so yes eventually we are
> > going to see throughput issue if the RPC size is limited to 64KB;
>
> So let's find out where the 64KB limit is imposed, and raise it.
> Maybe it comes from
>         lo->tag_set.queue_depth = 128;
> combined with the default sector size of 512.
> If so, then increasing the sector size to 4K should raise the RPC
> size 512K.
>
> > 3. It's hard to do I/O optimization more with Linux loop device. With
> > direct I/O by default, it has to wait for the current I/O to complete
> > before it can send the next one. This is not good. I have revised
> > llite_lloop driver so that it can do async direct I/O. The performance
> > boosts significantly by doing so.
>
> This surprises me.  Looking at the code in loop.c, I see a field
> ->use_aio which is set when direct_io is used (->use_dio), except
> for FLUSH DISCARD and WRITE_ZEROES.
> ->use_dio is disabled if the filesystem has a block device
> (->i_sb->s_bdev != NULL) and alignment doesn't match, but that
> wouldn't apply to lustre.
> Linux gained aio in loop in Linux 4.4.  What kernel version were you
> looking at?
>
> >
> > I tried to increase the sector size of Linux loop device and also
> > max_{hw_}sectors_kb but it didn't work. Please let me know if there
> exists
> > ways of doing that.
>
> If --sector-size option to losetup doesn't work, we will have to make it
> work.
>
> Thanks,
> NeilBrown
>

* [lustre-devel] lustre and loopback device
  2018-05-22 23:31               ` NeilBrown
@ 2018-05-23 21:08                 ` Jinshan Xiong
  0 siblings, 0 replies; 12+ messages in thread
From: Jinshan Xiong @ 2018-05-23 21:08 UTC (permalink / raw)
  To: lustre-devel

See my reply inline below.

On Tue, May 22, 2018 at 4:31 PM, NeilBrown <neilb@suse.com> wrote:

> On Mon, Apr 02 2018, Jinshan Xiong wrote:
>
> > Hi Neil,
> >
> > Sure. Patches are attached for your reference.
> >
> > The first patch is to bring llite_lloop driver back; the 2nd fixes some
> > bugs and the 3rd one adds async I/O. The patches are based on 2.7.21,
> but I
> > don't think it would be difficult to port them to master. Anyway, it's
> just
> > for your reference.
> >
> > This is a piece of work in progress, please don't use it for production.
>
> Thanks,
> just one quick comment at this stage:
>
>
> >  .PP
> > +.SS Virtual Block Device Operation
> > +Lustre is able to emulate a virtual block device upon regular file. It
> is necessary to be used when you are trying to setup a swap space via file.
>
> We should fix this properly.  Creating a loop device just to provide
> swap is not the best approach.
> The preferred approach for swapping to a networked filesystem can be
> seen by examining the swap_activate address_space_operation in nfs.
> If a file passed to swap_on has a swap_activate operation, it will be
> called and then ->readpage will be used to read from swap, and
> ->direct_IO will be used to write.
>
> swap_activate needs to ensure that the direct_IO calls will never block
> waiting for memory allocation.
> For NFS, all that it does is calls sk_set_memalloc() on all network
> sockets that might be used.  This allows TCP etc to use the reserve
> memory pools.
> Lustre might need to pre-allocate other things, or make use PF_MEMALLOC
> in other contexts, I don't know.
>

That was a major problem when I worked on the loopback device initially.
Lustre allocates memory in too many places on the path that writes data
to the OSTs, so it would take a huge effort to reserve memory on the
writeback path.

Jinshan


>
> Thanks,
> NeilBrown
>
