* [lustre-devel] Design proposal for client-side compression
@ 2017-01-09 13:07 Anna Fuchs
  2017-01-09 18:05 ` Xiong, Jinshan
  0 siblings, 1 reply; 27+ messages in thread
From: Anna Fuchs @ 2017-01-09 13:07 UTC (permalink / raw)
  To: lustre-devel

Dear all,

a couple of months ago we started the IPCC-L project on compression
within Lustre [0]. Currently we are focusing on client-side compression,
and I would like to present our plans to you and discuss them. Any
comments are very welcome.

General design:

The feature will introduce transparent compression within the Lustre
filesystem, on the client side now and on the server side in the future.
Due to the existing infrastructure for compressed blocks within the ZFS
backend filesystem, only ZFS will be supported at first. ldiskfs as a
backend is not ruled out in principle, but it requires wide-ranging
changes to its infrastructure, which might follow once the workflow is
proven. All communication between the MDS and any other components
remains uncompressed. That means metadata will not be compressed at
any time.

The client will compress the data per stripe, where every stripe is
divided into chunks based on the ZFS record size. Those chunks can be
compressed independently and in parallel.
To be able to decompress the data later, we need to store the algorithm
type and the original chunk size. We want to store them per chunk/record
for several reasons: when decompressing, we need to know the required
buffer size for the uncompressed piece of data, and later it will also
be possible to have different chunk sizes within one stripe and to use
a different algorithm for every chunk. Storing that additional
metadata is up to ZFS and will not affect the MDS.
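
For illustration, a minimal sketch of the per-chunk metadata this
implies (the struct and field names are hypothetical, not actual code):

#include <stdint.h>

/* hypothetical metadata kept for each compressed chunk/record */
struct chunk_comp_meta {
        uint32_t ccm_lsize;     /* original (uncompressed) chunk size */
        uint32_t ccm_alg;       /* compression algorithm identifier */
};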

The compressed stripes, including that metadata, will be sent to the
Lustre server within the RPC. The server, at first, will just pass the
data through to ZFS. ZFS, for its part, will make use of its internal
data structures, which already handle compressed blocks. The
pre-compressed blocks will be stored within the trees as if they had
been compressed by ZFS itself. We expect ZFS to then achieve better
read-ahead performance than if the data were stored as ordinary data
blocks (which would produce "holes" within the file). Since ZFS knows
the original data bounds, it can iterate over the records as if they
were logically contiguous.

Implementation details:

We need to make changes to the Lustre client, the server, the RPC
protocol and the ZFS backend.

Client:

Our idea is to introduce the changes close to GSS within the PtlRPC
layer.
In the future, the infrastructure might also be reused for client-side
encryption. All the infrastructure changes are independent of specific
compression algorithms, except for the requirement that the data size
does not grow. It will be possible to change the algorithms; any
missing libraries would be deployed and built together with the Lustre
code if the specific kernel does not provide them.
There is a wrapping layer sptlrpc (standing for security ptlrpc?).
Analogously, we could introduce a cptlrpc (for compression) or just put
both together in a tptlrpc (transform) layer.

We would also like to reuse the bd_enc_vec structures for compressed
bulk data. What do you think about that?

RPC:

We would extend the niobuf_remote structure with the logical size and
the algorithm type. Each compressed record would be a separate niobuf.
We first had the idea to reuse the free bits of the flavor flag to
encode the algorithm type, but we need it per niobuf, whereas the
flavor is per RPC. The read/write RPC with the extended niobufs arrives
at the server, which can then take the data in the correct amount and
pass it through to ZFS.
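
To make the idea concrete, a rough sketch of such an extended niobuf,
assuming the existing offset/length/flags fields; the two added fields
and their names are hypothetical:

struct niobuf_remote_comp {
        __u64 rnb_offset;       /* file offset of this buffer */
        __u32 rnb_len;          /* physical (compressed) length on the wire */
        __u32 rnb_flags;
        __u32 rnb_comp_lsize;   /* added: logical (uncompressed) length */
        __u32 rnb_comp_alg;     /* added: compression algorithm type */
};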


ZFS:

Newest ZFS features include compressed send and receive operations to
stream data in compressed form and save CPU efforts. During the course
of these changes, the ZFS-to-Lustre interface will be extended by
additional parameters lsize and the algorithm type needed for
decompression. lsize is the logical size which is the original,
uncompressed, user written data size. In contrast the physical size is
the actual compressed data size.
The current interface will coexist and is not going to be fully
exchanged by the extended one to save the ability to write/read data
unaffected by compression. Those changes affect at least the two I/O
paths "dmu_write" and "dmu_assign_arcbuf".?

At first, the use of ZFS's internal compression will be skipped and
will be possible only with Lustre compression disabled. A possible
later feature, though, is to enable ZFS to decompress the data so that
one could access the data without Lustre.


[0] https://software.intel.com/articles/intel-parallel-computing-center
-at-university-of-hamburg-scientific-computing


Best regards,
Anna


--
Anna Fuchs
https://wr.informatik.uni-hamburg.de/people/anna_fuchs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-01-09 13:07 [lustre-devel] Design proposal for client-side compression Anna Fuchs
@ 2017-01-09 18:05 ` Xiong, Jinshan
  2017-01-12 12:15   ` Anna Fuchs
  0 siblings, 1 reply; 27+ messages in thread
From: Xiong, Jinshan @ 2017-01-09 18:05 UTC (permalink / raw)
  To: lustre-devel

Hi Anna,

I assume the purpose of this proposal is to fully utilize the CPU cycles on the client nodes to compress and decompress data, because there are many more client nodes than server nodes. After data is compressed, less network bandwidth is needed to transfer it to the server and write it back to storage.

There would be more changes to implement this feature:
1. I guess dmu_read() needs changes as well to transfer compressed data back to the client, otherwise how would it improve readahead performance? Please let me know if I overlooked something;
2. read-modify-write on client chunks - if only a partial chunk is modified on the client side, the OSC will have to read the chunk back, uncompress it, modify the data in the chunk, and compress it again to get ready for write back. We may have to maintain a separate chunk cache in the OSC layer;
3. the OST should grant LDLM locks aligned with the ZFS block size, otherwise it will be very complex if the OSC has to request locks to do RMW;
4. OSD-ZFS can dynamically extend the block size based on the write pattern, so we need to disable that to accommodate this feature;
5. ZFS now supports a new feature called compressed ARC. If clients already provide compressed data, we can probably get rid of the dmu buffer and fill the ARC buffer with compressed data directly, but I don't know how much work that would need on the ZFS side.

Thanks,
Jinshan

> On Jan 9, 2017, at 5:07 AM, Anna Fuchs <anna.fuchs@informatik.uni-hamburg.de> wrote:
> 
> Dear all, 
> 
> a couple of months ago we started the IPCC-L project about compression
> within Lustre [0]. Currently we are focusing on client-side compression
> and I would like to present you our plans and discuss them. Any
> comments are very welcome.
> 
> General design: 
> 
> The feature will introduce transparent compression within the Lustre
> filesystem on the client and server side (in the future).
> Due to existing infrastructure for compressed blocks within the ZFS
> backend filesystem, at first, only ZFS will be supported. ldiskfs as
> backend is not principally discarded, though it requires wide-ranging
> changes of its infrastructure which might follow once the workflow is
> proven. All communication between the MDS and any other components
> remains uncompressed. That means, metadata will not be compressed at
> any time.
> 
> The client will compress the data per stripe, while every stripe is
> divided into chunks based on the ZFS record size. Those chunks can be
> compressed independently and in parallel.
> To be able to decompress the data later we need to store the algorithm
> type and the original chunk size. We want to store it per chunk/record
> for several reasons. When decompressing, we need to know the required
> buffer size for the uncompressed piece of data. Moreover, later it will
> be possible to have different chunk sizes within one stripe and use
> different algorithms for every chunk. The storing of that additional
> metadata is up to ZFS and will not affect the MDS. 
> 
> The compressed stripes will go to the Lustre server including the
> metadata within the RPC. The server, at first, will just pass the data
> to ZFS. ZFS for its part will make use of its internal data structure,
> which already handles compressed blocks. The pre-compressed blocks will
> be stored within the trees like they have been compressed by ZFS
> itself. We expect ZFS then to achieve better read-ahead performance as
> if storing the data like common data blocks (which would produce
> "holes" within the file). Since ZFS has knowledge about the original
> data bounds, it can iterate over the records like they were logically
> contiguous.
> 
> Implementation details: 
> 
> We need to make changes on the Lustre client, server, RPC protocol and
> ZFS backend.
> 
> Client: 
> 
> We thought to introduce the changes close to GSS within the PtlRPC
> layer.
> In the future, the infrastructure might be reused for client-side
> encryption also. All the infrastructure changes are independent from
> specific compression algorithms, except for the requirement that the
> data size does not grow. It will be possible to change the algorithms;
> the missing libraries would be deployed and built together with Lustre
> code if the specific kernel does not support them. 
> There is a wrapping layer sptlrpc (standing for security ptlrpc?).
> Analogously, we could introduce a cptlrpc (for compression) or just put
> both together in a tptlrpc (transform) layer.
> 
> We would also like to reuse the bd_enc_vec structures for compressed
> bulk data. What do you think about that?
> 
> RPC: 
> 
> We would extend the niobuf_remote structure with the logical size and
> the algorithm type. Each compressed record would be a separate niobuf.
> We first had the idea to reuse the free bits of the flavor flag to mask
> the algorithm type, though we need it per niobuf, but the flavor is per
> RPC. The read/write RPC with patched niobufs arrives at the server and
> it can then get the data in the correct amount to pass it through to
> ZFS. 
> 
> 
> ZFS:
> 
> Newest ZFS features include compressed send and receive operations to
> stream data in compressed form and save CPU efforts. During the course
> of these changes, the ZFS-to-Lustre interface will be extended by
> additional parameters lsize and the algorithm type needed for
> decompression. lsize is the logical size which is the original,
> uncompressed, user written data size. In contrast the physical size is
> the actual compressed data size.
> The current interface will coexist and is not going to be fully
> exchanged by the extended one to save the ability to write/read data
> unaffected by compression. Those changes affect at least the two I/O
> paths "dmu_write" and "dmu_assign_arcbuf". 
> 
> At first, the use of ZFS's internal compression will be skipped and
> possible only with disabled Lustre compression. Though a possible
> feature is to enable ZFS to decompress the data so that one could
> access data without Lustre.
> 
> 
> [0] https://software.intel.com/articles/intel-parallel-computing-center
> -at-university-of-hamburg-scientific-computing
> 
> 
> Best regards,
> Anna
> 
> 
> -- 
> Anna Fuchs
> https://wr.informatik.uni-hamburg.de/people/anna_fuchs
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-01-09 18:05 ` Xiong, Jinshan
@ 2017-01-12 12:15   ` Anna Fuchs
  2017-01-17 19:51     ` Xiong, Jinshan
  0 siblings, 1 reply; 27+ messages in thread
From: Anna Fuchs @ 2017-01-12 12:15 UTC (permalink / raw)
  To: lustre-devel

Hello all,

thank you for the responses.


Jinshan,

> 
> I assume the purpose of this proposal is to fully utilize the CPU
> cycles on the client nodes to compress and decompress data, because
> there are much more client nodes than server nodes. After data is
> compressed, it will need less network bandwidth to transfer it to
> server and write them back to storage.

Yes, that is our goal for the moment.

> 
> There would be more changes to implement this feature:
> 1. I guess dmu_read() needs change as well to transfer compressed
> data back to client, otherwise how it would improve readahead
> performance. Please let me know if I overlooked something;

Sure, I might have shortened my explanation too much. The read path
will be adapted to provide compressed data and the record-wise
"metadata" back to the client. The client will then decompress it.

> 2. read-modify-write on client chunks - if only partial chunk is
> modified on the client side, the OSC will have to read the chunk
> back, uncompress it, and modify the data in chunk, and compress it
> again to get ready for write back. We may have to maintain a separate
> chunk cache on the OSC layer;

We keep the RMW problem in mind and will definitely need to work on
optimization once the basic functionality is done. By compressing
only sub-stripes (record size), we already hope to reduce the
performance loss, since we no longer need to transfer and decompress
the whole stripe.
We would want to keep the compressed data within bd_enc_vec and the
uncompressed data in the normal vector. The space for that vector is
allocated in sptlrpc_enc_pool_get_pages. Are those pages not cached?
Could you give me some hints for the approach and what to look at? Is
it the right place at all?
The naive prototype I am currently working on is very memory
intensive anyway (additional buffers, many copies). There is much work
for me to do before I can dive into optimizations...


> 3. the OST should grant LDLM lock to align with ZFS block size
> otherwise it will be very complex if the OSC has to request locks to
> do RMW;

I am not very familiar with the locking in Lustre yet.
You mean that once we want to modify part of the data on an OST, we
want to have a lock for the complete chunk (record), right? Currently
Lustre can do byte-range locks; instead we would want record-range
locks in this case?


> 4. OSD-ZFS can dynamically extend the block size by the write
> pattern, so we need to disable it to accommodate this feature;

We thought to set the sizes from Lustre (client or later server) and
force ZFS to use them. ZFS itself will not be able to change any
layouts.

Matt,

> 
> > a possible feature is to enable ZFS to decompress the data
> 
> I would recommend that you plan to integrate this compression with
> ZFS from the beginning, by using compression formats that ZFS already
> supports (e.g. lz4), or by adding support in ZFS for the algorithm
> you will use for Lustre. This will provide better flexibility and
> compatibility.

We are currently experimenting with lz4 fast, which our students are
trying to submit to the Linux kernel. The ZFS patch for that will
hopefully follow soon. We thought it would be nice to have the
opportunity to use some brand-new algorithms on the client within
Lustre even if they are not yet supported by ZFS. Still, it is great
that the ZFS community is open to integrating new features, so we
could probably match our needs completely.

> 
> Also, I agree with what Jinshan said below. Assuming that you want
> to do compressed read as well, you will need to add a compressed read
> function to the DMU. For compressed send/receive we only added
> compressed write to the DMU, because zfs send reads directly from the
> ARC (which can do compressed read).

We are working on it right now; the functionality should be similar to
the write case, or am I missing some fundamental issues?


Best regards,
Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-01-12 12:15   ` Anna Fuchs
@ 2017-01-17 19:51     ` Xiong, Jinshan
  2017-01-18 14:19       ` Anna Fuchs
  0 siblings, 1 reply; 27+ messages in thread
From: Xiong, Jinshan @ 2017-01-17 19:51 UTC (permalink / raw)
  To: lustre-devel

Hi Anna,

Please see inserted lines.

On Jan 12, 2017, at 4:15 AM, Anna Fuchs <anna.fuchs at informatik.uni-hamburg.de<mailto:anna.fuchs@informatik.uni-hamburg.de>> wrote:

Hello all,

thank you for the responses.


Jinshan,


I assume the purpose of this proposal is to fully utilize the CPU
cycles on the client nodes to compress and decompress data, because
there are much more client nodes than server nodes. After data is
compressed, it will need less network bandwidth to transfer it to
server and write them back to storage.

Yes, that is our goal for the moment.

cool. This should be a good approach.



There would be more changes to implement this feature:
1. I guess dmu_read() needs change as well to transfer compressed
data back to client, otherwise how it would improve readahead
performance. Please let me know if I overlooked something;

Sure, I might have shortened my explanation too much. The read path
will be affected for providing compressed data and record-wise
"metadata" back to the client. The client will then decompress it.

2. read-modify-write on client chunks - if only partial chunk is
modified on the client side, the OSC will have to read the chunk
back, uncompress it, and modify the data in chunk, and compress it
again to get ready for write back. We may have to maintain a separate
chunk cache on the OSC layer;

We keep the rmw problem in mind and will definitely need to work on
optimization once the basic functionality is done. When compressing
only sub-stripes (record size), we already hope to reduce the
performance loss since we do not need to transfer and decompress the
whole stripe anymore.
We would want to keep the compressed data within bd_enc_vec and
uncompressed in the normal vector. The space for that vector is
allocated in sptlrpc_enc_pool_get_pages. Are those not cached? Could
you give me some hints for the approach and what to look at? Is it a
right place at all?

I don't think the sptlrpc page pool will cache any data. However, the sptlrpc layer is the right place to do compression; you just extend sptlrpc with a new flavor.

With that being said, we're going to have two options to support partial block writes:

1. In the OSC I/O engine, it only submits ZFS-block-size-aligned plain data to the ptlrpc layer and does compression in the new sptlrpc flavor. When partial blocks are written, the OSC will have to issue a read RPC if the corresponding data belonging to the same block is not cached;

2. Or we can just disable this optimization, which means plain data will be issued to the server for partial block writes. Compression is only done for full blocks.

I feel option 2 would be much simpler, but it puts some requirements on the workload to take full advantage, e.g. applications writing bulk, sequential data.
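
A minimal sketch of the decision option 2 implies (the helper name and
the chunk_size parameter are assumptions for illustration): only
extents covering a whole, aligned chunk are compressed, everything
else is sent as plain data.

#include <stdbool.h>
#include <stdint.h>

/* true if [start, end) covers exactly one aligned chunk of chunk_size bytes */
static bool chunk_is_compressible(uint64_t start, uint64_t end,
                                  uint64_t chunk_size)
{
        return (start % chunk_size) == 0 && (end - start) == chunk_size;
}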

Though, the naive prototype I am currently working on is very memory
intensive anyway (additional buffers, many copies). There is much work
for me until I can dive into optimizations...


3. the OST should grant LDLM lock to align with ZFS block size
otherwise it will be very complex if the OSC has to request locks to
do RMW;

I am not very familiar with the locking in Lustre yet.
You mean, once we want to modify part of the data on OST, we want to
have a lock for the complete chunk (record), right? Currently, Lustre
can do byte-range locks, instead we wanted record-ranged in this case?

Right now Lustre aligns the BRW lock to the page size on the client side. Please check the code and comments in the function ldlm_extent_internal_policy_fixup(). Since the client doesn't provide the page size to the server explicitly, the code just guesses it from req_end.

In the new code with this feature supported, the LDLM lock should be aligned to MAX(zfs_block_size, req_align).



4. OSD-ZFS can dynamically extend the block size by the write
pattern, so we need to disable it to accommodate this feature;

We thought to set the sizes from Lustre (client or later server) and
force ZFS to use them. ZFS itself will not be able to change any
layouts.

Sounds good to me. There is a work in progress to support setting block size from client side in LU-8591.


Matt,


a possible feature is to enable ZFS to decompress the data

I would recommend that you plan to integrate this compression with
ZFS from the beginning, by using compression formats that ZFS already
supports (e.g. lz4), or by adding support in ZFS for the algorithm
you will use for Lustre.  This will provide better flexibility and
compatibility.

We currently experiment with lz4 fast, which our students try to submit
to the linux kernel. The ZFS patch for that will hopefully follow soon.
We thought it would be nice to have the opportunity to use some brand
new algorithms on the client within Lustre even if they are not yet
supported by ZFS. Though it is great that the ZFS community is open to
integrate new features so we probably could completely match our needs.

That'll be cool. I think the ZFS community should be open to accepting a new algorithm. Please make a patch and submit it to https://github.com/zfsonlinux/zfs/issues



Also, I agree with what Jinshan said below.  Assuming that you want
to do compressed read as well, you will need to add a compressed read
function to the DMU.  For compressed send/receive we only added
compressed write to the DMU, because zfs send reads directly from the
ARC (which can do compressed read).

We are working on it right now, the functionality should be similar to
the write case or do I miss some fundamental issues?

It should be similar to the write case, i.e., bypassing the dmu buffer layer.

Jinshan



Best regards,
Anna

_______________________________________________
lustre-devel mailing list
lustre-devel at lists.lustre.org<mailto:lustre-devel@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20170117/ba32431b/attachment-0001.htm>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-01-17 19:51     ` Xiong, Jinshan
@ 2017-01-18 14:19       ` Anna Fuchs
  2017-02-16 14:15         ` Anna Fuchs
  0 siblings, 1 reply; 27+ messages in thread
From: Anna Fuchs @ 2017-01-18 14:19 UTC (permalink / raw)
  To: lustre-devel

Hello,

thanks again.

> I don't think the sptlrpc page pool will cache any data.

Ok, I will look at the caching within OSC in more detail.

> However, the sptlrpc layer is the right place to do compression;
> you just extend sptlrpc with a new flavor.

Well, during my master's thesis I already tried to introduce
compression in the form of a new flavor for GSS. Unfortunately, I had
huge problems getting GSS to work at all (late 2.7 or early 2.8
versions). The plain and null flavors for testing didn't work, and any
other approach required Kerberos (which I also couldn't get to work :( ).
Probably I just missed something, but it led me to the idea of making
it close to, yet independent from, GSS.

As far as I understand, currently only the RPC but not the data is
handled by GSS on the client side?
And from what I have seen, when using GSS, the number of niobufs within
the RPC is restricted to 1; for compression we would need more. Also,
the flavor is set per RPC, but we need it per niobuf. Though these
might be just implementation details, wouldn't you currently prefer to
keep it separate, to avoid mixing up bugs from our new feature with
changes to GSS? Currently I place my change within sptlrpc_cli_wrap_bulk
in sec.c, just before the bulks are wrapped with the GSS mechanisms.
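
Very roughly, the experiment looks like the sketch below (the helpers
compression_enabled() and osc_compress_bulk() are made up for
illustration; the sptlrpc_cli_wrap_bulk() signature is assumed from
sec.c):

int sptlrpc_cli_wrap_bulk(struct ptlrpc_request *req,
                          struct ptlrpc_bulk_desc *desc)
{
        int rc;

        if (compression_enabled(req)) {
                /* compress the plain pages into desc->bd_enc_vec */
                rc = osc_compress_bulk(desc);
                if (rc != 0)
                        return rc;
        }

        /* ... existing GSS/security wrapping of the (now compressed) bulk ... */
        return 0;
}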

> 
> With that being said, we?re going to have two options to support
> partial block write:
> 
> 1. In the OSC I/O engine, it only submits ZFS block size aligned
> plain data to ptlrpc layer and it does compress in the new flavor of
> sptlrpc. When partial blocks are written, the OSC will have to issue
> read RPC if the corresponding data belonging to the same block are
> not cached;
> 
> 2. Or we can just disable this optimization, which means plain data
> will be issued to the server for partial block writes. It only does
> compression for full blocks.
> 
> I feel the option 2 would be much simpler but needs some requirements
> to the workload to take full advantage, e.g. if applications are
> writing bulk and sequential data.

Sounds good. In my thesis I already spent some thought on the RMW
issues. In some cases it might be best to let the server decompress the
specific data chunks (which is planned for the future anyway) and skip
compression for partial writes.

> 
> Right now Lustre aligns BRW lock by the page size on the client side.
> Please check the code and comments in function
> ldlm_extent_internal_policy_fixup(). Since client doesn't provide the
> page size to server explicitly, the code just guess it by the
> req_end.
> 
> In the new code with this feature supported, the LDLM lock should be
> aligned to MAX(zfs_block_size, req_align).

> 
> Sounds good to me. There is a work in progress to support setting
> block size from client side in LU-8591.

Thanks for the hints!

> > >
> > > Also, I agree with what Jinshan said below. Assuming that you
> > > want
> > > to do compressed read as well, you will need to add a compressed
> > > read
> > > function to the DMU. For compressed send/receive we only added
> > > compressed write to the DMU, because zfs send reads directly from
> > > the
> > > ARC (which can do compressed read).
> >
> > We are working on it right now, the functionality should be similar
> > to
> > the write case or do I miss some fundamental issues?
> 
> It should be similar to write case, i.e., to bypass the layer of dmu
> buffer.

Our first approach for read is currently the following:
Once compression is enabled, for OSTs we call the modified dmu_read
with the logical (uncompressed) data size. ZFS notices that the
requested data is compressed and delivers the physical (compressed)
data size, the algorithm used, and the actual data of size psize.
Lustre gets psize together with the data; psize would be used to ensure
that the "received" amount of data, which differs from the requested
logical size, is correct.
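
Illustrative only (the function name and parameters are assumptions,
mirroring the write side), a read-side prototype along these lines:

int dmu_read_compressed(objset_t *os, uint64_t object, uint64_t offset,
                        uint64_t lsize, void *buf,
                        uint64_t *psize, enum zio_compress *alg);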

Bypass the dbufs - you mean to transfer the data directly from ARC to
the client?


> Jinshan

Best regards,
Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-01-18 14:19       ` Anna Fuchs
@ 2017-02-16 14:15         ` Anna Fuchs
  2017-02-17 19:15           ` Xiong, Jinshan
  2017-07-21 15:15           ` Anna Fuchs
  0 siblings, 2 replies; 27+ messages in thread
From: Anna Fuchs @ 2017-02-16 14:15 UTC (permalink / raw)
  To: lustre-devel

Dear all,

I would like to update you on my progress on the project.
Unfortunately, I cannot publish a complete design of the feature yet,
since it changes very much during development.

First, the work related to the client changes:

For the moment I had to discard my approach of introducing the changes
within the sptlrpc layer. Compression of the data affects, in
particular, the resulting number of pages and therefore the number and
size of niobufs, the size and structure of the descriptor and request,
the size of the bulk kiov, the checksums and, in the end, the async
arguments. Actually it affects everything that is set within the
osc_brw_prep_request function in osc_request.c. When entering the
sptlrpc layer, most of those parameters are already set and I would
need to update everything. That causes double work and requires a lot
of code duplication from the osc module.

My current dirty prototype invokes compression just at the beginning of
that function, before niocount is calculated. I need to have a separate
bunch of pages to store the compressed data so that I do not overwrite
the content of the original pages, which may be exposed to the
userspace process.
The original pages would be freed, and the compressed pages processed
for the request and finally also freed.

I also reconsidered the idea of doing compression niobuf-wise. Due to
the file layout, compression should be done record-wise. A niobuf is a
technical requirement for the pages to be contiguous, whereas a record
(e.g. 128 KB) is a logical unit. In my understanding, it can happen that
one record consists of several niobufs whenever we do not have enough
contiguous pages for a complete record. For that reason, I would like
to leave the niobuf structure as it is and introduce a record structure
on top of it. That record structure will hold the logical (uncompressed)
and physical (compressed) data sizes and the algorithm used for
compression. Initially we wanted to extend the niobuf struct by those
fields. I think that change would affect the RPC request structure very
much, since the first Lustre message fields would no longer be followed
by an array of niobufs, but by an array of records, each of which can
contain an array of niobufs.
On the server/storage side, the different niobufs must then be
associated with the same record and provided to ZFS.
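
A rough sketch of what such a record descriptor could hold (struct and
field names are hypothetical, for illustration only):

struct comp_record {
        __u32 cr_lsize;         /* logical (uncompressed) record size */
        __u32 cr_psize;         /* physical (compressed) record size */
        __u32 cr_alg;           /* compression algorithm for this record */
        __u32 cr_niobuf_count;  /* number of niobufs forming this record */
};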

Server changes:

Since we work on the Lustre/ZFS interface, we think it would be best
to let Lustre compose the header information for every record (psize
and algorithm, maybe also a checksum in the future). We will store
these values at the beginning of every record, in 4 bytes each.
Currently, when ZFS does compression itself, the compressed size is
stored only within the compressed data; some algorithms recover it when
starting decompression, and for lz4 it is stored at the beginning. With
our approach we would unify the record metadata for any algorithm, but
at the moment it would not be accessible to ZFS without changes to ZFS
structures.
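
The resulting 8-byte record header might look like this (an
illustrative sketch; the exact encoding is still open):

struct comp_record_header {
        __u32 crh_psize;        /* physical (compressed) size of the record */
        __u32 crh_alg;          /* algorithm used to compress the record */
} __attribute__((packed));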

ZFS will also hold an extra variable indicating whether the data is
compressed at all. When reading, if the data is compressed, it is up to
Lustre to get the original size and algorithm, decompress the data and
put it into the page structure.


Any comments or ideas are very welcome!

Regards,
Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-02-16 14:15         ` Anna Fuchs
@ 2017-02-17 19:15           ` Xiong, Jinshan
  2017-02-17 20:29             ` Dilger, Andreas
  2017-07-21 15:15           ` Anna Fuchs
  1 sibling, 1 reply; 27+ messages in thread
From: Xiong, Jinshan @ 2017-02-17 19:15 UTC (permalink / raw)
  To: lustre-devel

Hi Anna,

Thanks for updating. Please see inserted lines.

On Feb 16, 2017, at 6:15 AM, Anna Fuchs <anna.fuchs at informatik.uni-hamburg.de<mailto:anna.fuchs@informatik.uni-hamburg.de>> wrote:

Dear all,

I would like to update you about my progress on the project.
Unfortunately, I can not publish a complete design of the feature,
since it changes very much during the development.

First the work related to the client changes:

I had to discard my approach to introduce the changes within the
sptlrpc layer for the moment. Compression of the data affects
especially the resulting number of pages and therefore number and size
of niobufs, size and structure of the descriptor and request, size of
the bulk kiov, checksums and in the end the async arguments. Actually
it affects everything, that is set within the osc_brw_prep_request
function in osc_request.c. When entering the sptlrpc layer, most of
that parameters are already set and I would need to update everything.
That causes double work and requires a lot of code duplication from the
osc module.

My current dirty prototype invokes compression just at the beginning of
that function, before niocount is calculated. I need to have a separate
bunch of pages to store compressed data so that I would not overwrite
the content of the original pages, which may be exposed to the
userspace process.
The original pages would be freed and the compressed pages processed
for the request and finally also freed.

Please remember to reserve some pages as an emergency pool, to avoid the problem that system memory is in shortage while free pages are needed for compression in order to write back more pages. We may use the same pool to support partial blocks, so it must be greater than the largest ZFS block size (I prefer not to compress data for partial blocks).

After an RPC is issued, the pages containing compressed data will be pinned in memory for a while for recovery reasons. Therefore, when emergency pages are used, you will have to issue the RPC in sync mode, so that the server can commit the write transaction to persistent storage and the client can reuse the emergency pages for a new RPC immediately.


I also reconsidered the idea to do compression niobuf-wise. Due to the
file layout, compression should be done record-wise. Since a niobuf is
a technical requirement for the pages to be contiguous, a record (e.g.
128KB) is a logical unit. In my understanding, it can happen, that one
record contains of several niobufs whenever we do not have enough

We use the terminology "chunk" for the preferred block size on the OST. Let's use the same terminology ;-)

contiguous pages for a complete record. For that reason, I would like
to leave the niobuf structure as is it and introduce a record structure
on top of it. That record structure will hold the logical(uncompressed)
and physical(compressed) data sizes and the algorithm used for

hmm, not sure if this is the right approach. I tend to think the client will talk with the OST at connect time and negotiate the compression algorithm, and after that they should use the same algorithm. There is no need to carry this information in every single RPC.

Yes, it's reasonable to have chunk descriptors in the RPC. When there are multiple compressed chunks packed in one RPC, the exact bufsize for each chunk will be packed as well. Right now, LNET doesn't support partial pages inside a niobuf (except the first and last page), so clients have to provide enough information in the chunk descriptor so that the server can deduce the padding size for each chunk in the niobuf.

compression. Initially we wanted to extend the niobuf struct by those
fields. I think that change would affect the RPC request structure very
much since the first Lustre message fields will not be followed by an
array of niobufs, but by an array of records, which can contain an
array of niobufs.

We just need a new RPC format. Please take a look at RQF_OST_BRW_{READ,WRITE}. What we need is probably something like RQF_OST_COMP_BRW_{READ,WRITE}, which is basically the same thing but with a chunk descriptor:

static const struct req_msg_field *ost_comp_brw_client[] = {
        &RMF_PTLRPC_BODY,
        &RMF_OST_BODY,
        &RMF_OBD_IOOBJ,
        &RMF_NIOBUF_REMOTE,
>>>     &RMF_CHUNK_DESCR,
        &RMF_CAPA1
};

On the server/storage side, the different niobufs must be then
associated with the same record and provided to ZFS.

Server changes:

Since we work on the Lustre/ZFS interface, we think it would be the
best to let Lustre compose the header information for every record
(psize and algorithm, maybe also the checksum in the future). We will

I tend to let ZFS do this job, especially for the checksum; otherwise, if Lustre provided wrong data, it would affect the consistency of ZFS.

store these values at the beginning of every record in 4 Bytes each.
Currently, when ZFS does compression itself, the compressed size is
stored only within the compressed data. Some algorithms get it when
starting the decompression, for lz4 it is stored at the beginning. With
our approach, we would unify the record-metadata for any algorithm, but

Wait, are you suggesting to store record/chunk-metadata into persistent storage?

at the moment it would not be accessible by ZFS without changes to ZFS
structures.

ZFS will also hold an extra variable whether the data is compressed at
all. When reading and the data is compressed, it is up to Lustre to get
the original size and algorithm, to decompress the data and put it into
page structure.

Yes, the server will check the capability of the client to decide whether to return compressed data.

I haven't looked into the corresponding code, but Matt mentioned before that this is pretty much the same interface as ZFS send/recv.

Thanks,
Jinshan



Any comments or ideas are very welcome!

Regards,
Anna





_______________________________________________
lustre-devel mailing list
lustre-devel at lists.lustre.org<mailto:lustre-devel@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20170217/6eab1133/attachment-0001.htm>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-02-17 19:15           ` Xiong, Jinshan
@ 2017-02-17 20:29             ` Dilger, Andreas
  2017-02-17 21:03               ` Xiong, Jinshan
  0 siblings, 1 reply; 27+ messages in thread
From: Dilger, Andreas @ 2017-02-17 20:29 UTC (permalink / raw)
  To: lustre-devel

On Feb 17, 2017, at 12:15, Xiong, Jinshan <jinshan.xiong@intel.com> wrote:
> 
> Hi Anna,
> 
> Thanks for updating. Please see inserted lines.
> 
>> On Feb 16, 2017, at 6:15 AM, Anna Fuchs <anna.fuchs@informatik.uni-hamburg.de> wrote:
>> 
>> Dear all, 
>> 
>> I would like to update you about my progress on the project. 
>> Unfortunately, I can not publish a complete design of the feature,
>> since it changes very much during the development. 
>> 
>> First the work related to the client changes: 
>> 
>> I had to discard my approach to introduce the changes within the
>> sptlrpc layer for the moment. Compression of the data affects
>> especially the resulting number of pages and therefore number and size
>> of niobufs, size and structure of the descriptor and request, size of
>> the bulk kiov, checksums and in the end the async arguments. Actually
>> it affects everything, that is set within the osc_brw_prep_request
>> function in osc_request.c. When entering the sptlrpc layer, most of
>> that parameters are already set and I would need to update everything.
>> That causes double work and requires a lot of code duplication from the
>> osc module. 
>> 
>> My current dirty prototype invokes compression just at the beginning of
>> that function, before niocount is calculated. I need to have a separate
>> bunch of pages to store compressed data so that I would not overwrite
>> the content of the original pages, which may be exposed to the
>> userspace process. 
>> The original pages would be freed and the compressed pages processed
>> for the request and finally also freed. 
> 
> Please remember to reserve some pages as emergency pool to avoid the problem that the system memory is in shortage and it needs some free pages for compression to writeback more pages. We may use the same pool to support partial block so it must be greater than the largest ZFS block size(I prefer to not compress data for partial blocks). 
> 
> After RPC is issued, the pages contain compressed data will be pinned in memory for a while for recovery reasons. Therefore, when emergency pages are used, you will have to issue the RPC in sync mode, so that the server can commit the write trans into persistent storage and client can use the emergency pages for new RPC immediately.
> 
>> 
>> I also reconsidered the idea to do compression niobuf-wise. Due to the
>> file layout, compression should be done record-wise. Since a niobuf is
>> a technical requirement for the pages to be contiguous, a record (e.g.
>> 128KB) is a logical unit. In my understanding, it can happen, that one
>> record contains of several niobufs whenever we do not have enough
> 
> We use the terminology "chunk" as the preferred block size on the OST. Let's use the same terminology ;-)
> 
>> contiguous pages for a complete record. For that reason, I would like
>> to leave the niobuf structure as is it and introduce a record structure
>> on top of it. That record structure will hold the logical(uncompressed)
>> and physical(compressed) data sizes and the algorithm used for
> 
> hmm, not sure if this is the right approach. I tend to think the client will talk with the OST at connecting time and negotiate the compress algorithm, and after that they should use the same algorithm. There is no need to carry this information in every single RPC.

I'm not sure I agree.  The benefits of compression may be different on a per-file basis (e.g. .txt vs. .jpg) so there shouldn't be a fixed compression algorithm required for all RPCs.  I could imagine that we don't want to allow a different compression type for each block (which ZFS allows), but one compression type per RPC should be OK.  We do the same for the checksum type.

> Yes, it's reasonable to have chunk descriptors in the RPC. When there are multiple compressed chunks packed in one RPC, the exact bufsize for each chunk will be packed as well. Right now, the LNET doesn't support partial pages inside niobuf(except the first and last page), so clients have to provide enough information in the chunk descriptor so the server can deduce the padding size for each chunk in the niobuf.
> 
>> compression. Initially we wanted to extend the niobuf struct by those
>> fields. I think that change would affect the RPC request structure very
>> much since the first Lustre message fields will not be followed by an
>> array of niobufs, but by an array of records, which can contain an
>> array of niobufs. 
> 
> We just need a new format of RPC. Please take a look at RQF_OST_BRW_{READ,WRITE}. What we need is probably some thing like RQF_OST_COMP_BRW_{READ,WRITE}, which is basically the same thing but with chunk descriptor:
> 
> static const struct req_msg_field *ost_comp_brw_client[] = {
>         &RMF_PTLRPC_BODY,
>         &RMF_OST_BODY,
>         &RMF_OBD_IOOBJ,
>         &RMF_NIOBUF_REMOTE,
> >>>     &RMF_CHUNK_DESCR,
>         &RMF_CAPA1
> };
> 
>> On the server/storage side, the different niobufs must be then
>> associated with the same record and provided to ZFS. 
>> 
>> Server changes: 
>> 
>> Since we work on the Lustre/ZFS interface, we think it would be the
>> best to let Lustre compose the header information for every record
>> (psize and algorithm, maybe also the checksum in the future). We will
> 
> I tend to let ZFS do this job especially for checksum otherwise if Lustre provided wrong data it would affect the consistency of ZFS.

We want to allow Lustre clients to use the same ZFS checksum in the future, so there needs to be an interface to pass this.  If ZFS verifies the checksum when the write is first submitted, and returns an error before doing actual filesystem modifications then it can verify the checksum is correct for that block, and we can skip the Lustre RPC checksum.  This would probably work OK with the "zero copy" interface that we use, where data buffers are preallocated for RDMA without actually being attached to a TXG, and then the checksum would be verified by ZFS at submission.

>> store these values at the beginning of every record in 4 Bytes each. 
>> Currently, when ZFS does compression itself, the compressed size is
>> stored only within the compressed data. Some algorithms get it when
>> starting the decompression, for lz4 it is stored at the beginning. With
>> our approach, we would unify the record-metadata for any algorithm, but
> 
> Wait, are you suggesting to store record/chunk-metadata into persistent storage?
> 
>> at the moment it would not be accessible by ZFS without changes to ZFS
>> structures. 
>> 
>> ZFS will also hold an extra variable whether the data is compressed at
>> all. When reading and the data is compressed, it is up to Lustre to get
>> the original size and algorithm, to decompress the data and put it into
>> page structure. 
> 
> Yes, the server will check the capability of client to decide if to return compressed data.
> 
> I don't look into the corresponding code but Matt mentioned before this is pretty much the same interface of ZFS send/recv.
> 
> Thanks,
> Jinshan
> 
>> 
>> 
>> Any comments or ideas are very welcome! 
>> 
>> Regards,
>> Anna
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> lustre-devel mailing list
>> lustre-devel at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-02-17 20:29             ` Dilger, Andreas
@ 2017-02-17 21:03               ` Xiong, Jinshan
  2017-02-17 21:36                 ` Dilger, Andreas
  0 siblings, 1 reply; 27+ messages in thread
From: Xiong, Jinshan @ 2017-02-17 21:03 UTC (permalink / raw)
  To: lustre-devel


On Feb 17, 2017, at 12:29 PM, Dilger, Andreas <andreas.dilger at intel.com<mailto:andreas.dilger@intel.com>> wrote:

On Feb 17, 2017, at 12:15, Xiong, Jinshan <jinshan.xiong at intel.com<mailto:jinshan.xiong@intel.com>> wrote:

Hi Anna,

Thanks for updating. Please see inserted lines.

On Feb 16, 2017, at 6:15 AM, Anna Fuchs <anna.fuchs at informatik.uni-hamburg.de<mailto:anna.fuchs@informatik.uni-hamburg.de>> wrote:

Dear all,

I would like to update you about my progress on the project.
Unfortunately, I can not publish a complete design of the feature,
since it changes very much during the development.

First the work related to the client changes:

I had to discard my approach to introduce the changes within the
sptlrpc layer for the moment. Compression of the data affects
especially the resulting number of pages and therefore number and size
of niobufs, size and structure of the descriptor and request, size of
the bulk kiov, checksums and in the end the async arguments. Actually
it affects everything, that is set within the osc_brw_prep_request
function in osc_request.c. When entering the sptlrpc layer, most of
that parameters are already set and I would need to update everything.
That causes double work and requires a lot of code duplication from the
osc module.

My current dirty prototype invokes compression just at the beginning of
that function, before niocount is calculated. I need to have a separate
bunch of pages to store compressed data so that I would not overwrite
the content of the original pages, which may be exposed to the
userspace process.
The original pages would be freed and the compressed pages processed
for the request and finally also freed.

Please remember to reserve some pages as emergency pool to avoid the problem that the system memory is in shortage and it needs some free pages for compression to writeback more pages. We may use the same pool to support partial block so it must be greater than the largest ZFS block size(I prefer to not compress data for partial blocks).

After RPC is issued, the pages contain compressed data will be pinned in memory for a while for recovery reasons. Therefore, when emergency pages are used, you will have to issue the RPC in sync mode, so that the server can commit the write trans into persistent storage and client can use the emergency pages for new RPC immediately.


I also reconsidered the idea to do compression niobuf-wise. Due to the
file layout, compression should be done record-wise. Since a niobuf is
a technical requirement for the pages to be contiguous, a record (e.g.
128KB) is a logical unit. In my understanding, it can happen, that one
record contains of several niobufs whenever we do not have enough

We use the terminology "chunk" as the preferred block size on the OST. Let's use the same terminology ;-)

contiguous pages for a complete record. For that reason, I would like
to leave the niobuf structure as is it and introduce a record structure
on top of it. That record structure will hold the logical(uncompressed)
and physical(compressed) data sizes and the algorithm used for

hmm, not sure if this is the right approach. I tend to think the client will talk with the OST at connecting time and negotiate the compress algorithm, and after that they should use the same algorithm. There is no need to carry this information in every single RPC.

I'm not sure I agree.  The benefits of compression may be different on a per-file basis (e.g. .txt vs. .jpg) so there shouldn't be a fixed compression algorithm required for all RPCs.  I could imagine that we don't want to allow a different compression type for each block (which ZFS allows), but one compression type per RPC should be OK.  We do the same for the checksum type.

The difference between checksum and compression is that different types of checksum should produce the same results; therefore the clients can pick any checksum algorithm at their own discretion.

As for your example, I think it's more likely that the OSC will decide to turn off compression for the .jpg file after trying to compress a few chunks and figuring out there is no benefit in doing that.

Jinshan


Yes, it's reasonable to have chunk descriptors in the RPC. When there are multiple compressed chunks packed in one RPC, the exact bufsize for each chunk will be packed as well. Right now, the LNET doesn't support partial pages inside niobuf(except the first and last page), so clients have to provide enough information in the chunk descriptor so the server can deduce the padding size for each chunk in the niobuf.

compression. Initially we wanted to extend the niobuf struct by those
fields. I think that change would affect the RPC request structure very
much since the first Lustre message fields will not be followed by an
array of niobufs, but by an array of records, which can contain an
array of niobufs.

We just need a new format of RPC. Please take a look at RQF_OST_BRW_{READ,WRITE}. What we need is probably some thing like RQF_OST_COMP_BRW_{READ,WRITE}, which is basically the same thing but with chunk descriptor:

static const struct req_msg_field *ost_comp_brw_client[] = {
       &RMF_PTLRPC_BODY,
       &RMF_OST_BODY,
       &RMF_OBD_IOOBJ,
       &RMF_NIOBUF_REMOTE,
   &RMF_CHUNK_DESCR,
       &RMF_CAPA1
};

On the server/storage side, the different niobufs must be then
associated with the same record and provided to ZFS.

Server changes:

Since we work on the Lustre/ZFS interface, we think it would be the
best to let Lustre compose the header information for every record
(psize and algorithm, maybe also the checksum in the future). We will

I tend to let ZFS do this job especially for checksum otherwise if Lustre provided wrong data it would affect the consistency of ZFS.

We want to allow Lustre clients to use the same ZFS checksum in the future, so there needs to be an interface to pass this.  If ZFS verifies the checksum when the write is first submitted, and returns an error before doing actual filesystem modifications then it can verify the checksum is correct for that block, and we can skip the Lustre RPC checksum.  This would probably work OK with the "zero copy" interface that we use, where data buffers are preallocated for RDMA without actually being attached to a TXG, and then the checksum would be verified by ZFS at submission.

store these values at the beginning of every record in 4 Bytes each.
Currently, when ZFS does compression itself, the compressed size is
stored only within the compressed data. Some algorithms get it when
starting the decompression, for lz4 it is stored at the beginning. With
our approach, we would unify the record-metadata for any algorithm, but

Wait, are you suggesting to store record/chunk-metadata into persistent storage?

at the moment it would not be accessible by ZFS without changes to ZFS
structures.

ZFS will also hold an extra variable whether the data is compressed at
all. When reading and the data is compressed, it is up to Lustre to get
the original size and algorithm, to decompress the data and put it into
page structure.

Yes, the server will check the capability of client to decide if to return compressed data.

I don't look into the corresponding code but Matt mentioned before this is pretty much the same interface of ZFS send/recv.

Thanks,
Jinshan



Any comments or ideas are very welcome!

Regards,
Anna





_______________________________________________
lustre-devel mailing list
lustre-devel at lists.lustre.org<mailto:lustre-devel@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20170217/1481a52b/attachment-0001.htm>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-02-17 21:03               ` Xiong, Jinshan
@ 2017-02-17 21:36                 ` Dilger, Andreas
  0 siblings, 0 replies; 27+ messages in thread
From: Dilger, Andreas @ 2017-02-17 21:36 UTC (permalink / raw)
  To: lustre-devel

On Feb 17, 2017, at 14:03, Xiong, Jinshan <jinshan.xiong@intel.com> wrote:
> 
>> 
>> On Feb 17, 2017, at 12:29 PM, Dilger, Andreas <andreas.dilger@intel.com> wrote:
>> 
>> On Feb 17, 2017, at 12:15, Xiong, Jinshan <jinshan.xiong@intel.com> wrote:
>>> 
>>> Hi Anna,
>>> 
>>> Thanks for updating. Please see inserted lines.
>>> 
>>>> On Feb 16, 2017, at 6:15 AM, Anna Fuchs <anna.fuchs@informatik.uni-hamburg.de> wrote:
>>>> 
>>>> Dear all, 
>>>> 
>>>> I would like to update you about my progress on the project. 
>>>> Unfortunately, I can not publish a complete design of the feature,
>>>> since it changes very much during the development. 
>>>> 
>>>> First the work related to the client changes: 
>>>> 
>>>> I had to discard my approach to introduce the changes within the
>>>> sptlrpc layer for the moment. Compression of the data affects
>>>> especially the resulting number of pages and therefore number and size
>>>> of niobufs, size and structure of the descriptor and request, size of
>>>> the bulk kiov, checksums and in the end the async arguments. Actually
>>>> it affects everything, that is set within the osc_brw_prep_request
>>>> function in osc_request.c. When entering the sptlrpc layer, most of
>>>> that parameters are already set and I would need to update everything.
>>>> That causes double work and requires a lot of code duplication from the
>>>> osc module. 
>>>> 
>>>> My current dirty prototype invokes compression just at the beginning of
>>>> that function, before niocount is calculated. I need to have a separate
>>>> bunch of pages to store compressed data so that I would not overwrite
>>>> the content of the original pages, which may be exposed to the
>>>> userspace process. 
>>>> The original pages would be freed and the compressed pages processed
>>>> for the request and finally also freed. 
>>> 
>>> Please remember to reserve some pages as emergency pool to avoid the problem that the system memory is in shortage and it needs some free pages for compression to writeback more pages. We may use the same pool to support partial block so it must be greater than the largest ZFS block size(I prefer to not compress data for partial blocks). 
>>> 
>>> After RPC is issued, the pages contain compressed data will be pinned in memory for a while for recovery reasons. Therefore, when emergency pages are used, you will have to issue the RPC in sync mode, so that the server can commit the write trans into persistent storage and client can use the emergency pages for new RPC immediately.
>>> 
>>>> 
>>>> I also reconsidered the idea to do compression niobuf-wise. Due to the
>>>> file layout, compression should be done record-wise. Since a niobuf is
>>>> a technical requirement for the pages to be contiguous, a record (e.g.
>>>> 128KB) is a logical unit. In my understanding, it can happen, that one
>>>> record contains of several niobufs whenever we do not have enough
>>> 
>>> We use the terminology "chunk" as the preferred block size on the OST. Let's use the same terminology ;-)
>>> 
>>>> contiguous pages for a complete record. For that reason, I would like
>>>> to leave the niobuf structure as is it and introduce a record structure
>>>> on top of it. That record structure will hold the logical(uncompressed)
>>>> and physical(compressed) data sizes and the algorithm used for
>>> 
>>> hmm, not sure if this is the right approach. I tend to think the client will talk with the OST at connecting time and negotiate the compress algorithm, and after that they should use the same algorithm. There is no need to carry this information in every single RPC.
>> 
>> I'm not sure I agree.  The benefits of compression may be different on a per-file basis (e.g. .txt vs. .jpg) so there shouldn't be a fixed compression algorithm required for all RPCs.  I could imagine that we don't want to allow a different compression type for each block (which ZFS allows), but one compression type per RPC should be OK.  We do the same for the checksum type.
> 
> The difference between checksums and compression is that any checksum type produces an equally valid result (it just has to verify the data), therefore the client can pick a checksum algorithm at its own discretion.
> 
> As for your example, I think it's more likely that the OSC will decide to turn off compression for the .jpg file after trying to compress a few chunks and figuring out that there is no benefit in doing so.

Actually, Anna's research group did some testing with dynamic compression at the ZFS level on a per-block basis (which I would love to see submitted upstream to ZFS) so that the node can balance CPU usage vs. compression ratio for each file or potentially each block.  I don't want to bake a connect-time compression algorithm into the network protocol, even if we don't implement dynamic compression selection at runtime immediately.  I'm sure we can find some space in the RPC for the compression algorithm, or even in a new niobuf_remote if the RPC format is changing anyway.

If there were only a handful of compression algorithms, we could encode the compression algorithm into 4 bits of rnb_flags.  I don't _think_ we need to also specify the compression level or other parameters, just the algorithm (e.g. gzip, lz4), but I don't want to be too limiting if the number of compression algorithms continues to grow, so a separate 32-bit field with 32 bits of padding would be safer if the protocol is already being changed.  It would also be good to add some padding to obd_ioobj for future use while we are in there.
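
To make the two options concrete, here is a rough sketch; none of these names exist in the current protocol, they are only illustrative:

/*
 * Illustrative sketch only - not the actual Lustre wire protocol.
 * Option (a): pack the algorithm into 4 assumed-free bits of rnb_flags.
 * Option (b): a dedicated 32-bit field plus 32-bit padding in a new
 * niobuf variant.  All OBD_BRW_COMP_* and niobuf_remote_v2 names are
 * made up.
 */
#include <stdint.h>

#define OBD_BRW_COMP_ALG_SHIFT	28
#define OBD_BRW_COMP_ALG_MASK	(0xfU << OBD_BRW_COMP_ALG_SHIFT)

enum comp_alg {
	COMP_ALG_NONE = 0,
	COMP_ALG_LZ4  = 1,
	COMP_ALG_GZIP = 2,
};

/* option (a): helpers to store/read the algorithm in rnb_flags */
static inline uint32_t rnb_flags_set_alg(uint32_t flags, enum comp_alg alg)
{
	return (flags & ~OBD_BRW_COMP_ALG_MASK) |
	       ((uint32_t)alg << OBD_BRW_COMP_ALG_SHIFT);
}

static inline enum comp_alg rnb_flags_get_alg(uint32_t flags)
{
	return (enum comp_alg)((flags & OBD_BRW_COMP_ALG_MASK) >>
			       OBD_BRW_COMP_ALG_SHIFT);
}

/* option (b): a wider niobuf with an explicit algorithm field and padding */
struct niobuf_remote_v2 {
	uint64_t rnb_offset;
	uint32_t rnb_len;	/* physical (compressed) length */
	uint32_t rnb_flags;
	uint32_t rnb_comp_alg;	/* algorithm id, 0 = uncompressed */
	uint32_t rnb_padding;	/* reserved for future use */
};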

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-02-16 14:15         ` Anna Fuchs
  2017-02-17 19:15           ` Xiong, Jinshan
@ 2017-07-21 15:15           ` Anna Fuchs
  2017-07-21 16:43             ` Patrick Farrell
  2017-07-21 19:12             ` Xiong, Jinshan
  1 sibling, 2 replies; 27+ messages in thread
From: Anna Fuchs @ 2017-07-21 15:15 UTC (permalink / raw)
  To: lustre-devel

Dear all, 

for compression within the osc module we need a bunch of pages for the
compressed output (at most the same size as the original data), and a few
pages for working memory of the algorithms. Since allocating (and later
freeing) the pages every time we enter the compression loop might be
expensive and annoying, we thought about a pool of pages that exists
exclusively for compression purposes.

We would create that pool at file system start (when loading the osc
module) and destroy it at file system stop (when unloading the osc
module). The precondition is, of course, the configure option
--enable-compression. The pool would be a queue of page bunches from
which a thread can pop pages for compression and put them back after the
compressed portion has been transferred. The page content will not be
visible to anyone outside and will also not be cached after the
transmission.
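
To make that more concrete, here is a minimal sketch of what we have in
mind; all names and sizes below are placeholders, not real code:

/*
 * Placeholder sketch of the compression page pool: a spinlock-protected
 * queue of pre-allocated "bunches", each big enough for one chunk.
 */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/mm.h>

#define COMP_BUNCH_PAGES	(128 * 1024 / PAGE_SIZE)	/* one 128KB chunk */

struct comp_bunch {
	struct list_head cb_link;
	struct page	*cb_pages[COMP_BUNCH_PAGES];
};

static LIST_HEAD(comp_pool);
static DEFINE_SPINLOCK(comp_pool_lock);

/* Pop one bunch; NULL means the pool is empty and the caller just
 * skips compression for this chunk. */
static struct comp_bunch *comp_bunch_get(void)
{
	struct comp_bunch *bunch = NULL;

	spin_lock(&comp_pool_lock);
	if (!list_empty(&comp_pool)) {
		bunch = list_first_entry(&comp_pool, struct comp_bunch, cb_link);
		list_del(&bunch->cb_link);
	}
	spin_unlock(&comp_pool_lock);
	return bunch;
}

/* Return a bunch after the compressed portion has been transferred. */
static void comp_bunch_put(struct comp_bunch *bunch)
{
	spin_lock(&comp_pool_lock);
	list_add(&bunch->cb_link, &comp_pool);
	spin_unlock(&comp_pool_lock);
}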

We would like to make the pool static since we think we do not need a
lot of memory. However, it depends on the number of stripes or MBs that
one client can handle at the same time. E.g. for 32 stripes of 1MB
processed at the same time, we need at most 32 MB + a few MB of
overhead. Where can I find the exact number, or how can I estimate how
many stripes there are at most at the same time? Another limitation is
the number of threads which can work on compression in parallel at the
same time. We are thinking of exclusively reserving no more than 50 MB
for the compression page pool per client. Do you think it might hurt
performance?

If there are not enough pages, for whatever reason, we wouldn't wait,
but would just skip compression for the respective chunk. 

Are there any problems you see in that approach? 

Regards,
Anna

--
Anna Fuchs
https://wr.informatik.uni-hamburg.de/people/anna_fuchs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-21 15:15           ` Anna Fuchs
@ 2017-07-21 16:43             ` Patrick Farrell
  2017-07-21 19:19               ` Xiong, Jinshan
  2017-07-25 14:25               ` Anna Fuchs
  2017-07-21 19:12             ` Xiong, Jinshan
  1 sibling, 2 replies; 27+ messages in thread
From: Patrick Farrell @ 2017-07-21 16:43 UTC (permalink / raw)
  To: lustre-devel

I think basing this on the maximum number of stripes is too simple, and maybe not necessary.

Apologies in advance if what I say below rests on a misunderstanding of the compression design, I should know it better than I do.


But, here goes.

Regarding basing it on the maximum stripe count: there are a number of 1000-OST systems in the world today.  Imagine one of them with 16 MiB stripes; that's ~16 GiB of memory for this.  I think that's clearly too large.  But a global (rather than per-OSC) pool could be tricky too, leading to contention on getting and returning pages.

You mention later a 50 MiB pool per client.  As a per OST pre-allocated pool, that would likely be too large.  As a global pool, it seems small...


But why use a global pool?  It sounds like the compression would be handled by the thread putting the data on the wire (Sorry if I've got that wrong).  So - What about a per-thread block of pages, for each ptlrpcd thread?  If the idea is that this compressed data is not retained for replay (instead, you would re-compress), then we only need a block of max rpc size for each thread (You could just use the largest RPC size supported by the client), so it can send that compressed data.

The overhead of compression for replay is probably not something we need to worry about.


Or even per-CPU blocks of pages.  That would probably be better still (less total memory if there are more ptlrpcds than CPUs), if we can guarantee not sleeping during the time the pool is in use.  (I'm not sure.)
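
Purely as an illustration of that per-CPU idea (the 4 MiB size and all names below are assumptions on my part, not a concrete proposal):

#include <linux/percpu.h>
#include <linux/vmalloc.h>
#include <linux/smp.h>
#include <linux/errno.h>

#define COMP_BUF_SIZE	(4 << 20)	/* assume 4 MiB max RPC size */

static void * __percpu *comp_buf;	/* one buffer per possible CPU */

static int comp_percpu_init(void)
{
	int cpu;

	comp_buf = alloc_percpu(void *);
	if (!comp_buf)
		return -ENOMEM;
	/* error handling for a failed vmalloc() omitted in this sketch */
	for_each_possible_cpu(cpu)
		*per_cpu_ptr(comp_buf, cpu) = vmalloc(COMP_BUF_SIZE);
	return 0;
}

/* Caller must pair this with comp_buf_put() and must not sleep in between. */
static void *comp_buf_get(void)
{
	return *per_cpu_ptr(comp_buf, get_cpu());
}

static void comp_buf_put(void)
{
	put_cpu();
}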

Also, you mention limiting the # of threads.  Why is limiting the number of threads doing compression of interest?  What are you specifically trying to avoid with that?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-21 15:15           ` Anna Fuchs
  2017-07-21 16:43             ` Patrick Farrell
@ 2017-07-21 19:12             ` Xiong, Jinshan
  1 sibling, 0 replies; 27+ messages in thread
From: Xiong, Jinshan @ 2017-07-21 19:12 UTC (permalink / raw)
  To: lustre-devel

Please see inserted lines.

-----Original Message-----
From: Anna Fuchs <anna.fuchs@informatik.uni-hamburg.de>
Date: Friday, July 21, 2017 at 8:15 AM
To: "Xiong, Jinshan" <jinshan.xiong@intel.com>
Cc: Matthew Ahrens <mahrens@delphix.com>, "Zhuravlev, Alexey" <alexey.zhuravlev@intel.com>, lustre-devel <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] Design proposal for client-side compression

    Dear all, 
    
    for compression within the osc module we need a bunch of pages for the
    compressed output (at most the same size like original data), and few
    pages for working memory of the algorithms. Since allocating (and later
    freeing) the pages every time we enter the compression loop might be
    expensive and annoying, we thought about a pool of pages, which is
  present exclusively for compression purposes.
    
    We would create that pool at file system start (when loading the osc
    module) and destroy at file system stop (when unloading the osc
    module). The condition is, of course, the configure option --enable-
  compression. The pool would be a queue of page bunches where a thread

Is it possible to enable this by writing to a sysfs or procfs entry? So that users can try this out without having to recompile Lustre.

    can pop pages for compression and put them back after the compressed
    portion was transferred. The page content will not be visible to anyone
    outside and will also not be cached after the transmission.
    
    We would like to make the pool static since we think, we do not need a
    lot of memory. However it depends on the number of stripes or MBs, that
    one client can handle at the same time. E.g. for 32 stripes of 1MB
  processed at the same time, we need at most 32 MB + few MB for

Actually, we have increased the default RPC size to 4MB, so this assumption is no longer true.

    overhead. Where can I find the exact number or how can I estimate how
  many stripes there are at most at the same time? Another limitation is
  
It's not scalable to have a pool per OSC because Lustre can support up to 2000 stripes. However, we don't need to worry about the wide-stripe problem, because no one can write a full stripe even with a 1MB stripe size - that would mean the application has to issue a 2GB write.
  
    the number of threads, which can work in parallel on compression at the
    same time. We think to exclusively reserve not more than 50 MB for the
    compression page pool per client. Do you think it might hurt the
  performance?

Yes, it's reasonable to have a global pool for each client node. Let's start from this number, but please make it adjustable via sysfs or procfs.

Jinshan
    
    Once there are not enough pages, for whatever reason, we wouldn't wait,
    but just skip the compression for the respective chunk. 
    
    Are there any problems you see in that approach? 
    
    Regards,
    Anna
    
    --
    Anna Fuchs
    https://wr.informatik.uni-hamburg.de/people/anna_fuchs
    

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-21 16:43             ` Patrick Farrell
@ 2017-07-21 19:19               ` Xiong, Jinshan
  2017-07-25 14:25               ` Anna Fuchs
  1 sibling, 0 replies; 27+ messages in thread
From: Xiong, Jinshan @ 2017-07-21 19:19 UTC (permalink / raw)
  To: lustre-devel



From: Patrick Farrell <paf@cray.com>
Date: Friday, July 21, 2017 at 9:44 AM
To: Anna Fuchs <anna.fuchs@informatik.uni-hamburg.de>, "Xiong, Jinshan" <jinshan.xiong@intel.com>
Cc: Matthew Ahrens <mahrens@delphix.com>, "Zhuravlev, Alexey" <alexey.zhuravlev@intel.com>, lustre-devel <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] Design proposal for client-side compression


I think basing this on the maximum number of stripes it too simple, and maybe not necessary.

Apologies in advance if what I say below rests on a misunderstanding of the compression design, I should know it better than I do.



But, here goes.

About based on maximum stripe count, there are a number of 1000 OST systems in the world today.  Imagine one of them with 16 MiB stripes, that's ~16 GiB of memory for this.  I think that's clearly too large.  But a global (rather than per OSC) pool could be tricky too, leading to contention on getting and returning pages.

You mention later a 50 MiB pool per client.  As a per OST pre-allocated pool, that would likely be too large.  As a global pool, it seems small...



But why use a global pool?  It sounds like the compression would be handled by the thread putting the data on the wire (Sorry if I've got that wrong).  So - What about a per-thread block of pages, for each ptlrpcd thread?  If the idea is that this compressed data is not retained for replay (instead, you would re-compress), then we only need a block of max rpc size for each thread (You could just use the largest RPC size supported by the client), so it can send that compressed data.



The writing thread would also need to issue the RPC in its own process context, but it can be revised to use a ptlrpcd thread. I tend to think using a global pool would be reasonable for now, because compression should be slow, so I don't expect there would be a lot of lock contention for the global pool.



Jinshan

The overhead of compression for replay is probably not something we need to worry about.



Or even per-CPU blocks of pages.  That would probably be better still (less total memory if there are more ptlrpcds than CPUs), if we can guarantee not sleeping during the time the pool is in use.  (I'm not sure.)

Also, you mention limiting the # of threads.  Why is limiting the number of threads doing compression of interest?  What are you specifically trying to avoid with that?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-21 16:43             ` Patrick Farrell
  2017-07-21 19:19               ` Xiong, Jinshan
@ 2017-07-25 14:25               ` Anna Fuchs
  2017-07-26 18:26                 ` Patrick Farrell
  1 sibling, 1 reply; 27+ messages in thread
From: Anna Fuchs @ 2017-07-25 14:25 UTC (permalink / raw)
  To: lustre-devel

Thank you for your responses. 

Patrick, 

On Fri, 2017-07-21 at 16:43 +0000, Patrick Farrell wrote:
> I think basing this on the maximum number of stripes it too simple,
> and maybe not necessary.
> 
> Apologies in advance if what I say below rests on a misunderstanding
> of the compression design, I should know it better than I do.

Probably I still don't get all the relevant internals of Lustre to
clearly describe what we are planning and what we need.

> About based on maximum stripe count, there are a number of 1000 OST
> systems in the world today.  Imagine one of them with 16 MiB stripes,
> that's ~16 GiB of memory for this.  I think that's clearly too
> large.  But a global (rather than per OSC) pool could be tricky too,
> leading to contention on getting and returning pages.

Well, does that mean that one Lustre client handles all the 16 GiB of
stripes at the same time, or does it somehow iterate over the stripes?
How much memory do we need at most at the same time? If the client
first processes 100 stripes, we need enough memory to compress 100
stripes at the same time. So the question is not about the maximum
stripe count, but about the maximum - let me call it the queue portion -
of stripes which can be processed at the same time within one client.

> 
> You mention later a 50 MiB pool per client.  As a per OST pre-
> allocated pool, that would likely be too large.  As a global pool, it
> seems small...
> 
> But why use a global pool?  It sounds like the compression would be
> handled by the thread putting the data on the wire (Sorry if I've got
> that wrong).  So - What about a per-thread block of pages, for each
> ptlrpcd thread?  If the idea is that this compressed data is not
> retained for replay (instead, you would re-compress), then we only
> need a block of max rpc size for each thread (You could just use the
> largest RPC size supported by the client), so it can send that
> compressed data.

Yes, we don't really need a very global pool, but we still need to
know how many threads there can be at the same time. Is there one
thread per stripe or per RPC? And how many in total?

> 
> The overhead of compression for replay is probably not something we
> need to worry about.
> 
> Or even per-CPU blocks of pages.  That would probably be better still
> (less total memory if there are more ptlrpcds than CPUs), if we can
> guarantee not sleeping during the time the pool is in use.  (I'm not
> sure.)
> 
> Also, you mention limiting the # of threads.  Why is limiting the
> number of threads doing compression of interest?  What are you
> specifically trying to avoid with that?

I mean that the number of threads available for compression is limited
somehow. If we have 100 stripes at the same time, we can still only
compress with #cores threads, which might be fewer than 100. So if there
are more stripes in flight than we have resources for compression (since
it is slower), we need to decide whether to slow everything down or to
skip compression of some stripes for overall better performance. 


Jinshan, 

> 
> Is it possible to enable this by writing to a sysfs or procfs entry?
> So that users can try this out without having to recompile Lustre.

The size should be controllable dynamically, but for the feature Lustre
has to be recompiled anyway. 


>   
> It's not scalable to have a pool per OSC because Lustre can support
> up to 2000 stripes. However, we don't need to worry about wide stripe
> problem because no one can write a full stripe with even 1MB stripe
> size, because that means application has to issue 2GB size of write.

Does that mean we have 2000 stripes and 2000 messages/RPCs at
the same time? And we need to be able to compress all 2000 stripes at
the same time to avoid blocking? Is there no limit on how many one
client can have at one point in time? 

> Yes, it's reasonable to have a global pool for each client node.
> Let's start from this number but please make it adjustable via sysfs
> or procfs.

I am still not sure how large we can make it. If we really need 16 GiB
for that pool to serve the compression threads quickly, it is not
doable and we have to think differently. If we have a smaller pool than
we need, we have to block or skip compression, which is undesirable.
But I don't know how to determine the required size. 

Thanks! 

Best regards,
Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-25 14:25               ` Anna Fuchs
@ 2017-07-26 18:26                 ` Patrick Farrell
  2017-07-26 20:17                   ` Xiong, Jinshan
  2017-07-27  8:26                   ` Anna Fuchs
  0 siblings, 2 replies; 27+ messages in thread
From: Patrick Farrell @ 2017-07-26 18:26 UTC (permalink / raw)
  To: lustre-devel

Anna,

Having reread your LAD presentation (I was there, but it's been a while...), I think you've got a good architecture.

A few thoughts.

1. Jinshan was just suggesting including in the code a switch to enable/disable the feature at runtime, for an example, see his fast read patch:
https://review.whamcloud.com/#/c/20255/
Especially the proc section:
https://review.whamcloud.com/#/c/20255/7/lustre/llite/lproc_llite.c
The effect of that is a file in proc that one can use to disable/enable the feature by echoing 0 or 1.
(I think there is probably a place for tuning beyond that, but that's separate.)
This is great for features that may have complex impacts, and also for people who want to test a feature to see how it changes things.
2. Lustre clients iterate over the stripes, basically.

Here's an explanation of the write path on the client that should help.  This explanation is heavily simplified and incorrect in some of the details, but should be accurate enough for your question.
The I/O model on the client (for buffered I/O, direct I/O is different) is that the writing process (userspace process) starts an I/O, then identifies which parts of the I/O go to which stripes, gets the locks it needs, then copies the data through the page cache...  Once the data is copied to the page cache, Lustre then works on writing out that data.  In general, it does it asynchronously, where the userspace process returns and then data write-out is handled by the ptlrpcd (daemon) threads, but in various exceptional conditions it may do the write-out in the userspace process.

In general, the write out is going to happen in parallel (to different OSTs) with different ptlrpcd threads taking different chunks of data and putting them on the wire, and sometimes the userspace thread doing that work for some of the data as well.

So "How much memory do we need at most at the same time?" is not a question with an easy answer.  When doing a bulk RPC, generally, the sender sends an RPC announcing the bulk data is ready, then the recipient copies the data (RDMA) (or the sender sends it over to a buffer if no RDMA) and announces to the client that it has done so.  I'm not 100% clear on the sequencing here, but the key thing is there's a time where we've sent the RPC but we aren't done with the buffer.  So we can send another RPC before that buffer is retired.  (If I've got this badly wrong, I hope someone will correct me.)

So the total amount of memory required to do this is going to depend on how fast data is being sent, rather than on the # of OSTs or any other constant.

There *is* a per OST limit to how many RPCs a client can have in flight at once, but it's generally set so the client can get good performance to one OST.  Allocating data for max_rpcs_in_flight*num OSTs would be far too much, because in the 1000 OST case, a client can probably only have a few hundred RPCs in flight (if that...) at once on a normal network.

But if we are writing from one client to many OSTs, how many RPCs are in flight at once is going to depend more on how fast our network is (or, possibly, CPU on the client if the network is fast and/or CPU is slow) than any explicit limits.  The explicit limits are much higher than we will hit in practice.

Does that make sense?  It doesn't make your problem any easier...

It actually seems like maybe a global pool of pages *is* the right answer.  The question is how big to make it...
What about making it grow on demand up to a configurable upper limit?

The allocation code for encryption is here (it's pretty complicated and it works on the assumption that it must get pages or return ENOMEM - The compression code doesn't absolutely have to get pages, so it could be changed):
sptlrpc_enc_pool_get_pages

It seems like maybe that code could be adjusted to serve both the encryption case (must not fail: if it can't get memory, return -ENOMEM to cause retries) and the compression case (can fail: if it fails, it should just not do compression...  maybe it should also consume less memory)
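
A rough sketch of that dual-mode behaviour (everything below is made up for illustration and is not the existing sptlrpc_enc_pool_get_pages() code):

#include <linux/atomic.h>
#include <linux/wait.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical shared pool budget, counted in pages (50 MiB / 4 KiB). */
static atomic_t pool_free_pages = ATOMIC_INIT(12800);
static DECLARE_WAIT_QUEUE_HEAD(pool_waitq);

/*
 * Encryption-style callers pass may_fail = false and sleep until pages
 * are available; compression callers pass may_fail = true and simply
 * skip compression when they get -ENOMEM.
 */
static int pool_get_budget(unsigned int npages, bool may_fail)
{
	while (atomic_sub_return(npages, &pool_free_pages) < 0) {
		atomic_add(npages, &pool_free_pages);	/* undo the claim */
		if (may_fail)
			return -ENOMEM;
		wait_event(pool_waitq,
			   atomic_read(&pool_free_pages) >= (int)npages);
	}
	return 0;
}

static void pool_put_budget(unsigned int npages)
{
	atomic_add(npages, &pool_free_pages);
	wake_up(&pool_waitq);
}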

About thread counts:
Encryption is handled in the ptlrpc code, and your presentation noted the plan is to mimic that, which sounds good to me.  That means there's no reason for you to explicitly control the number of threads doing compression, the same number of threads doing sending will be doing compression, which seems fine.  (Unless there's some point of contention in the compression code, but that seems unlikely...)

Hope that helps a bit.

- Patrick




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-26 18:26                 ` Patrick Farrell
@ 2017-07-26 20:17                   ` Xiong, Jinshan
  2017-07-27  8:26                     ` Anna Fuchs
  2017-07-27  8:26                   ` Anna Fuchs
  1 sibling, 1 reply; 27+ messages in thread
From: Xiong, Jinshan @ 2017-07-26 20:17 UTC (permalink / raw)
  To: lustre-devel

Thanks for the detailed explanation from Patrick.

"Does that mean, ... Is there not any limit how many one
client can have at one point of time?"

In theory, it's possible that there exist that many active RPCs at one time, which is why I think it's not feasible to have a per-OSC page pool.

"... Once we have a smaller pool than
we need, we have to block or skip compression, which is undesirable.
But I don't know how to determine the required size."

It's probably not good to skip compression once the pool runs out of pages; instead it should block waiting for pages to become available. It will spend some time waiting for the available pages, but in the end it will transfer less data over the network, and the OST will also write less data to disk, so it can still be performant.
Of course, we can make it smarter by checking whether there are too many threads waiting for available pages, and in that case decide not to compress some RPCs. But this work can be deferred until we have the code running and can tune it with actual workloads.

In order to decide the size of the pool, we should consider the number of CPUs on the client node and the default RPC size. Let's start with MAX(32, number_of_cpus) * default_RPC_size; the default RPC size is 4MB in 2.10+ releases.
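
To spell that out (only a sketch, with illustrative names):

#include <linux/cpumask.h>
#include <linux/kernel.h>

#define COMP_DEFAULT_RPC_SIZE	(4 << 20)	/* 4 MiB default RPC in 2.10+ */

/* MAX(32, number_of_cpus) * default_RPC_size */
static unsigned long comp_pool_default_bytes(void)
{
	return (unsigned long)max_t(unsigned int, 32, num_online_cpus()) *
	       COMP_DEFAULT_RPC_SIZE;
}

So a 64-core client would start with 64 * 4 MiB = 256 MiB, and small nodes bottom out at 32 * 4 MiB = 128 MiB.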

Jinshan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-26 18:26                 ` Patrick Farrell
  2017-07-26 20:17                   ` Xiong, Jinshan
@ 2017-07-27  8:26                   ` Anna Fuchs
  2017-07-27 19:22                     ` Patrick Farrell
  1 sibling, 1 reply; 27+ messages in thread
From: Anna Fuchs @ 2017-07-27  8:26 UTC (permalink / raw)
  To: lustre-devel

Patrick, 

> Having reread your LAD presentation (I was there, but it's been a
> while...), I think you've got a good architecture.

There have been some changes since then, but the general approach should
be the same.

> A few thoughts.
> 
> 1. Jinshan was just suggesting including in the code a switch to
> enable/disable the feature at runtime, for an example, see his fast
> read patch:
> https://review.whamcloud.com/#/c/20255/
> Especially the proc section:
> https://review.whamcloud.com/#/c/20255/7/lustre/llite/lproc_llite.c
> The effect of that is a file in proc that one can use to
> disable/enable the feature by echoing 0 or 1.
> (I think there is probably a place for tuning beyond that, but that's
> separate.)
> This is great for features that may have complex impacts, and also
> for people who want to test a feature to see how it changes things.

Oh, I misunderstood Jinshan last time, sorry. Yes, that would be much
easier for users and should be possible. Thank you for the references!
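
Even something as simple as a writable module parameter would give a
first version of that switch (just an illustration with a made-up name;
the real code would presumably go through Lustre's lprocfs/sysfs helpers
instead):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Writable on/off knob under /sys/module/<module>/parameters/,
 * so one can echo 0 or 1 into it at runtime. */
static bool compression_enabled = true;
module_param(compression_enabled, bool, 0644);
MODULE_PARM_DESC(compression_enabled, "enable client-side compression");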

> 2. Lustre clients iterate over the stripes, basically.
> 
> Here's an explanation of the write path on the client that should
> help.  This explanation is heavily simplified and incorrect in some
> of the details, but should be accurate enough for your question.
> The I/O model on the client (for buffered I/O, direct I/O is
> different) is that the writing process (userspace process) starts an
> I/O, then identifies which parts of the I/O go to which stripes, gets
> the locks it needs, then copies the data through the page cache... 
> Once the data is copied to the page cache, Lustre then works on
> writing out that data.  In general, it does it asynchronously, where
> the userspace process returns and then data write-out is handled by
> the ptlrpcd (daemon) threads, but in various exceptional conditions
> it may do the write-out in the userspace process.
> 
> In general, the write out is going to happen in parallel (to
> different OSTs) with different ptlrpcd threads taking different
> chunks of data and putting them on the wire, and sometimes the
> userspace thread doing that work for some of the data as well.
> 
> So "How much memory do we need at most at the same time?" is not a
> question with an easy answer.  When doing a bulk RPC, generally, the
> sender sends an RPC announcing the bulk data is ready, then the
> recipient copies the data (RDMA) (or the sender sends it over to a
> buffer if no RDMA) and announces to the client it has done so.  I'm
> not 100% clear on the sequencing here, but the key thing is there's a
> time where we've sent the RPC but we aren't done with the buffer.  So
> we can send another RPC before that buffer is retired.  (If I've got
> this badly wrong, I hope someone will correct me.
> 
> So the total amount of memory required to do this is going to depend
> on how fast data is being sent, rather than on the # of OSTs or any
> other constant.
> 
> There *is* a per OST limit to how many RPCs a client can have in
> flight at once, but it's generally set so the client can get good
> performance to one OST.  Allocating data for max_rpcs_in_flight*num
> OSTs would be far too much, because in the 1000 OST case, a client
> can probably only have a few hundred RPCs in flight (if that...) at
> once on a normal network.
> 
> But if we are writing from one client to many OSTs, how many RPCs are
> in flight at once is going to depend more on how fast our network is
> (or, possibly, CPU on the client if the network is fast and/or CPU is
> slow) than any explicit limits.  The explicit limits are much higher
> than we will hit in practice.
> 
> Does that make sense?  It doesn't make your problem any easier...

Totally, and you are right, it is more complex than I hoped. 

> 
> It actually seems like maybe a global pool of pages *is* the right
> answer.  The question is how big to make it...
> What about making it grow on demand up to a configurable upper limit?
> 
> The allocation code for encryption is here (it's pretty complicated
> and it works on the assumption that it must get pages or return
> ENOMEM - The compression code doesn't absolutely have to get pages,
> so it could be changed):
> sptlrpc_enc_pool_get_pages
> 
> It seems like maybe that code could be adjusted to serve both the
> encryption case (must not fail, if it can't get memory, return
> -ENOMEM to cause retries), and the compression case (can fail, if it
> fails, should not do compression...  Maybe should consume less
> memory)

Currently we are not very close to the sptlrpc layer and do not use any
of the encryption structures (it was initially planned, but turned out
differently). But we have already looked at those pools.

> 
> About thread counts:
> Encryption is handled in the ptlrpc code, and your presentation noted
> the plan is to mimic that, which sounds good to me.  That means
> there's no reason for you to explicitly control the number of threads
> doing compression, the same number of threads doing sending will be
> doing compression, which seems fine.  (Unless there's some point of
> contention in the compression code, but that seems unlikely...)

We currently intervene before the request is created
(osc_brw_prep_request), but we still don't do anything explicitly with
threads; we just add some more work to the existing ones. Dealing with
limited resources is more the later part, where we will optimize, tune
and introduce the adaptive part. 

> 
> Hope that helps a bit.

It helps a lot! Thank you!

Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-26 20:17                   ` Xiong, Jinshan
@ 2017-07-27  8:26                     ` Anna Fuchs
  0 siblings, 0 replies; 27+ messages in thread
From: Anna Fuchs @ 2017-07-27  8:26 UTC (permalink / raw)
  To: lustre-devel

Jinshan,

On Wed, 2017-07-26 at 20:17 +0000, Xiong, Jinshan wrote:
> Thanks for the detailed explanation from Patrick.
>  
> "Does that mean, ... Is there not any limit how many one
> client can have at one point of time?"
>  
> In theory, it's possible that there exist that many active RPCs at
> one time, which is why I think it's not feasible to have per-OSC page
> pool.

I see and agree, that could become a problem.

>  
> "... Once we have a smaller pool than
> we need, we have to block or skip compression, which is undesirable.
> But I don't know how to determine the required size."
>  
> It's probably not good to skip compression once it runs out of pages
> in pool, instead it should be blocked waiting for pages to be
> available. It will spend some time on waiting for the available
> pages, but at the end it will transfer less data over the network,
> and OST will also write less data to disk, so that it can still be
> performant.
> Of course, we can make it smarter by checking if there are too many
> threads waiting for available pages, and in that case we decide to
> not compress some RPCs. But this work can be deferred to the time
> after we have the code running and tune it by actual workload.

You are totally right, the smart tuning will be the second part of our
work, when the infrastructure is done. To start with something, we can
currently just block. 

> In order to decide the size of the pool, we should consider the
> number of CPUs on the client node, and the default RPC size. Let's
> start with MAX(32, number_of_cpus) * default_RPC_size, and the
> default RPC size is 4MB in 2.10+ releases.

Thank you very much for the suggestions and explanations!

Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-27  8:26                   ` Anna Fuchs
@ 2017-07-27 19:22                     ` Patrick Farrell
  2017-07-28  9:57                       ` Anna Fuchs
  0 siblings, 1 reply; 27+ messages in thread
From: Patrick Farrell @ 2017-07-27 19:22 UTC (permalink / raw)
  To: lustre-devel

Anna,


I would be happy to help with review, etc, on this once it's ready to be posted.


In the meantime, I am curious about how you handled the compression and the discontiguous set of pages problem.  Did you use scatter-gather lists like the encryption code does, or some other solution?


Are you willing/able to share the current code, perhaps even off list?  I certainly understand if not, but I am curious to see how it will work and explore the performance implications.


- Patrick

________________________________
From: Anna Fuchs <anna.fuchs@informatik.uni-hamburg.de>
Sent: Thursday, July 27, 2017 3:26:00 AM
To: Patrick Farrell; Xiong, Jinshan
Cc: Matthew Ahrens; Zhuravlev, Alexey; lustre-devel
Subject: Re: [lustre-devel] Design proposal for client-side compression

Patrick,

> Having reread your LAD presentation (I was there, but it's been a
> while...), I think you've got a good architecture.

There have been some changes since that, but the general things should
be the same.

> A few thoughts.
>
> 1. Jinshan was just suggesting including in the code a switch to
> enable/disable the feature at runtime, for an example, see his fast
> read patch:
> https://review.whamcloud.com/#/c/20255/
> Especially the proc section:
> https://review.whamcloud.com/#/c/20255/7/lustre/llite/lproc_llite.c
> The effect of that is a file in proc that one can use to
> disable/enable the feature by echoing 0 or 1.
> (I think there is probably a place for tuning beyond that, but that's
> separate.)
> This is great for features that may have complex impacts, and also
> for people who want to test a feature to see how it changes things.

Oh, I misunderstood Jinshan last time, sorry. Yes, it would be much
easier for users and should be possible. Thank you for references!

> 2. Lustre clients iterate over the stripes, basically.
>
> Here's an explanation of the write path on the client that should
> help.  This explanation is heavily simplified and incorrect in some
> of the details, but should be accurate enough for your question.
> The I/O model on the client (for buffered I/O, direct I/O is
> different) is that the writing process (userspace process) starts an
> I/O, then identifies which parts of the I/O go to which stripes, gets
> the locks it needs, then copies the data through the page cache...
> Once the data is copied to the page cache, Lustre then works on
> writing out that data.  In general, it does it asynchronously, where
> the userspace process returns and then data write-out is handled by
> the ptlrpcd (daemon) threads, but in various exceptional conditions
> it may do the write-out in the userspace process.
>
> In general, the write out is going to happen in parallel (to
> different OSTs) with different ptlrpcd threads taking different
> chunks of data and putting them on the wire, and sometimes the
> userspace thread doing that work for some of the data as well.
>
> So "How much memory do we need at most at the same time?" is not a
> question with an easy answer.  When doing a bulk RPC, generally, the
> sender sends an RPC announcing the bulk data is ready, then the
> recipient copies the data (RDMA) (or the sender sends it over to a
> buffer if no RDMA) and announces to the client it has done so.  I'm
> not 100% clear on the sequencing here, but the key thing is there's a
> time where we've sent the RPC but we aren't done with the buffer.  So
> we can send another RPC before that buffer is retired.  (If I've got
> this badly wrong, I hope someone will correct me.
>
> So the total amount of memory required to do this is going to depend
> on how fast data is being sent, rather than on the # of OSTs or any
> other constant.
>
> There *is* a per OST limit to how many RPCs a client can have in
> flight at once, but it's generally set so the client can get good
> performance to one OST.  Allocating data for max_rpcs_in_flight*num
> OSTs would be far too much, because in the 1000 OST case, a client
> can probably only have a few hundred RPCs in flight (if that...) at
> once on a normal network.
>
> But if we are writing from one client to many OSTs, how many RPCs are
> in flight at once is going to depend more on how fast our network is
> (or, possibly, CPU on the client if the network is fast and/or CPU is
> slow) than any explicit limits.  The explicit limits are much higher
> than we will hit in practice.
>
> Does that make sense?  It doesn't make your problem any easier...

Totally, and you are right, it is more complex than I hoped.

>
> It actually seems like maybe a global pool of pages *is* the right
> answer.  The question is how big to make it...
> What about making it grow on demand up to a configurable upper limit?
>
> The allocation code for encryption is here (it's pretty complicated
> and it works on the assumption that it must get pages or return
> ENOMEM - The compression code doesn't absolutely have to get pages,
> so it could be changed):
> sptlrpc_enc_pool_get_pages
>
> It seems like maybe that code could be adjusted to serve both the
> encryption case (must not fail, if it can't get memory, return
> -ENOMEM to cause retries), and the compression case (can fail, if it
> fails, should not do compression...  Maybe should consume less
> memory)

Currently we are not working very close to the sptlrpc layer and do not
use any of the encryption structures (that was the initial plan, but it
turned out differently). We have, however, already looked at those pools.
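
To make the grow-on-demand idea concrete, a minimal sketch of the kind
of pool we have in mind could look like the following. All names are
made up for illustration and this is not the existing
sptlrpc_enc_pool_* code; the key difference to the encryption pool is
that a failed allocation simply means "do not compress this RPC"
instead of returning -ENOMEM.

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/spinlock.h>

struct cpt_page_pool {
	spinlock_t	  lock;
	struct page	**pages;     /* cache of free pages, sized for max_pages at init */
	unsigned int	  nr_free;   /* pages currently cached */
	unsigned int	  nr_total;  /* pages handed out plus cached */
	unsigned int	  max_pages; /* configurable upper limit */
};

/* Get one page for compressed output; NULL means "skip compression". */
static struct page *cpt_pool_get_page(struct cpt_page_pool *pool)
{
	struct page *page;

	spin_lock(&pool->lock);
	if (pool->nr_free > 0) {
		page = pool->pages[--pool->nr_free];
		spin_unlock(&pool->lock);
		return page;
	}
	if (pool->nr_total >= pool->max_pages) {  /* hit the upper limit */
		spin_unlock(&pool->lock);
		return NULL;
	}
	pool->nr_total++;                         /* grow on demand */
	spin_unlock(&pool->lock);

	page = alloc_page(GFP_NOFS);
	if (page == NULL) {                       /* give the slot back */
		spin_lock(&pool->lock);
		pool->nr_total--;
		spin_unlock(&pool->lock);
	}
	return page;
}

static void cpt_pool_put_page(struct cpt_page_pool *pool, struct page *page)
{
	spin_lock(&pool->lock);
	pool->pages[pool->nr_free++] = page;      /* keep cached for reuse */
	spin_unlock(&pool->lock);
}

A shrinker could later trim the cached pages back when the system is
under memory pressure, but that is beyond this sketch.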

>
> About thread counts:
> Encryption is handled in the ptlrpc code, and your presentation noted
> the plan is to mimic that, which sounds good to me.  That means
> there's no reason for you to explicitly control the number of threads
> doing compression, the same number of threads doing sending will be
> doing compression, which seems fine.  (Unless there's some point of
> contention in the compression code, but that seems unlikely...)

We currently hook in before the request is created
(osc_brw_prep_request), but we still don't do anything explicit with
threads; we just put some more work onto the existing ones. Dealing with
limited resources belongs more to the later phase, where we will
optimize, tune and introduce the adaptive part.

>
> Hope that helps a bit.

It helps a lot! Thank you!

Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-27 19:22                     ` Patrick Farrell
@ 2017-07-28  9:57                       ` Anna Fuchs
  2017-07-28 13:46                         ` Patrick Farrell
  0 siblings, 1 reply; 27+ messages in thread
From: Anna Fuchs @ 2017-07-28  9:57 UTC (permalink / raw)
  To: lustre-devel

Patrick,

On Thu, 2017-07-27 at 19:22 +0000, Patrick Farrell wrote:
> Ann,
> 
> I would be happy to help with review, etc, on this once it's ready to
> be posted.

thanks for that! 

> In the meantime, I am curious about how you handled the compression
> and the discontiguous set of pages problem.  Did you use scatter-
> gather lists like the encryption code does, or some other solution?

I am mainly working on the infrastructure and Lustre/ZFS integration
regardless of the concrete algorithm, but we faced this problem very
early. In my prototype I still have the very costly approach of
allocating three contiguous buffers (src, dst, wrkmem), allocating
additional destination pages, copying the original pages into the void*
src buffer, compressing into the void* dst buffer and copying again into
the dst page buffer. A lot of expensive copies and wasted memory. But
with the original kernel LZ4 interface there is no other way. I can send
you the corresponding code part, but it is totally boring - alloc,
alloc, alloc... copy, copy, ... copy.
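
For the record, that copy-heavy path looks roughly like the sketch
below. It is simplified and assumes the Linux 4.11+ in-kernel LZ4
interface, where LZ4_compress_default() takes an extra work-memory
argument; the names and error handling are illustrative and differ from
our actual prototype.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/highmem.h>
#include <linux/lz4.h>

/* Compress nr_pages source pages into pre-allocated destination pages. */
static int chunk_compress_copyin(struct page **src_pages, int nr_pages,
				 struct page **dst_pages, int *dst_len)
{
	size_t chunk = nr_pages * PAGE_SIZE;
	char *src, *dst, *wrkmem;
	int clen, i, rc = 0;

	src = vmalloc(chunk);                     /* alloc #1: staging source */
	dst = vmalloc(LZ4_compressBound(chunk));  /* alloc #2: staging dest   */
	wrkmem = vmalloc(LZ4_MEM_COMPRESS);       /* alloc #3: LZ4 work area  */
	if (src == NULL || dst == NULL || wrkmem == NULL) {
		rc = -ENOMEM;
		goto out;
	}

	for (i = 0; i < nr_pages; i++) {          /* copy #1: pages -> src */
		char *kaddr = kmap(src_pages[i]);

		memcpy(src + i * PAGE_SIZE, kaddr, PAGE_SIZE);
		kunmap(src_pages[i]);
	}

	clen = LZ4_compress_default(src, dst, chunk,
				    LZ4_compressBound(chunk), wrkmem);
	if (clen <= 0) {                          /* error or no gain */
		rc = -EINVAL;
		goto out;
	}

	for (i = 0; i * PAGE_SIZE < clen; i++) {  /* copy #2: dst -> pages */
		char *kaddr = kmap(dst_pages[i]);

		memcpy(kaddr, dst + i * PAGE_SIZE,
		       min_t(size_t, PAGE_SIZE, clen - i * PAGE_SIZE));
		kunmap(dst_pages[i]);
	}
	*dst_len = clen;
out:
	vfree(src);
	vfree(dst);
	vfree(wrkmem);
	return rc;
}

Three contiguous allocations and two full copies of the data for every
chunk; this is exactly the overhead the page-based LZ4 variant is meant
to avoid.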

In parallel to my work, we assigned a student to adapt LZ4 to the page
structure. Our first idea was also to use scatter-lists, as seen in the
encryption code. Since scatter-lists are built on linked lists, they
turned out to be rather inefficient for traversing the data. The
corresponding Bachelor's thesis will be submitted soon (within a
month?), so we still have to proofread it in detail. However, the
student implemented another version of LZ4 which works directly on
pages (code partly online, full version will follow).
It is tested, but might not be production-ready yet (hopefully it will
be after submission and review). This version shows a slightly lower
compression ratio but comparable or better speed. We will see how we can
use it to avoid the memory and copy overhead. It seems there is no good
way to change only the data structure cleanly without also changing the
de-/compressor's logic.

Another interesting thing is the recent LZ4m [0], which is similar to
our student's work in many aspects but still differs (we are waiting for
the final thesis).

However, for LZ4 we see good chances to get rid of that overhead with
whichever modification we end up using. But since we want to, and
should, support more algorithms, we still do not have a universal
solution. E.g. for zstd, which also seems suitable for our needs
(another thesis...), we would need to make the same effort again or pay
the overhead.

[0] http://csl.skku.edu/papers/icce17.pdf

If anyone has other ideas, please, let me know!

Regards,
Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-28  9:57                       ` Anna Fuchs
@ 2017-07-28 13:46                         ` Patrick Farrell
  2017-07-28 15:12                           ` Anna Fuchs
  0 siblings, 1 reply; 27+ messages in thread
From: Patrick Farrell @ 2017-07-28 13:46 UTC (permalink / raw)
  To: lustre-devel


Ah.  As it turns out, much more complicated than I anticipated.  Thanks for explaining...


I have no expertise in compression algorithms, so that I will have to just watch from the sidelines.  Good luck.

When you are further along, I remain interested in helping out with the Lustre side of things.


One more question - Do you have a plan to make this work *without* the ZFS integration as well, for those using ldiskfs?  That seems straightforward enough - compress/decompress at send and receive time - even if the benefits would be smaller, but not everyone (Cray, f.x.) is using ZFS, so I'm very interested in something that would help ldiskfs as well.  (Which is not to say don't do the deeper integration with ZFS.  Just that we'd like something available for ldiskfs too.)


Regards,

- Patrick

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-28 13:46                         ` Patrick Farrell
@ 2017-07-28 15:12                           ` Anna Fuchs
  2017-07-28 16:53                             ` Patrick Farrell
  0 siblings, 1 reply; 27+ messages in thread
From: Anna Fuchs @ 2017-07-28 15:12 UTC (permalink / raw)
  To: lustre-devel


> Ah.  As it turns out, much more complicated than I anticipated.
>  Thanks for explaining...
> 
> I have no expertise in compression algorithms, so that I will have to
> just watch from the sidelines.  Good luck.
> 
> When you are further along, I remain interested in helping out with
> the Lustre side of things.
> 
> One more question - Do you have a plan to make this work *without*
> the ZFS integration as well, for those using ldiskfs?  That seems
> straightforward enough - compress/decompress at send and receive time
> - even if the benefits would be smaller, but not everyone (Cray,
> f.x.) is using ZFS, so I'm very interested in something that would
> help ldiskfs as well.  (Which is not to say don't do the deeper
> integration with ZFS.  Just that we'd like something available for
> ldiskfs too.)

I fear it is also much more complicated :)

At the very beginning of the project proposal we hoped we wouldn't need
to touch the server so much. That turned out to be wrong; moreover, we
have to modify not only the Lustre server but also pretty much the
backend itself. We chose ZFS since it already provides a lot of
infrastructure that we would need to implement from scratch in ldiskfs.
Since, at least for me, this is a research project, ldiskfs is out of
scope. Once we have proven the concept, one could re-implement the whole
compression stack for ldiskfs. So it is not impossible, just not our
focus for this project.

Nevertheless, we tried to keep our changes as backend-agnostic as
possible. For example, we need some additional information to be stored
per compressed chunk. One possibility would be to change the block
pointer of ZFS and add those fields, but I don't think anyone except us
would like the BP to be modified :) So we decided to store them as a
header for every chunk. For ldiskfs, since one would need to implement
everything from scratch anyway, one might not need that header, but
could take the required fields into account from the beginning and add
them to ldiskfs' "block pointer". For that reason, we wanted to leave
the compressed data "headerless" on the client side and add the header
only on the server side if the corresponding backend requires it.
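
For illustration only, such a per-chunk header could look roughly like
this; field names, widths and layout are just a sketch, not the final
on-disk format:

#include <linux/types.h>

struct cmp_chunk_header {
	__u32 cch_magic;           /* marks a compressed chunk */
	__u8  cch_algo;            /* compression algorithm used */
	__u8  cch_reserved[3];     /* padding / future use */
	__u32 cch_logical_size;    /* uncompressed (logical) chunk size */
	__u32 cch_compressed_size; /* size of the compressed payload */
} __attribute__((packed));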

Well, we did it, and it even works sometimes, but it looks horrible and
is really counterintuitive. We send less data from the client than ends
up on the OST, recalculate offsets (since we add the header while
receiving on the server side), recalculate the sent and received sizes,
shift buffers by offsets, and so on. The only advantage of this approach
is the client's independence from the backend. We decided the price is
too high. So now I will construct the chunk, including the header, right
after compressing the data on the client side and get rid of all that
offset juggling on the server. But ldiskfs will have to deal with those
ZFS-motivated details.

However, a light version of compression could work with smaller changes
to ldiskfs if we only allow files to be either completely compressed or
not compressed at all and accept potential performance drops from broken
read-ahead (due to gaps within the data).

Hope this is somewhat clearer now.

Regards,
Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-28 15:12                           ` Anna Fuchs
@ 2017-07-28 16:53                             ` Patrick Farrell
  2017-07-31 10:20                               ` Anna Fuchs
  2017-08-03 13:41                               ` Dilger, Andreas
  0 siblings, 2 replies; 27+ messages in thread
From: Patrick Farrell @ 2017-07-28 16:53 UTC (permalink / raw)
  To: lustre-devel

Ah, OK.  Reading this, I understand now that your intention is to keep the data compressed on disk - I hadn't thought through the implications of that fully.  There's obviously a lot of benefit from that.

That said, it seems like it would be relatively straightforward to make a version of this that decompresses the data on arrival at the server, simply unpacking that buffer before writing it to disk.  (Straightforward, that is, once the actual compression/decompression code is ready...)

That obviously takes more CPU on the server side and does not reduce the space required, but...


If you don't mind, when you consider the performant version of the compression code to be ready for at least testing, I'd like to see the code so I can try out the on-the-wire-only compression idea.  It might have significant benefits for a case of interest to me, and if it worked well, it could (long term) probably coexist with the larger on-disk compression idea.  (Since who knows if we'll ever implement the whole thing for ldiskfs.)


Thanks again for engaging with me on this.


- Patrick

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-28 16:53                             ` Patrick Farrell
@ 2017-07-31 10:20                               ` Anna Fuchs
  2017-08-03 13:41                               ` Dilger, Andreas
  1 sibling, 0 replies; 27+ messages in thread
From: Anna Fuchs @ 2017-07-31 10:20 UTC (permalink / raw)
  To: lustre-devel

On Fri, 2017-07-28 at 16:53 +0000, Patrick Farrell wrote:
> Ah, OK.  Reading this, I understand now that your intention is to
> keep the data compressed on disk - I hadn't thought through the
> implications of that fully.  There's obviously a lot of benefit from
> that.
> 
> That said, it seems like it would be relatively straightforward to
> make a version of this that uncompressed the data on arrival at the
> server, simply unpacking that buffer before writing it to disk. 
> (Straightforward, that is, once the actual compression/decompression
> code is ready...)

Sure, this should be easily doable independently of the backend,
although I don't see many use cases where the effort would pay off.

> 
> That obviously takes more CPU on the server side and does not reduce
> the space required, but...
> 
> If you don't mind, when you consider the performant version of the
> compression code to be ready for at least testing, I'd like to see
> the code so I can try out the on-the-wire-only compression idea.  It
> might have significant benefits for a case of interest to me, and if
> it worked well, it could (long term) probably coexist with the larger
> on-disk compression idea.  (Since who knows if we'll ever implement
> the whole thing for ldiskfs.)

Sure, the client part should be stable very soon and I will share it. 

> Thanks again for engaging with me on this.
> 
> - Patrick

Same to you.

Anna

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-07-28 16:53                             ` Patrick Farrell
  2017-07-31 10:20                               ` Anna Fuchs
@ 2017-08-03 13:41                               ` Dilger, Andreas
  2017-08-03 15:55                                 ` Patrick Farrell
  1 sibling, 1 reply; 27+ messages in thread
From: Dilger, Andreas @ 2017-08-03 13:41 UTC (permalink / raw)
  To: lustre-devel

Patrick,
If you consider over-the-wire compression only, you could probably treat it the same as a crypto algorithm that decompresses the data at the server?

Unfortunately, I haven't looked at either the crypto or compression code closely, but one would think that they could share a fair amount of infrastructure.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [lustre-devel] Design proposal for client-side compression
  2017-08-03 13:41                               ` Dilger, Andreas
@ 2017-08-03 15:55                                 ` Patrick Farrell
  0 siblings, 0 replies; 27+ messages in thread
From: Patrick Farrell @ 2017-08-03 15:55 UTC (permalink / raw)
  To: lustre-devel

Yes, agreed.  After digging into it a bit, though, the crypto code proved harder to adjust/add to than I expected, so I've tentatively decided I'd like to piggyback off of what Anna and company do.  At the very least, I would like their version of the compression algorithm, so it seemed best to wait and see what form their code takes.


If it's easily tweakable to do what I'd like, then I will likely just do that. (If nothing else, I badly need to prototype this idea to see how workable it is before getting serious, so I will do that in the easiest way possible...)  If their code isn't easily adjustable to this, I plan to take their improved algorithm and go back to using the crypto infrastructure (effectively adding a new crypto mode).

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-08-03 15:55 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-09 13:07 [lustre-devel] Design proposal for client-side compression Anna Fuchs
2017-01-09 18:05 ` Xiong, Jinshan
2017-01-12 12:15   ` Anna Fuchs
2017-01-17 19:51     ` Xiong, Jinshan
2017-01-18 14:19       ` Anna Fuchs
2017-02-16 14:15         ` Anna Fuchs
2017-02-17 19:15           ` Xiong, Jinshan
2017-02-17 20:29             ` Dilger, Andreas
2017-02-17 21:03               ` Xiong, Jinshan
2017-02-17 21:36                 ` Dilger, Andreas
2017-07-21 15:15           ` Anna Fuchs
2017-07-21 16:43             ` Patrick Farrell
2017-07-21 19:19               ` Xiong, Jinshan
2017-07-25 14:25               ` Anna Fuchs
2017-07-26 18:26                 ` Patrick Farrell
2017-07-26 20:17                   ` Xiong, Jinshan
2017-07-27  8:26                     ` Anna Fuchs
2017-07-27  8:26                   ` Anna Fuchs
2017-07-27 19:22                     ` Patrick Farrell
2017-07-28  9:57                       ` Anna Fuchs
2017-07-28 13:46                         ` Patrick Farrell
2017-07-28 15:12                           ` Anna Fuchs
2017-07-28 16:53                             ` Patrick Farrell
2017-07-31 10:20                               ` Anna Fuchs
2017-08-03 13:41                               ` Dilger, Andreas
2017-08-03 15:55                                 ` Patrick Farrell
2017-07-21 19:12             ` Xiong, Jinshan
