* Ceph write path optimization
@ 2015-07-28 21:08 Somnath Roy
  2015-07-28 21:46 ` Łukasz Redynk
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Somnath Roy @ 2015-07-28 21:08 UTC (permalink / raw)
  To: ceph-devel

Hi,
I finally have a working prototype and have been able to gather some performance comparison data for the changes I talked about in the last performance meeting. Mark's suggestion of a write-up has been pending for a while, so here is a summary of what I am trying to do.

Objective:
-----------

1. Saturate SSD write bandwidth with Ceph + FileStore.
     Most all-flash Ceph deployments so far (as far as I know) keep both data and journal on the same SSD. The SSDs are far from saturated, and Ceph's write performance is dismal compared to the raw hardware. Can we improve that?

2. Ceph write performance is not stable in most cases. Can we get stable performance out of it most of the time?


Findings/optimizations so far
------------------------------------

1. I saw that in a flash environment you need to reduce filestore_max_sync_interval a lot (from the default of 5 min), and thus the benefit of syncfs coalescing writes goes away.

2. We have some logic to determine the max sequence number the filestore can commit. That is adding some latency (>1 ms or so).

3. This delay fills up the journal quickly if I remove all throttles from the filestore/journal.

4. The existing throttle scheme is very difficult to tune.

5. With write-ahead journaling, the commit (op_seq) file is probably redundant, since we can get the last committed seq number from the journal headers at the next OSD start. Given that we need to reduce the sync interval, this extra write only adds to write amplification (plus one extra fsync).

The existing scheme is well suited to an HDD environment, but probably not to flash. So, I made the following changes.

1. First, I removed the extra commit seq file write and changed the journal replay logic accordingly.

2. Each filestore op thread now does an O_DSYNC write followed by posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED); (a rough sketch of this pattern follows after this list).

3. I derived an algorithm that each worker thread executes to determine the max seq it can trim the journal to (one possible shape of this is sketched after this list).

4. Introduced a new throttle scheme that throttles journal writes based on the % of journal space left (also sketched after this list).

5. I saw that this scheme definitely empties the journal faster and is able to saturate the SSD more.

6. But, even if we are not saturating any resource, when both data and journal are on the same drive, both writes suffer latency.

7. With the journal separated out to a different disk, the same code (and also the stock code) runs faster. I am not sure about the exact reason; it is probably something in the underlying layer. Still investigating.

8. Now, if we want to separate out the journal, an SSD is *not an option*. The reason is that after some point we will be limited by SSD bandwidth, and all the writes for N OSDs going to that SSD will wear it out very fast. It would also be a very expensive solution, considering high-end journal SSDs.

9. So, I started experimenting with a small PCIe NVRAM partition (128 MB). With ~4 GB of NVRAM we can host ~32 OSDs (considering that NVRAM durability is much higher). With the stock code as is (without throttling), performance becomes very spiky, for obvious reasons.

10. But, with the above-mentioned changes, I am able to get constant high performance most of the time.

11. I am also trying the existing syncfs codebase (without the op_seq file) plus the throttle scheme I mentioned, on this setup, to see whether we can get stable, improved performance out of it or not. This is still under investigation.

12. The initial benchmark with a single OSD (no replication) looks promising; you can find the draft here:

       https://docs.google.com/document/d/1QYAWkBNVfSXhWbLW6nTLx81a9LpWACsGjWG8bkNwKY8/edit?usp=sharing

13. I still need to try this out with an increasing number of OSDs.

14. I also need to see how this scheme helps when data and journal are on the same SSD.

15. The main challenge I am facing with both schemes is that the XFS metadata flush process (xfsaild) chokes all processes accessing the disk when it wakes up. I can delay it up to a maximum of 30 sec, and if there is a lot of dirty metadata there is a brief performance dip. Even if we are acknowledging writes from, say, the NVRAM journal write, the op threads are still doing getattrs on XFS and those threads get blocked. I tried ext4 and this problem is not there, since it writes metadata synchronously by default, but the overall performance of ext4 is much lower. I am not an expert on filesystems, so any help on this is much appreciated.
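
As a rough illustration of change 2, the per-op-thread write boils down to something like the sketch below. This is not the actual FileStore code; the function name and error handling are illustrative, and it assumes the object file descriptor was opened with O_DSYNC.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>

    // Sketch only: synchronous data write followed by dropping the cached
    // pages, so dirty data never piles up for syncfs/xfsaild to flush later.
    int dsync_write_and_drop(int fd, const void *buf, size_t len, off_t off)
    {
        // With O_DSYNC on the fd, pwrite() does not return until the data
        // (and the metadata needed to read it back) is on stable storage.
        ssize_t r = pwrite(fd, buf, len, off);
        if (r < 0)
            return -errno;

        // The durable copy is on disk, so advise the kernel to drop the
        // page-cache copy; offset/len of 0,0 means "the whole file".
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        return static_cast<int>(r);
    }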
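
Change 3 is not spelled out above, so the sketch below only shows one possible shape of it, under the assumption that every op worker publishes the seq it is currently applying and the journal is trimmed up to just below the oldest in-flight seq. The class and method names are invented for illustration and may not match the prototype.

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <vector>

    // Sketch only: track the oldest transaction still being applied.
    class TrimTarget {
    public:
        explicit TrimTarget(size_t nthreads) : applying_(nthreads) {
            for (auto &a : applying_)
                a.store(kNone);
        }

        // Worker 'tid' starts/finishes applying journal entry 'seq'.
        void start_apply(size_t tid, uint64_t seq) { applying_[tid].store(seq); }
        void finish_apply(size_t tid)              { applying_[tid].store(kNone); }

        // Highest seq that is safe to trim to: one below the oldest seq
        // still in flight, or last_applied if nothing is in flight.
        uint64_t max_trimmable(uint64_t last_applied) const {
            uint64_t oldest = kNone;
            for (const auto &a : applying_)
                oldest = std::min(oldest, a.load());
            return oldest == kNone ? last_applied : oldest - 1;
        }

    private:
        static constexpr uint64_t kNone = std::numeric_limits<uint64_t>::max();
        std::vector<std::atomic<uint64_t>> applying_;
    };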
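
For change 4, a throttle keyed on the fraction of journal space still free could look roughly like this; the thresholds, back-off policy, and names are purely illustrative and are not taken from the prototype.

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <thread>

    // Sketch only: delay journal writers as free journal space shrinks,
    // instead of throttling on fixed ops/bytes limits.
    class JournalSpaceThrottle {
    public:
        explicit JournalSpaceThrottle(uint64_t journal_bytes)
            : journal_size_(journal_bytes) {}

        // Called by a writer before appending 'len' bytes to the journal.
        void wait_for_space(uint64_t len) {
            for (;;) {
                double free_pct = free_fraction(len);
                if (free_pct >= kSoftFree)
                    break;                              // plenty of room
                if (free_pct <= kHardFree) {
                    // Nearly full: block until trimming frees space.
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                    continue;
                }
                // In between: back off in proportion to how full we are.
                auto usec = static_cast<uint64_t>(
                    500.0 * (kSoftFree - free_pct) / (kSoftFree - kHardFree));
                std::this_thread::sleep_for(std::chrono::microseconds(usec));
                break;
            }
            in_flight_ += len;
        }

        // Called after the journal has been trimmed, freeing 'freed' bytes.
        void on_trim(uint64_t freed) { in_flight_ -= freed; }

    private:
        double free_fraction(uint64_t pending) const {
            uint64_t used = in_flight_.load() + pending;
            if (used >= journal_size_)
                return 0.0;
            return 1.0 - static_cast<double>(used) / journal_size_;
        }

        static constexpr double kSoftFree = 0.20;  // start delaying below 20% free
        static constexpr double kHardFree = 0.05;  // block below 5% free
        const uint64_t journal_size_;
        std::atomic<uint64_t> in_flight_{0};
    };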

Mark,
If we have time, we can discuss these results in tomorrow's performance meeting.

Thanks & Regards
Somnath

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Ceph write path optimization
  2015-07-28 21:08 Ceph write path optimization Somnath Roy
@ 2015-07-28 21:46 ` Łukasz Redynk
  2015-07-28 22:03   ` Somnath Roy
                     ` (2 more replies)
  2015-07-29  2:17 ` Haomai Wang
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 13+ messages in thread
From: Łukasz Redynk @ 2015-07-28 21:46 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

Hi,

Have you tried tuning the XFS mkfs options? From mkfs.xfs(8):
a) (log section, -l)
lazy-count=value // default is 0

This changes the method of logging various persistent counters in the
superblock. Under metadata intensive workloads, these counters are
updated and logged frequently enough that the superblock updates
become a serialisation point in the filesystem. The value can be
either 0 or 1.

and b) (data section, -d)

agcount=value // by default is 2 (?)

This is used to specify the number of allocation groups. The data
section of the filesystem is divided into allocation groups to improve
the performance of XFS. More allocation groups imply that more
parallelism can be achieved when allocating blocks and inodes. The
minimum allocation group size is 16 MiB; the maximum size is just
under 1 TiB. The data section of the filesystem is divided into value
allocation groups (default value is scaled automatically based on the
underlying device size).

Lately I have been experimenting with these two, and it appears that setting
lazy-count to 1 and increasing agcount has a positive impact on IOPS,
but unfortunately I don't have any performance numbers for this.
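
For reference, the kind of invocation this suggests would be something like the following (the device name and the agcount value are placeholders to adapt to the actual drive):

    mkfs.xfs -f -d agcount=32 -l lazy-count=1 /dev/sdX1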

-Lukas


2015-07-28 23:08 GMT+02:00 Somnath Roy <Somnath.Roy@sandisk.com>:
> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: Ceph write path optimization
  2015-07-28 21:46 ` Łukasz Redynk
@ 2015-07-28 22:03   ` Somnath Roy
  2015-07-28 23:07   ` Somnath Roy
  2015-07-29  6:57   ` Christoph Hellwig
  2 siblings, 0 replies; 13+ messages in thread
From: Somnath Roy @ 2015-07-28 22:03 UTC (permalink / raw)
  To: Łukasz Redynk; +Cc: ceph-devel

Thanks, Lukas, for the response.
I didn't try lazy-count, but I did try agcount. I saw a post suggesting that *reducing* agcount and directory size may alleviate the xfsaild effect. I have a ~7.5 TB drive, so agcount is at minimum 7. I moved to a 4 TB partition and set agcount to 4, but that didn't help much.
I also tried putting the XFS journal log on a different device, and that didn't help either (maybe because the problem is really about syncing the metadata to the same device).
But I will try lazy-count=1 with an increased agcount and keep you posted.

Regards
Somnath


-----Original Message-----
From: mr.erdk@gmail.com [mailto:mr.erdk@gmail.com] On Behalf Of Lukasz Redynk
Sent: Tuesday, July 28, 2015 2:46 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph write path optimization

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: Ceph write path optimization
  2015-07-28 21:46 ` Łukasz Redynk
  2015-07-28 22:03   ` Somnath Roy
@ 2015-07-28 23:07   ` Somnath Roy
  2015-07-29  6:57   ` Christoph Hellwig
  2 siblings, 0 replies; 13+ messages in thread
From: Somnath Roy @ 2015-07-28 23:07 UTC (permalink / raw)
  To: Łukasz Redynk; +Cc: ceph-devel

Hi Lukas,
According to http://linux.die.net/man/8/mkfs.xfs, lazy-count is set to 1 by default, not 0, on newer kernels. I am using 3.16.0-41-generic, so it should be fine.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy 
Sent: Tuesday, July 28, 2015 3:04 PM
To: 'Łukasz Redynk'
Cc: ceph-devel@vger.kernel.org
Subject: RE: Ceph write path optimization

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Ceph write path optimization
  2015-07-28 21:08 Ceph write path optimization Somnath Roy
  2015-07-28 21:46 ` Łukasz Redynk
@ 2015-07-29  2:17 ` Haomai Wang
  2015-07-29  4:57   ` Somnath Roy
  2015-07-29  6:57 ` Christoph Hellwig
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Haomai Wang @ 2015-07-29  2:17 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Wed, Jul 29, 2015 at 5:08 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi,
> Eventually, I have a working prototype and able to gather some performance comparison data with the changes I was talking about in the last performance meeting. Mark's suggestion of a write up was long pending, so, trying to summarize what I am trying to do.
>
> Objective:
> -----------
>
> 1. Is to saturate SSD write bandwidth with ceph + filestore.
>      Most of the deployment of ceph + all flash so far (as far as I know) is having both data and journal on the same SSD. SSDs are far from saturate and the write performance of ceph is dismal (compare to HW). Can we improve that ?
>
> 2. Ceph write performance in most of the cases are not stable, can we have a stable performance out most of the time ?
>
>
> Findings/Optimization so far..
> ------------------------------------
>
> 1. I saw in flash environment you need to reduce the filestore_max_sync_interval a lot (from default 5min) and thus the benefit of syncfs coalescing and writing is going away.

Default is 5s I think.

>
> 2. We have some logic to determine the max sequence number it can commit. That is adding some latency (>1 ms or so).
>
> 3. This delay is filling up journals quickly if I remove all throttles from the filestore/journal.
>
> 4. Existing throttle scheme is very difficult to tune.
>
> 5. In case of write-ahead journaling the commit file is probably redundant as we can get the last committed seq number from journal headers during next OSD start. The fact that, the sync interval we need to reduce , this extra write will only add more to WA (also one extra fsync).
>
> The existing scheme is well suited for HDD environment, but, probably not for flash. So, I made the following changes.
>
> 1. First, I removed the extra commit seq file write and changed the journal replay stuff accordingly.
>
> 2. Each filestore Op threads is now doing O_DSYNC write followed by posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);

Maybe we could use AIO+DIO here. BTW, always discarding the page cache isn't a
good idea for reads. If we want to give up the page cache, we need to
implement a filestore data buffer cache.

>
> 3. I derived an algorithm that each worker thread is executing to determine the max seq it can trim the journal to.
>
> 4. Introduced a new throttle scheme that will throttle journal write based on the % space left.
>
> 5. I saw that this scheme is definitely emptying up the journal faster and able to saturate the SSD more.
>

I think you mean the filestore workers need to be aware of the journal's
capacity and decide how often to flush.

> [...]



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: Ceph write path optimization
  2015-07-29  2:17 ` Haomai Wang
@ 2015-07-29  4:57   ` Somnath Roy
  0 siblings, 0 replies; 13+ messages in thread
From: Somnath Roy @ 2015-07-29  4:57 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

Haomai,

<< Replies inline, marked with [Somnath].

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com] 
Sent: Tuesday, July 28, 2015 7:18 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph write path optimization

On Wed, Jul 29, 2015 at 5:08 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi,
> Eventually, I have a working prototype and able to gather some performance comparison data with the changes I was talking about in the last performance meeting. Mark's suggestion of a write up was long pending, so, trying to summarize what I am trying to do.
>
> Objective:
> -----------
>
> 1. Is to saturate SSD write bandwidth with ceph + filestore.
>      Most of the deployment of ceph + all flash so far (as far as I know) is having both data and journal on the same SSD. SSDs are far from saturate and the write performance of ceph is dismal (compare to HW). Can we improve that ?
>
> 2. Ceph write performance in most of the cases are not stable, can we have a stable performance out most of the time ?
>
>
> Findings/Optimization so far..
> ------------------------------------
>
> 1. I saw in flash environment you need to reduce the filestore_max_sync_interval a lot (from default 5min) and thus the benefit of syncfs coalescing and writing is going away.

Default is 5s I think.

>
> 2. We have some logic to determine the max sequence number it can commit. That is adding some latency (>1 ms or so).
>
> 3. This delay is filling up journals quickly if I remove all throttles from the filestore/journal.
>
> 4. Existing throttle scheme is very difficult to tune.
>
> 5. In case of write-ahead journaling the commit file is probably redundant as we can get the last committed seq number from journal headers during next OSD start. The fact that, the sync interval we need to reduce , this extra write will only add more to WA (also one extra fsync).
>
> The existing scheme is well suited for HDD environment, but, probably not for flash. So, I made the following changes.
>
> 1. First, I removed the extra commit seq file write and changed the journal replay stuff accordingly.
>
> 2. Each filestore Op threads is now doing O_DSYNC write followed by 
> posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);

Maybe we could use AIO+DIO here. BTW always discard page cache isn't a good idea for reading. If we want to give up page cache, we need to implement a filestore data buffer cache.

[Somnath] I tried O_DIRECT + O_DSYNC, but got similar performance. I didn't try AIO though. Yes, a mixed sequential workload could benefit if we don't discard the page cache.
>
> 3. I derived an algorithm that each worker thread is executing to determine the max seq it can trim the journal to.
>
> 4. Introduced a new throttle scheme that will throttle journal write based on the % space left.
>
> 5. I saw that this scheme is definitely emptying up the journal faster and able to saturate the SSD more.
>

I think you mean filestore worker need to aware of the capacity of journal and decide how often  flushed

[Somnath] No, what I am saying is that this scheme completes the entire backend flush and commit much faster. The syncfs one is probably faster at finishing the transaction (since it is a buffered write), but its commit mechanism is not efficient.

> [...]



--
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Ceph write path optimization
  2015-07-28 21:08 Ceph write path optimization Somnath Roy
  2015-07-28 21:46 ` Łukasz Redynk
  2015-07-29  2:17 ` Haomai Wang
@ 2015-07-29  6:57 ` Christoph Hellwig
  2015-07-29 15:35   ` Somnath Roy
  2015-07-29  7:49 ` Shu, Xinxin
  2015-07-29 14:58 ` Sage Weil
  4 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2015-07-29  6:57 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Tue, Jul 28, 2015 at 09:08:27PM +0000, Somnath Roy wrote:
> 2. Each filestore Op threads is now doing O_DSYNC write followed by
> posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);

Why aren't you using O_DIRECT | O_DSYNC?

> 15. The main challenge I am facing in both the scheme is XFS metadata 
> flush process (xfsaild) is choking all the processes accessing the disk
> when it is waking up. I can delay it till max 30 sec and if there are
> lot of dirty metadata, there is a performance spike down for very brief
> amount of time. Even if we are acknowledging writes from say NVRAM
> journal write, still the opthreads are doing getattrs on the XFS
> and those threads are getting blocked.

Can you send a more detailed report to the XFS lists?  E.g. which locks
you're blocked on and some perf data?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Ceph write path optimization
  2015-07-28 21:46 ` Łukasz Redynk
  2015-07-28 22:03   ` Somnath Roy
  2015-07-28 23:07   ` Somnath Roy
@ 2015-07-29  6:57   ` Christoph Hellwig
  2 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2015-07-29  6:57 UTC (permalink / raw)
  To: Łukasz Redynk; +Cc: Somnath Roy, ceph-devel

On Tue, Jul 28, 2015 at 11:46:06PM +0200, Łukasz Redynk wrote:
> Hi,
> 
> Have you tried to tune XFS mkfs options? From mkfs.xfs(8)
> a) (log section, -l)
> lazy-count=value // by default is 0

It's the default.  And fewer AGs aren't going to help you here.  Please don't
start micro-tuning filesystem options before you understand the problem,
thanks.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: Ceph write path optimization
  2015-07-28 21:08 Ceph write path optimization Somnath Roy
                   ` (2 preceding siblings ...)
  2015-07-29  6:57 ` Christoph Hellwig
@ 2015-07-29  7:49 ` Shu, Xinxin
  2015-07-29 16:00   ` Somnath Roy
  2015-07-29 14:58 ` Sage Weil
  4 siblings, 1 reply; 13+ messages in thread
From: Shu, Xinxin @ 2015-07-29  7:49 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Hi Somnath, do you have any performance data for the journal on a 128 MB NVRAM partition with the Hammer release?

Cheers,
xinxin

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Wednesday, July 29, 2015 5:08 AM
To: ceph-devel@vger.kernel.org
Subject: Ceph write path optimization

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Ceph write path optimization
  2015-07-28 21:08 Ceph write path optimization Somnath Roy
                   ` (3 preceding siblings ...)
  2015-07-29  7:49 ` Shu, Xinxin
@ 2015-07-29 14:58 ` Sage Weil
  2015-07-29 15:53   ` Somnath Roy
  4 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2015-07-29 14:58 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

Hi Somnath,

A few comments!

The throttling changes you've made sound like they are a big improvement.

I'm a little worried about the op_seq change, as I remember that being 
quite fragile, but if we can in fact eliminate it then that would also be 
a win.  Let us know when you have patches we can look at!

When you are doing O_DSYNC, is this in place of the syncfs(2) or in place 
of the WBThrottler?  We can't remove the syncfs() unless we get extremely 
careful about fsyncing directories and omap too...

sage


On Tue, 28 Jul 2015, Somnath Roy wrote:

> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: Ceph write path optimization
  2015-07-29  6:57 ` Christoph Hellwig
@ 2015-07-29 15:35   ` Somnath Roy
  0 siblings, 0 replies; 13+ messages in thread
From: Somnath Roy @ 2015-07-29 15:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: ceph-devel

<<inline

-----Original Message-----
From: Christoph Hellwig [mailto:hch@infradead.org]
Sent: Tuesday, July 28, 2015 11:57 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph write path optimization

On Tue, Jul 28, 2015 at 09:08:27PM +0000, Somnath Roy wrote:
> 2. Each filestore Op threads is now doing O_DSYNC write followed by
> posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);

Why aren't you using O_DIRECT | O_DSYNC?

[Somnath] We can do that, but I saw that we do not gain anything by doing so.
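
To make the comparison concrete, here is a minimal sketch (not FileStore code; the paths, sizes and 4 KB alignment are made up for illustration) of the two patterns being compared: a buffered O_DSYNC write followed by POSIX_FADV_DONTNEED to drop the cached pages, versus an O_DIRECT | O_DSYNC write. The target directory must be on a filesystem that supports O_DIRECT.

/* Sketch only: compare an O_DSYNC buffered write that drops its page-cache
 * pages afterwards with an O_DIRECT | O_DSYNC write.  Paths and sizes are
 * illustrative, not taken from FileStore. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 65536                  /* multiple of 4 KB for O_DIRECT */

static void write_dsync_fadvise(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) { perror("open dsync"); return; }
    if (write(fd, buf, len) != (ssize_t)len)
        perror("write dsync");
    /* Data is durable; drop the cached pages so they do not pile up. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
}

static void write_odirect_dsync(const char *path, size_t len)
{
    void *buf;
    if (posix_memalign(&buf, 4096, len))    /* O_DIRECT needs aligned buffers */
        return;
    memset(buf, 'x', len);
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) { perror("open odirect"); free(buf); return; }
    if (write(fd, buf, len) != (ssize_t)len)
        perror("write odirect");
    close(fd);
    free(buf);
}

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : ".";
    char p1[4096], p2[4096];
    static char buf[BUF_SIZE];

    memset(buf, 'x', sizeof(buf));
    snprintf(p1, sizeof(p1), "%s/dsync_test", dir);
    snprintf(p2, sizeof(p2), "%s/odirect_test", dir);
    write_dsync_fadvise(p1, buf, sizeof(buf));
    write_odirect_dsync(p2, BUF_SIZE);
    return 0;
}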

> 15. The main challenge I am facing in both the scheme is XFS metadata
> flush process (xfsaild) is choking all the processes accessing the
> disk when it is waking up. I can delay it till max 30 sec and if there
> are lot of dirty metadata, there is a performance spike down for very
> brief amount of time. Even if we are acknowledging writes from say
> NVRAM journal write, still the opthreads are doing getattrs on the XFS
> and those threads are getting blocked.

Can you send a more detailed report to the XFS lists?  E.g. which locks you're blocked on and some perf data?

[Somnath] This thread could be helpful: http://oss.sgi.com/archives/xfs/2015-06/msg00111.html
I think I am hitting something similar.
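
For gathering that data, here is a small diagnostic sketch (nothing Ceph- or XFS-specific; run it as root with the OSD pid as the argument, on a kernel that exposes /proc/<pid>/stack): it walks /proc/<pid>/task, picks threads in uninterruptible sleep ('D' state) and prints their kernel stacks, which should show which XFS code path they are stuck in.

/* Diagnostic sketch: print kernel stacks of a process's threads that are
 * in uninterruptible sleep (state 'D').  Run as root: ./dstacks <pid> */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static char thread_state(const char *pid, const char *tid)
{
    char path[256], buf[512];
    snprintf(path, sizeof(path), "/proc/%s/task/%s/stat", pid, tid);
    FILE *f = fopen(path, "r");
    if (!f) return '?';
    if (!fgets(buf, sizeof(buf), f)) { fclose(f); return '?'; }
    fclose(f);
    /* The state is the first field after the ')' that closes the comm name. */
    char *p = strrchr(buf, ')');
    return (p && p[1] == ' ') ? p[2] : '?';
}

static void dump_stack(const char *pid, const char *tid)
{
    char path[256], line[512];
    snprintf(path, sizeof(path), "/proc/%s/task/%s/stack", pid, tid);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    printf("--- tid %s (state D) ---\n", tid);
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    char taskdir[64];
    snprintf(taskdir, sizeof(taskdir), "/proc/%s/task", argv[1]);
    DIR *d = opendir(taskdir);
    if (!d) { perror(taskdir); return 1; }
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;
        if (thread_state(argv[1], de->d_name) == 'D')
            dump_stack(argv[1], de->d_name);
    }
    closedir(d);
    return 0;
}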



^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: Ceph write path optimization
  2015-07-29 14:58 ` Sage Weil
@ 2015-07-29 15:53   ` Somnath Roy
  0 siblings, 0 replies; 13+ messages in thread
From: Somnath Roy @ 2015-07-29 15:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage,
I hope I was able to answer the questions below in the meeting.
As discussed, I will investigate how to make the syncfs mechanism work efficiently.
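
To make that concrete, here is a minimal sketch (not the FileStore implementation; the names, the fixed thread count and the trim hook are invented) of the kind of syncfs-based commit I have in mind: each op thread records the lowest journal seq it still has in flight, the commit thread takes the minimum across threads, makes everything durable with a single syncfs(2) on the filestore mount, and only then declares everything below that point trimmable from the journal.

/* Sketch only: periodic syncfs(2) commit that computes how far the
 * write-ahead journal can be trimmed.  Names, the fixed thread count and
 * the trim hook are invented for illustration. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_OP_THREADS 8
#define SEQ_IDLE UINT64_MAX     /* thread has nothing in flight */

/* Lowest journal seq each op thread has started but not yet completed. */
static uint64_t lowest_inflight[NUM_OP_THREADS];
static uint64_t applied_seq;    /* highest seq applied to the filesystem */
static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;

static void journal_trim_to(uint64_t seq)
{
    /* Placeholder for "advance the journal start pointer" in real code. */
    printf("journal can be trimmed up to seq %llu\n", (unsigned long long)seq);
}

/* One commit cycle: pick the committable seq, make it durable with a single
 * syncfs() instead of per-op O_DSYNC writes, then trim the journal. */
static void commit_cycle(int basedir_fd)
{
    pthread_mutex_lock(&seq_lock);
    uint64_t committable = applied_seq;
    for (int i = 0; i < NUM_OP_THREADS; i++)
        if (lowest_inflight[i] != SEQ_IDLE &&
            lowest_inflight[i] - 1 < committable)
            committable = lowest_inflight[i] - 1;   /* seqs start at 1 */
    pthread_mutex_unlock(&seq_lock);

    if (syncfs(basedir_fd) < 0) {   /* flushes data, dirs and omap on that fs */
        perror("syncfs");
        return;
    }
    journal_trim_to(committable);
}

int main(void)
{
    for (int i = 0; i < NUM_OP_THREADS; i++)
        lowest_inflight[i] = SEQ_IDLE;
    applied_seq = 42;               /* pretend 42 ops have been applied */

    int fd = open("/var/tmp", O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }
    commit_cycle(fd);
    close(fd);
    return 0;
}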

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net] 
Sent: Wednesday, July 29, 2015 7:59 AM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph write path optimization

Hi Somnath,

A few comments!

The throttling changes you've made sound like they are a big improvement.

I'm a little worried about the op_seq change, as I remember that being quite fragile, but if we can in fact eliminate it then that would also be a win.  Let us know when you have patches we can look at!

When you are doing O_DSYNC, is this in place of the syncfs(2) or in place of the WBThrottler?  We can't remove the syncfs() unless we are extremely careful about fsyncing directories and omap too...

sage


On Tue, 28 Jul 2015, Somnath Roy wrote:

> Hi,
> Eventually, I have a working prototype and able to gather some performance comparison data with the changes I was talking about in the last performance meeting. Mark's suggestion of a write up was long pending, so, trying to summarize what I am trying to do.
> 
> Objective:
> -----------
> 
> 1. Is to saturate SSD write bandwidth with ceph + filestore.
>      Most of the deployment of ceph + all flash so far (as far as I know) is having both data and journal on the same SSD. SSDs are far from saturate and the write performance of ceph is dismal (compare to HW). Can we improve that ?
> 
> 2. Ceph write performance in most of the cases are not stable, can we have a stable performance out most of the time ?
> 
> 
> Findings/Optimization so far..
> ------------------------------------
> 
> 1. I saw in flash environment you need to reduce the 
> filestore_max_sync_interval a lot (from default 5min) and thus the 
> benefit of syncfs coalescing and writing is going away.
> 
> 2. We have some logic to determine the max sequence number it can 
> commit. That is adding some latency (>1 ms or so).
> 
> 3. This delay is filling up journals quickly if I remove all throttles 
> from the filestore/journal.
> 
> 4. Existing throttle scheme is very difficult to tune.
> 
> 5. In case of write-ahead journaling the commit file is probably 
> redundant as we can get the last committed seq number from journal 
> headers during next OSD start. The fact that, the sync interval we 
> need to reduce , this extra write will only add more to WA (also one 
> extra fsync).
> 
> The existing scheme is well suited for HDD environment, but, probably 
> not for flash. So, I made the following changes.
> 
> 1. First, I removed the extra commit seq file write and changed the 
> journal replay stuff accordingly.
> 
> 2. Each filestore Op threads is now doing O_DSYNC write followed by 
> posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);
> 
> 3. I derived an algorithm that each worker thread is executing to 
> determine the max seq it can trim the journal to.
> 
> 4. Introduced a new throttle scheme that will throttle journal write 
> based on the % space left.
> 
> 5. I saw that this scheme is definitely emptying up the journal faster 
> and able to saturate the SSD more.
> 
> 6. But, even if we are not saturating any resources, if we are having 
> both data and journal on the same drive, both writes are suffering 
> latencies.
> 
> 7. Separating out journal to different disk , the same code (and also
> stock)  is running faster. Not sure about the exact reason, but, 
> something to do with underlying layer. Still investigating.
> 
> 8. Now, if we want to separate out journal, SSD is *not an option*. 
> The reason is, after some point we will be limited by SSD BW and all 
> writes for N osds going to that SSD will wear out that SSD very fast. 
> Also, this will be a very expensive solution considering high end journal SSD.
> 
> 9. So, I started experimenting with small PCIe NVRAM partition (128 MB). 
> So, If we have ~4GB NVRAM we can put ~32 OSDs in that(considering 
> NVRAM durability is much higher).  The stock code as is (without 
> throttle), the performance is becoming very spiky for obvious reason.
> 
> 10. But, with the above mentioned changes, I am able to make a 
> constant high performance out most of the time.
> 
> 11. I am also trying the existing synfs codebase (without op_seq file) 
> + the throttle scheme I mentioned in this setup to see if we can get 
> out a stable improve performance out or not. This is still under 
> investigation.
> 
> 12. Initial benchmark with single OSD (no replication) looks promising 
> and you can find the draft here.
> 
>        
> https://docs.google.com/document/d/1QYAWkBNVfSXhWbLW6nTLx81a9LpWACsGjWG8bkNwKY8/edit?usp=sharing
> 
> 13. I still need to try this out by increasing number of OSDs.
> 
> 14. Also, need to see how this scheme is helping both data/journal on 
> the same SSD.
> 
> 15. The main challenge I am facing in both the scheme is XFS metadata 
> flush process (xfsaild) is choking all the processes accessing the 
> disk when it is waking up. I can delay it till max 30 sec and if there 
> are lot of dirty metadata, there is a performance spike down for very 
> brief amount of time. Even if we are acknowledging writes from say 
> NVRAM journal write, still the opthreads are doing getattrs on the XFS 
> and those threads are getting blocked. I tried with ext4 and this 
> problem is not there since it is writing metadata synchronously by 
> default, but, the overall performance of ext4 is much less. I am not 
> an expert on filesystem, so, any help on this is much appreciated.
> 
> Mark,
> If we have time, we can discuss this result in tomorrow's performance meeting.
> 
> Thanks & Regards
> Somnath
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: Ceph write path optimization
  2015-07-29  7:49 ` Shu, Xinxin
@ 2015-07-29 16:00   ` Somnath Roy
  0 siblings, 0 replies; 13+ messages in thread
From: Somnath Roy @ 2015-07-29 16:00 UTC (permalink / raw)
  To: Shu, Xinxin, ceph-devel

Xinxin,
I tried that, but if you remove all the throttling the performance is very spiky and not usable, although the peak performance is definitely higher.
I also tried throttling based on the existing options and was able to get a constant, stable performance out, but that performance is low (similar to what we have today).
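
For reference, here is a minimal sketch (not the actual patch; the thresholds, delays and field names are invented) of the free-space-based throttle I mentioned in the original post: journal submits are delayed progressively as the percentage of journal space left shrinks, rather than being gated by the existing fixed queue limits.

/* Sketch only: throttle journal submits on the % of journal space left.
 * Thresholds, delays and field names are illustrative, not the real patch. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct journal_state {
    uint64_t size;      /* total journal bytes           */
    uint64_t used;      /* bytes written but not trimmed */
};

/* How long (in microseconds) a submitter should back off before queueing
 * the next journal write. */
static unsigned throttle_delay_us(const struct journal_state *j)
{
    double pct_free = 100.0 * (double)(j->size - j->used) / (double)j->size;

    if (pct_free > 50.0)
        return 0;       /* plenty of room: full speed            */
    if (pct_free > 20.0)
        return 100;     /* start shaping gently                  */
    if (pct_free > 5.0)
        return 1000;    /* journal filling faster than it trims  */
    return 10000;       /* nearly full: back off hard            */
}

int main(void)
{
    struct journal_state j = { .size = 128ull << 20, .used = 120ull << 20 };
    unsigned d = throttle_delay_us(&j);
    printf("%.1f%% free -> sleep %u us before submitting\n",
           100.0 * (j.size - j.used) / j.size, d);
    usleep(d);
    return 0;
}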

Thanks & Regards
Somnath

-----Original Message-----
From: Shu, Xinxin [mailto:xinxin.shu@intel.com] 
Sent: Wednesday, July 29, 2015 12:50 AM
To: Somnath Roy; ceph-devel@vger.kernel.org
Subject: RE: Ceph write path optimization

Hi Somnath, do you have any performance data for the journal on a 128 MB NVRAM partition with the hammer release?

Cheers,
xinxin

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Wednesday, July 29, 2015 5:08 AM
To: ceph-devel@vger.kernel.org
Subject: Ceph write path optimization

Hi,
Eventually, I have a working prototype and able to gather some performance comparison data with the changes I was talking about in the last performance meeting. Mark's suggestion of a write up was long pending, so, trying to summarize what I am trying to do.

Objective:
-----------

1. Is to saturate SSD write bandwidth with ceph + filestore.
     Most of the deployment of ceph + all flash so far (as far as I know) is having both data and journal on the same SSD. SSDs are far from saturate and the write performance of ceph is dismal (compare to HW). Can we improve that ?

2. Ceph write performance in most of the cases are not stable, can we have a stable performance out most of the time ?


Findings/Optimization so far..
------------------------------------

1. I saw in flash environment you need to reduce the filestore_max_sync_interval a lot (from default 5min) and thus the benefit of syncfs coalescing and writing is going away.

2. We have some logic to determine the max sequence number it can commit. That is adding some latency (>1 ms or so).

3. This delay is filling up journals quickly if I remove all throttles from the filestore/journal.

4. Existing throttle scheme is very difficult to tune.

5. In case of write-ahead journaling the commit file is probably redundant as we can get the last committed seq number from journal headers during next OSD start. The fact that, the sync interval we need to reduce , this extra write will only add more to WA (also one extra fsync).

The existing scheme is well suited for HDD environment, but, probably not for flash. So, I made the following changes.

1. First, I removed the extra commit seq file write and changed the journal replay stuff accordingly.

2. Each filestore Op threads is now doing O_DSYNC write followed by posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);

3. I derived an algorithm that each worker thread is executing to determine the max seq it can trim the journal to.

4. Introduced a new throttle scheme that will throttle journal write based on the % space left.

5. I saw that this scheme is definitely emptying up the journal faster and able to saturate the SSD more.

6. But, even if we are not saturating any resources, if we are having both data and journal on the same drive, both writes are suffering latencies.

7. Separating out journal to different disk , the same code (and also stock)  is running faster. Not sure about the exact reason, but, something to do with underlying layer. Still investigating.

8. Now, if we want to separate out journal, SSD is *not an option*. The reason is, after some point we will be limited by SSD BW and all writes for N osds going to that SSD will wear out that SSD very fast. Also, this will be a very expensive solution considering high end journal SSD.

9. So, I started experimenting with small PCIe NVRAM partition (128 MB). So, If we have ~4GB NVRAM we can put ~32 OSDs in that(considering NVRAM durability is much higher).  The stock code as is (without throttle), the performance is becoming very spiky for obvious reason.

10. But, with the above mentioned changes, I am able to make a constant high performance out most of the time.

11. I am also trying the existing synfs codebase (without op_seq file) + the throttle scheme I mentioned in this setup to see if we can get out a stable improve performance out or not. This is still under investigation.

12. Initial benchmark with single OSD (no replication) looks promising and you can find the draft here.

       https://docs.google.com/document/d/1QYAWkBNVfSXhWbLW6nTLx81a9LpWACsGjWG8bkNwKY8/edit?usp=sharing

13. I still need to try this out by increasing number of OSDs.

14. Also, need to see how this scheme is helping both data/journal on the same SSD.

15. The main challenge I am facing in both the scheme is XFS metadata flush process (xfsaild) is choking all the processes accessing the disk when it is waking up. I can delay it till max 30 sec and if there are lot of dirty metadata, there is a performance spike down for very brief amount of time. Even if we are acknowledging writes from say NVRAM journal write, still the opthreads are doing getattrs on the XFS and those threads are getting blocked. I tried with ext4 and this problem is not there since it is writing metadata synchronously by default, but, the overall performance of ext4 is much less. I am not an expert on filesystem, so, any help on this is much appreciated.

Mark,
If we have time, we can discuss this result in tomorrow's performance meeting.

Thanks & Regards
Somnath


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-07-29 16:00 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-28 21:08 Ceph write path optimization Somnath Roy
2015-07-28 21:46 ` Łukasz Redynk
2015-07-28 22:03   ` Somnath Roy
2015-07-28 23:07   ` Somnath Roy
2015-07-29  6:57   ` Christoph Hellwig
2015-07-29  2:17 ` Haomai Wang
2015-07-29  4:57   ` Somnath Roy
2015-07-29  6:57 ` Christoph Hellwig
2015-07-29 15:35   ` Somnath Roy
2015-07-29  7:49 ` Shu, Xinxin
2015-07-29 16:00   ` Somnath Roy
2015-07-29 14:58 ` Sage Weil
2015-07-29 15:53   ` Somnath Roy
