* [RFC] writeback and cgroup
@ 2012-04-03 18:36 ` Tejun Heo
  0 siblings, 0 replies; 261+ messages in thread
From: Tejun Heo @ 2012-04-03 18:36 UTC (permalink / raw)
  To: Fengguang Wu, Jan Kara, vgoyal, Jens Axboe
  Cc: linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf

Hello, guys.

So, during LSF, I, Fengguang and Jan had a chance to sit down and talk
about how to add cgroup support to writeback.  Here's what I got from it.

Fengguang's opinion is that the throttling algorithm implemented in
writeback is good enough and blkcg parameters can be exposed to
writeback such that those limits can be applied from writeback.  As
for reads and direct IOs, Fengguang opined that the algorithm can
easily be extended to cover those cases and IIUC all IOs, whether
buffered writes, reads or direct IOs, can eventually all go through
the writeback layer, which would be the one layer controlling all IOs.

Unfortunately, I don't agree with that at all.  I think it's a gross
layering violation and lacks any long-term design.  We have a
well-working model of applying and propagating resource pressure - we
apply the pressure where the resource exists and propagate the back
pressure through buffers to upper layers up to the originator.  Think
about networking: the pressure exists or is applied at the in/egress
points, and it gets propagated through socket buffers and eventually
throttles the originator.

Writeback, without cgroup, isn't different.  It constitutes a part of
the pressure propagation chain anchored at the IO device.  IO devices
these days generate very high pressure, which gets propagated through
the IO sched and buffered requests, which in turn creates pressure at
writeback.  Here, the buffering happens in page cache and pressure at
writeback increases the amount of dirty page cache.  Propagating this
IO pressure to the dirtying task is one of the biggest
responsibilities of the writeback code, and this is the underlying
design of the whole thing.

IIUC, without cgroup, the current writeback code works more or less
like this.  Throwing in cgroup doesn't really change the fundamental
design.  Instead of a single pipe going down, we just have multiple
pipes to the same device, each of which should be treated separately.
Of course, a spinning disk can't be divided that easily and their
performance characteristics will be inter-dependent, but the place to
solve that problem is where the problem is, the block layer.

We may have to look for optimizations and expose some details to
improve the overall behavior, and such optimizations may require some
deviation from the fundamental design, but such optimizations should
be justified and such deviations kept at a minimum.  So, no, I don't
think we're gonna expose blkcg / block / elevator parameters directly
to writeback.  Unless someone can *really* convince me otherwise,
I'll be vetoing any change toward that direction.

Let's please keep the layering clear.  IO limitations will be applied
at the block layer, and pressure will be formed there and then
propagated upwards, eventually to the originator.  Sure, exposing the
whole information might result in better behavior for certain
workloads, but down the road, say, in three or five years, devices
which can be shared without worrying too much about seeks might be
commonplace and we could be swearing at a disgusting structural mess,
and sadly various cgroup support seems to be a prominent source of
such design failures.

IMHO, treating each cgroup - device/bdi pair as a separate device
should suffice as the underlying design.  After all, blkio cgroup
support's ultimate goal is dividing the IO resource into separate
bins.  Implementation details might change as the underlying
technology changes and we learn more about how to do it better, but
that is the goal which we'll always try to keep close to.  Writeback
should (be able to) treat them as separate devices.  We surely will
need adjustments and optimizations to make things work at least
somewhat reasonably, but that is the baseline.

In the discussion, the following obstacles to such an implementation
were identified.

* There are a lot of cases where IOs are issued by a task which isn't
  the originator, ie. writeback issues IOs for pages which were
  dirtied by some other tasks.  So, by the time an IO reaches the
  block layer, we don't know which cgroup the IO belongs to.

  Recently, the block layer has grown support to attach a task to a
  bio, which causes the bio to be handled as if it were issued by the
  associated task regardless of the actual issuing task.  It currently
  only allows attaching %current to a bio - bio_associate_current() -
  but changing it to support other tasks is trivial.  We'll need to
  update the async issuers to tag the IOs they issue, but the
  mechanism is already there.

* There's a single request pool shared by all issuers per request
  queue.  This can lead to priority inversion among cgroups.  Note
  that the problem also exists without cgroups: a lower-ioprio issuer
  may be holding a request, holding back a high-prio issuer.

  We'll need to make request allocation cgroup (and hopefully ioprio)
  aware, probably in the form of separate request pools.  This will
  take some work but I don't think this will be too challenging.
  I'll work on it.

* The cfq cgroup policy throws all async IOs, which all buffered
  writes are, into the shared cgroup regardless of the actual cgroup.
  This behavior is, I believe, mostly historical and changing it isn't
  difficult.  Prolly only a few tens of lines of changes.  This may
  cause significant changes to actual IO behavior with cgroups tho.
  I personally think the previous behavior was too wrong to keep (the
  weight was completely ignored for buffered writes) but we may want
  to introduce a switch to toggle between the two behaviors.  Note
  that blk-throttle doesn't have this problem.

* Unlike dirty data pages, metadata tends to have strict ordering
  requirements and thus is susceptible to priority inversion.  Two
  solutions were suggested - 1. allow overdraw for metadata writes so
  that low-prio metadata writes don't block the whole FS; 2. provide
  an interface to query and wait for bdi-cgroup congestion, which can
  be called from FS metadata paths to throttle metadata operations
  before they enter the stream of ordered operations.

  I think a combination of the above two should be enough to solve
  the problem.  I *think* the second can be implemented as part of
  the cgroup-aware request allocation update.  The first one needs a
  bit more thinking, but there can be easier interim solutions
  (e.g. throw META writes to the head of the cgroup queue or just
  plain ignore cgroup limits for META writes) for now.

* I'm sure there are a lot of design choices to be made in the
  writeback implementation, but IIUC Jan seems to agree that the
  simplest would be to simply treat different cgroup-bdi pairs as
  completely separate, which shouldn't add too much complexity to the
  already intricate writeback code.

So, I think we have something which sounds like a plan, which at
least I can agree with and which seems doable without adding a lot of
complexity.  Jan, Fengguang, I'm pretty sure I missed some stuff from
writeback's side and IIUC Fengguang doesn't agree with this approach
too much, so please voice your opinions & comments.

Thank you.

--
tejun

^ permalink raw reply	[flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup
  2012-04-03 18:36 ` Tejun Heo
@ 2012-04-04 14:51   ` Vivek Goyal
  -1 siblings, 0 replies; 261+ messages in thread
From: Vivek Goyal @ 2012-04-04 14:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:

Hi Tejun,

Thanks for the RFC and for looking into this issue.  Few thoughts inline.

[..]
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.

How do you take care of throttling IO to NFS in this model?  Current
throttling logic is tied to the block device, and in the case of NFS
there is no block device.

[..]
> In the discussion, the following obstacles to such an implementation
> were identified.
>
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originator, ie. writeback issues IOs for pages which were
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
>
>   Recently, the block layer has grown support to attach a task to a
>   bio, which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.  We'll need to
>   update the async issuers to tag the IOs they issue, but the
>   mechanism is already there.

Most likely this tagging will take place in "struct page" and I am not
sure if we will be allowed to grow the size of "struct page" for this
reason.

> * There's a single request pool shared by all issuers per request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that the problem also exists without cgroups: a lower-ioprio issuer
>   may be holding a request, holding back a high-prio issuer.
>
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware, probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.
>   I'll work on it.

This should be doable.  I had implemented it long back with a single
request pool but internal limits for each group.  That is, block the
task in the group if the group has enough pending requests allocated
from the pool.  But separate request pools should work equally well.
Just that it conflicts a bit with the current definition of
q->nr_requests, which specifies the total number of outstanding
requests on the queue.  Once you make the pool per group, I guess this
limit will have to be transformed into a per-group upper limit.

> * The cfq cgroup policy throws all async IOs, which all buffered
>   writes are, into the shared cgroup regardless of the actual cgroup.
>   This behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Prolly only a few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups tho.
>   I personally think the previous behavior was too wrong to keep (the
>   weight was completely ignored for buffered writes) but we may want
>   to introduce a switch to toggle between the two behaviors.

I had kept all buffered writes in the same cgroup (root cgroup) for a
few reasons.

- Because of the single request descriptor pool for writes, one writer
  anyway gets backlogged behind the other.  So creating separate async
  queues per group is not going to help.

- Writeback logic was not cgroup aware, so it might not send enough IO
  from each writer to maintain parallelism.  So creating separate
  async queues did not make sense till that was fixed.

- As you said, it is historical also.  We prioritize READS at the
  expense of writes.  Now, by putting buffered/async writes in a
  separate group, we might end up prioritizing one group's async write
  over another group's synchronous read.  How many people really want
  that behavior?  To me, keeping service differentiation among the
  sync IO matters most.  Even if all async IO is treated the same, I
  guess not many people might care.

> Note that blk-throttle doesn't have this problem.

I am not sure what you are trying to say here.  But primarily
blk-throttle will throttle reads and direct IO.  Buffered writes will
go to the root cgroup, which is typically unthrottled.

> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdraw for metadata writes so
>   that low-prio metadata writes don't block the whole FS; 2. provide
>   an interface to query and wait for bdi-cgroup congestion, which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.

So that will probably mean changing the order of operations also.
IIUC, in the case of fsync (ordered mode), we open a metadata
transaction first, then try to flush all the cached data and then
flush metadata.  So if fsync is throttled, all the metadata operations
behind it will get serialized for ext3/ext4.  So you seem to be
suggesting that we change the design so that the metadata operation is
not thrown into the ordered stream till we have finished writing all
the data back to disk?  I am not a filesystem developer, so I don't
know how feasible this change is.

This is just one of the points.  In the past, while talking to Dave
Chinner, he mentioned that in XFS, if two cgroups fall into the same
allocation group, then there were cases where IO of one cgroup can get
serialized behind the other.

In general, the core of the issue is that filesystems are not cgroup
aware, and if you do throttling below filesystems, then invariably one
or another serialization issue will come up, and I am concerned that
we will be constantly fixing those serialization issues.  Or the
design point could be so central to filesystem design that it can't be
changed.

In general, if you do throttling deeper in the stack and build back
pressure, then all the layers sitting above should be cgroup aware to
avoid problems.  Two layers identified so far are writeback and
filesystems.  Is it really worth the complexity?  How about doing
throttling in higher layers, where IO enters the kernel, and keeping
the proportional IO logic at the lowest level, so the current
mechanism of building pressure continues to work?

Why the split?  Proportional IO logic is work conserving, so even if
some serialization happens, that situation should clear up pretty
soon: IO from other cgroups will dry up, IO from the group causing the
serialization will make progress, and at most we will lose fairness
for a certain duration.  With throttling, limits come from the user,
and one can set really low artificial limits.  So even if the
underlying resources are free, the IO from a throttled cgroup might
not make any progress, in turn choking every other cgroup serialized
behind it.

So in general, throttling at the block layer and building back
pressure is fine.  I am concerned about two cases.

- How to handle NFS.

- Do filesystem developers agree with this approach, and are they
  willing to address any serialization issues arising due to this
  design?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-04 14:51 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-04 14:51 UTC (permalink / raw) To: Tejun Heo Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: Hi Tejun, Thanks for the RFC and looking into this issue. Few thoughts inline. [..] > IIUC, without cgroup, the current writeback code works more or less > like this. Throwing in cgroup doesn't really change the fundamental > design. Instead of a single pipe going down, we just have multiple > pipes to the same device, each of which should be treated separately. > Of course, a spinning disk can't be divided that easily and their > performance characteristics will be inter-dependent, but the place to > solve that problem is where the problem is, the block layer. How do you take care of thorottling IO to NFS case in this model? Current throttling logic is tied to block device and in case of NFS, there is no block device. [..] > In the discussion, for such implementation, the following obstacles > were identified. > > * There are a lot of cases where IOs are issued by a task which isn't > the originiator. ie. Writeback issues IOs for pages which are > dirtied by some other tasks. So, by the time an IO reaches the > block layer, we don't know which cgroup the IO belongs to. > > Recently, block layer has grown support to attach a task to a bio > which causes the bio to be handled as if it were issued by the > associated task regardless of the actual issuing task. It currently > only allows attaching %current to a bio - bio_associate_current() - > but changing it to support other tasks is trivial. > > We'll need to update the async issuers to tag the IOs they issue but > the mechanism is already there. 
Most likely this tagging will take place in "struct page" and I am not sure if we will be allowed to grow the size of "struct page" for this reason. > > * There's a single request pool shared by all issuers per request > queue. This can lead to priority inversion among cgroups. Note > that the problem also exists without cgroups. A lower-ioprio issuer may > be holding a request, holding back a high-prio issuer. > > We'll need to make request allocation cgroup (and hopefully ioprio) > aware. Probably in the form of separate request pools. This will > take some work but I don't think this will be too challenging. I'll > work on it. This should be doable. I had implemented it long back with a single request pool but internal limits for each group. That is, block the task in the group if the group has enough pending requests allocated from the pool. But a separate request pool per group should work equally well. It just conflicts a bit with the current definition of q->nr_requests, which specifies the total number of outstanding requests on the queue. Once you make the pool per group, I guess this limit will have to be transformed into a per-group upper limit. > > * cfq cgroup policy throws all async IOs, which all buffered writes > are, into the shared cgroup regardless of the actual cgroup. This > behavior is, I believe, mostly historical and changing it isn't > difficult. Prolly only a few tens of lines of changes. This may > cause significant changes to actual IO behavior with cgroups tho. I > personally think the previous behavior was too wrong to keep (the > weight was completely ignored for buffered writes) but we may want > to introduce a switch to toggle between the two behaviors. I had kept all buffered writes in the same cgroup (root cgroup) for a few reasons. - Because of the single request descriptor pool for writes, one writer anyway gets backlogged behind the other. So creating separate async queues per group is not going to help. - Writeback logic was not cgroup aware.
So it might not send enough IO from each writer to maintain parallelism, so creating separate async queues did not make sense till that was fixed. - As you said, it is historical also. We prioritize READS at the expense of writes. Now by putting buffered/async writes in a separate group, we might end up prioritizing one group's async write over another group's synchronous read. How many people really want that behavior? To me, keeping service differentiation among the sync IO matters most. Even if all async IO is treated the same, I guess not many people might care. > > Note that blk-throttle doesn't have this problem. I am not sure what you are trying to say here. But primarily blk-throttle will throttle reads and direct IO. Buffered writes will go to the root cgroup, which is typically unthrottled. > > * Unlike dirty data pages, metadata tends to have strict ordering > requirements and thus is susceptible to priority inversion. Two > solutions were suggested - 1. allow overdraw for metadata writes so > that low prio metadata writes don't block the whole FS, 2. provide > an interface to query and wait for bdi-cgroup congestion which can > be called from FS metadata paths to throttle metadata operations > before they enter the stream of ordered operations. That will probably mean changing the order of operations also. IIUC, in the case of fsync (ordered mode), we open a metadata transaction first, then try to flush all the cached data and then flush the metadata. So if fsync is throttled, all the metadata operations behind it will get serialized for ext3/ext4. So you seem to be suggesting that we change the design so that metadata operations are not thrown into the ordered stream till we have finished writing all the data back to disk? I am not a filesystem developer, so I don't know how feasible this change is. This is just one of the points.
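The request-pool options discussed above - one shared pool with per-group internal limits, or separate pools with q->nr_requests becoming a per-group cap - can be sketched in miniature. This is a toy Python model with made-up names, purely illustrative of the accounting; the real logic would live in the block layer's request allocation path:

```python
class RequestPool:
    """Toy model of per-cgroup request limits over one shared pool.

    Mirrors the idea of keeping a queue-wide total (nr_requests) while
    each cgroup also gets its own upper limit, so a greedy low-priority
    issuer cannot exhaust the pool and block a high-priority one.
    """

    def __init__(self, nr_requests, per_group_limit):
        self.nr_requests = nr_requests          # queue-wide total
        self.per_group_limit = per_group_limit  # cap per cgroup
        self.in_flight = {}                     # group -> outstanding requests

    def try_alloc(self, group):
        used = self.in_flight.get(group, 0)
        total = sum(self.in_flight.values())
        if total >= self.nr_requests or used >= self.per_group_limit:
            return False                        # the issuer would block here
        self.in_flight[group] = used + 1
        return True

    def complete(self, group):
        # A completed request frees a slot for its group.
        self.in_flight[group] -= 1
```

With this accounting, a low-priority group that floods the queue hits its own cap first, and another group can still allocate - which is exactly the priority-inversion fix both variants aim at.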
In the past while talking to Dave Chinner, he mentioned that in XFS, if two cgroups fall into the same allocation group then there were cases where IO of one cgroup can get serialized behind the other. In general, the core of the issue is that filesystems are not cgroup aware, and if you do throttling below filesystems, then invariably one or another serialization issue will come up, and I am concerned that we will be constantly fixing those serialization issues. Or the design point could be so central to filesystem design that it can't be changed. In general, if you do throttling deeper in the stack and build back pressure, then all the layers sitting above should be cgroup aware to avoid problems. Two layers identified so far are writeback and filesystems. Is it really worth the complexity? How about doing throttling in higher layers when IO is entering the kernel, and keeping proportional IO logic at the lowest level, so that the current mechanism of building pressure continues to work? Why split? Proportional IO logic is work conserving, so even if some serialization happens, that situation should clear up pretty soon, as IO from other cgroups will dry up, IO from the group causing serialization will make progress, and at most we will lose fairness for a certain duration. With throttling, limits come from the user, and one can set really low artificial limits. So even if the underlying resources are free, IO from the throttled cgroup might not make any progress, in turn choking every other cgroup which is serialized behind it. So in general, throttling at the block layer and building back pressure is fine. I am concerned about two cases. - How to handle NFS. - Do filesystem developers agree with this approach and are they willing to address any serialization issues arising due to this design? Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 261+ messages in thread
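Vivek's contrast above - work-conserving proportional IO versus user-set throttling limits - can be illustrated with a toy sketch. This is plain Python with invented names, not blk-throttle or CFQ code; the token bucket stands in for an absolute bps limit, the proportional split for weight-based sharing:

```python
class TokenBucket:
    """Toy absolute-limit throttler (non-work-conserving): tokens accrue
    at `rate` units/sec up to `burst`; IO waits once tokens run out,
    even if the underlying device is completely idle."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.t = burst, 0.0

    def admit(self, nbytes, now):
        # Refill based on elapsed time, then try to spend.
        self.tokens = min(self.burst, self.tokens + (now - self.t) * self.rate)
        self.t = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False


def proportional_share(weights, capacity):
    """Toy work-conserving split: capacity is divided by weight among
    the groups that are actually competing, so bandwidth is never left
    unused just because one group went quiet."""
    total = sum(weights.values())
    return {g: capacity * w / total for g, w in weights.items()}
```

The difference is the crux of his concern: under the token bucket, a group with an artificially low limit stalls regardless of free capacity (and anything serialized behind it stalls too), while the proportional split clears up on its own once the competing IO dries up.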
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-04 15:36 ` Steve French @ 2012-04-04 15:36 ` Steve French 2012-04-07 8:00 ` Jan Kara 2 siblings, 0 replies; 261+ messages in thread From: Steve French @ 2012-04-04 15:36 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, Apr 4, 2012 at 9:51 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > > Hi Tejun, > > Thanks for the RFC and looking into this issue. Few thoughts inline. > > [..] >> IIUC, without cgroup, the current writeback code works more or less >> like this. Throwing in cgroup doesn't really change the fundamental >> design. Instead of a single pipe going down, we just have multiple >> pipes to the same device, each of which should be treated separately. >> Of course, a spinning disk can't be divided that easily and their >> performance characteristics will be inter-dependent, but the place to >> solve that problem is where the problem is, the block layer. > > How do you take care of throttling IO in the NFS case in this model? The current > throttling logic is tied to the block device and in the case of NFS, there is no > block device. Similarly, smb2 gets congestion info (a number of "credits") returned from the server on every response - but I am not sure why congestion control is tied to the block device when this would create problems for network file systems. -- Thanks, Steve ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-04 15:36 ` Steve French @ 2012-04-04 18:56 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-04 18:56 UTC (permalink / raw) To: Steve French Cc: Vivek Goyal, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote: > > How do you take care of throttling IO in the NFS case in this model? The current > > throttling logic is tied to the block device and in the case of NFS, there is no > > block device. > > Similarly smb2 gets congestion info (number of "credits") returned from > the server on every response - but not sure why congestion > control is tied to the block device when this would create > problems for network file systems I hope the previous replies answered this. It's about writeback getting pressure from the bdi and isn't restricted to block devices. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
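Tejun's point - that the pressure writeback reacts to comes from the bdi, not from a block device as such - can be sketched like this. The Python is a toy model and the credit-to-congestion mapping is hypothetical glue invented for illustration, not existing cifs or NFS code:

```python
class Bdi:
    """Toy backing_dev_info congestion flag.  Any backing store - a block
    device queue, or a network filesystem translating server-granted
    'credits' into pressure - can set it, and writeback backs off in the
    same way regardless of what sits underneath."""

    def __init__(self):
        self.congested = False


def smb_update_credits(bdi, credits, low_water=2):
    # Hypothetical glue: when the server grants few credits, mark the
    # bdi congested so writeback slows down; clear it when credits return.
    bdi.congested = credits < low_water


def writeback_should_throttle(bdi):
    # Writeback only ever consults the bdi, never a block device directly.
    return bdi.congested
```

The design point is that writeback's input is one generic signal per bdi, so NFS and smb2 can participate in the same back-pressure chain without having a block device at all.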
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-04 18:56 ` Tejun Heo @ 2012-04-04 19:19 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-04 19:19 UTC (permalink / raw) To: Tejun Heo Cc: Steve French, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, Apr 04, 2012 at 11:56:05AM -0700, Tejun Heo wrote: > On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote: > > > How do you take care of throttling IO in the NFS case in this model? The current > > > throttling logic is tied to the block device and in the case of NFS, there is no > > > block device. > > > > Similarly smb2 gets congestion info (number of "credits") returned from > > the server on every response - but not sure why congestion > > control is tied to the block device when this would create > > problems for network file systems > > I hope the previous replies answered this. It's about writeback > getting pressure from bdi and isn't restricted to block devices. So the controlling knobs for network filesystems will be very different, as the current throttling knobs are per device (and not per bdi). So presumably there will be some throttling logic in the network layer (network tc), and that should communicate the back pressure. I have tried limiting network traffic on NFS using the network controller and tc but that did not help for a variety of reasons. - We again have the problem of losing the submitter's context down the layers. - We have interesting TCP/IP sequencing issues. I don't have the details, but if you throttle traffic from one group, it seemed to lead to multiple re-transmissions from the server due to some sequence number issues. Sorry, I am short on details, as it was long back, and the nfs guys told me that pNFS might help here.
The basic problem seemed to be that if you multiplex traffic from all cgroups on a single tcp/ip session and then choke IO suddenly from one of them, that was leading to some sequence number issues and really sucky performance. So something to keep in mind while coming up with ways to implement throttling for network file systems. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
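The multiplexing problem Vivek describes resembles head-of-line blocking on a single ordered stream. A toy model (illustrative only, not an NFS or TCP implementation) makes the contrast with per-cgroup sessions concrete:

```python
def drain_shared_session(segments, blocked):
    """Toy head-of-line blocking on one shared, in-order session:
    segments are a list of (cgroup, payload) pairs delivered strictly in
    order, so a segment from a throttled cgroup stalls everything queued
    behind it, whatever cgroup that later traffic belongs to."""
    delivered = []
    for group, seg in segments:
        if group in blocked:
            break                 # the ordered stream stalls here
        delivered.append(seg)
    return delivered


def drain_per_group_sessions(segments, blocked):
    """Per-cgroup sessions: choking one group leaves the others flowing,
    since each group's ordering constraint is private to its own stream."""
    return [seg for group, seg in segments if group not in blocked]
```

This is only a cartoon of the failure mode, but it shows why suddenly choking one cgroup on a shared session can degrade everyone, and why separating the streams (as pNFS-style layouts might allow) avoids it.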
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120404191918.GK12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-25 8:47 ` Suresh Jayaraman @ 2012-04-25 8:47 ` Suresh Jayaraman 0 siblings, 0 replies; 261+ messages in thread From: Suresh Jayaraman @ 2012-04-25 8:47 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Steve French, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On 04/05/2012 12:49 AM, Vivek Goyal wrote: > On Wed, Apr 04, 2012 at 11:56:05AM -0700, Tejun Heo wrote: >> On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote: >>>> How do you take care of throttling IO in the NFS case in this model? The current >>>> throttling logic is tied to the block device and in the case of NFS, there is no >>>> block device. >>> >>> Similarly smb2 gets congestion info (number of "credits") returned from >>> the server on every response - but not sure why congestion >>> control is tied to the block device when this would create >>> problems for network file systems >> >> I hope the previous replies answered this. It's about writeback >> getting pressure from bdi and isn't restricted to block devices. > > So the controlling knobs for network filesystems will be very different > as current throttling knobs are per device (and not per bdi). So > presumably there will be some throttling logic in network layer (network > tc), and that should communicate the back pressure. Tried to figure out potential use-case scenarios for controlling the Network I/O resource from a netfs POV (which ideally should guide the interfaces). - Is finer-grained control of network I/O desirable/useful, or is being able to control bandwidth at the per-server level sufficient? Consider the case where there are different NFS volumes mounted from the same NFS/CIFS server, /backup /missioncritical_data /apps /documents The admin being able to set bandwidth limits on each of these mounts based on how important each is would be a useful feature.
If we try to build the logic in the network layer using tc, then it still wouldn't be possible to limit tasks that are writing to more than one volume? (need some logic in the netfs as well?). Network filesystem clients typically are not bothered much about the actual device but about the exported share. So it appears that the controlling knobs could be different for a netfs. - Provide minimum guarantees for Network I/O to keep going irrespective of overloaded workload situations, i.e. operations that are local to the machine should not hamper Network I/O, and operations that are happening on one mount should not impact operations that are happening on another mount. IIRC, while we currently would be able to limit maximum usage, we don't guarantee the minimum quantity of the resource that would be available in general for all controllers. This might be important from a QoS guarantee POV. - What are the other use-cases where limiting Network I/O would be useful? > I have tried limiting network traffic on NFS using network controller > and tc but that did not help for variety of reasons. > A quick look at the current net_tls implementation shows that it allows setting priorities but doesn't seem to provide ways to limit throughput? Or is it still possible? If not, did you use an out-of-tree implementation to test this? > - We again have the problem of losing submitter's context down the layer. If the network layer is cgroup aware, why would this be a problem? > - We have interesting TCP/IP sequencing issues. I don't have the details > but if you throttle traffic from one group, it kind of led to some > kind of multiple re-transmissions from server for ack due to some > sequence number issues.
> > The basic problem seemed to that that if you multiplex traffic from > all cgroups on single tcp/ip session and then choke IO suddenly from > one of them, that was leading to some sequence number issues and led > to really sucky performance. > > So something to keep in mind while coming up ways for how to implement > throttling for network file systems. > Thanks Suresh ^ permalink raw reply [flat|nested] 261+ messages in thread
[parent not found: <20120404191918.GK12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120404191918.GK12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-25 8:47 ` Suresh Jayaraman 0 siblings, 0 replies; 261+ messages in thread From: Suresh Jayaraman @ 2012-04-25 8:47 UTC (permalink / raw) To: Vivek Goyal Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Steve French, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On 04/05/2012 12:49 AM, Vivek Goyal wrote: > On Wed, Apr 04, 2012 at 11:56:05AM -0700, Tejun Heo wrote: >> On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote: >>>> How do you take care of thorottling IO to NFS case in this model? Current >>>> throttling logic is tied to block device and in case of NFS, there is no >>>> block device. >>> >>> Similarly smb2 gets congestion info (number of "credits") returned from >>> the server on every response - but not sure why congestion >>> control is tied to the block device when this would create >>> problems for network file systems >> >> I hope the previous replies answered this. It's about writeback >> getting pressure from bdi and isn't restricted to block devices. > > So the controlling knobs for network filesystems will be very different > as current throttling knobs are per device (and not per bdi). So > presumably there will be some throttling logic in network layer (network > tc), and that should communicate the back pressure. Tried to figure out potential use-case scenarios for controlling Network I/O resource from netfs POV (which ideally should guide the interfaces). - Is finer grained control of network I/O is desirable/useful or being able to control bandwidth at per server level is sufficient? 
Consider the case where there are different NFS volumes mounted from the same NFS/CIFS server: /backup /missioncritical_data /apps /documents An admin being able to set bandwidth limits on each of these mounts based on how important they are would be a useful feature. If we try to build the logic in the network layer using tc, it still wouldn't be possible to limit tasks that are writing to more than one volume (we'd need some logic in the netfs as well?). Network filesystem clients typically are not bothered much about the actual device but about the exported share. So it appears that the controlling knobs could be different for a netfs. - Provide minimum guarantees for Network I/O to keep going irrespective of overloaded workload situations, i.e. operations that are local to the machine should not hamper Network I/O, and operations happening on one mount should not impact operations happening on another mount. IIRC, while we currently would be able to limit maximum usage, we don't guarantee the minimum quantity of the resource that would be available in general for all controllers. This might be important from a QoS guarantee POV. - What are the other use-cases where limiting Network I/O would be useful? > I have tried limiting network traffic on NFS using the network controller > and tc but that did not help for a variety of reasons. > A quick look at the current net_tls implementation shows that it allows setting priorities but doesn't seem to provide a way to limit throughput. Or is it still possible? If not, did you use an out-of-tree implementation to test this? > - We again have the problem of losing the submitter's context down the layer. If the network layer is cgroup aware, why would this be a problem? > - We have interesting TCP/IP sequencing issues. I don't have the details > but if you throttle traffic from one group, it kind of led to some > kind of multiple re-transmissions from the server for acks due to some > sequence number issues.
Sorry, I am short on details as it was long back > and the nfs guys told me that pNFS might help here. > > The basic problem seemed to be that if you multiplex traffic from > all cgroups on a single tcp/ip session and then choke IO suddenly from > one of them, that was leading to some sequence number issues and led > to really sucky performance. > > So something to keep in mind while coming up with ways to implement > throttling for network file systems. > Thanks Suresh ^ permalink raw reply [flat|nested] 261+ messages in thread
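The per-mount limits Suresh sketches above could, conceptually, hang one rate limiter off each exported share rather than off a block device. A toy Python sketch of that idea (the mount names come from the mail above; the `TokenBucket` helper, rates, and `submit` function are purely illustrative, not an existing kernel or tc interface):

```python
class TokenBucket:
    """Toy token-bucket limiter; time is advanced explicitly for clarity."""
    def __init__(self, rate_bps, burst):
        self.rate = rate_bps      # refill rate in bytes/sec
        self.burst = burst        # maximum accumulated tokens
        self.tokens = burst
    def advance(self, seconds):
        # Refill tokens for elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + seconds * self.rate)
    def try_send(self, nbytes):
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False              # the caller would block / back off here

# One limiter per exported share, not per block device: a netfs client
# cares about the share it writes to, so the knobs hang off the mount.
limits = {
    "/backup":               TokenBucket(rate_bps=1 << 20, burst=1 << 20),
    "/missioncritical_data": TokenBucket(rate_bps=8 << 20, burst=8 << 20),
}

def submit(mount, nbytes):
    """Admit a write to a share if its bucket has capacity."""
    return limits[mount].try_send(nbytes)
```

The point of the sketch is only that capping /backup does not touch /missioncritical_data, which is the per-share behavior being asked for; whether the real knobs live in the netfs, tc, or a cgroup controller is exactly the open question in the thread.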
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120404185605.GC29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org> @ 2012-04-04 19:19 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-04 19:19 UTC (permalink / raw) To: Tejun Heo Cc: ctalbott, rni, andrea, linux-mm, containers, linux-kernel, lsf, Steve French, jmoyer, linux-fsdevel, cgroups On Wed, Apr 04, 2012 at 11:56:05AM -0700, Tejun Heo wrote: > On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote: > > > How do you take care of throttling IO in the NFS case in this model? Current > > > throttling logic is tied to the block device and in the case of NFS, there is no > > > block device. > > > > Similarly smb2 gets congestion info (number of "credits") returned from > > the server on every response - but not sure why congestion > > control is tied to the block device when this would create > > problems for network file systems > > I hope the previous replies answered this. It's about writeback > getting pressure from the bdi and isn't restricted to block devices. So the controlling knobs for network filesystems will be very different, as the current throttling knobs are per device (and not per bdi). So presumably there will be some throttling logic in the network layer (network tc), and that should communicate the back pressure. I have tried limiting network traffic on NFS using the network controller and tc but that did not help for a variety of reasons. - We again have the problem of losing the submitter's context down the layer. - We have interesting TCP/IP sequencing issues.
I don't have the details but if you throttle traffic from one group, it kind of led to some kind of multiple re-transmissions from the server for acks due to some sequence number issues. Sorry, I am short on details as it was long back and the nfs guys told me that pNFS might help here. The basic problem seemed to be that if you multiplex traffic from all cgroups on a single tcp/ip session and then choke IO suddenly from one of them, that was leading to some sequence number issues and led to really sucky performance. So something to keep in mind while coming up with ways to implement throttling for network file systems. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
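The multiplexing problem Vivek recalls can be illustrated with a toy model of an in-order session shared by several cgroups (the cgroup names and the `deliver` helper are invented for illustration): because delivery on one TCP stream is strictly in order, choking one cgroup stalls everyone queued behind it, which is consistent with the re-transmission and sequence-number misbehavior he saw.

```python
from collections import deque

def deliver(session, throttled):
    """Drain an in-order session queue; stop at the first held packet.

    `session` is a FIFO of (cgroup, payload) tuples sharing one TCP
    stream; `throttled` is the set of currently choked cgroups.  One
    throttled cgroup at the head stalls every other cgroup behind it
    (head-of-line blocking).
    """
    out = []
    while session and session[0][0] not in throttled:
        out.append(session.popleft())
    return out

# Two cgroups, A and B, interleaved on the same session.
session = deque([("A", "a1"), ("B", "b1"), ("A", "a2"), ("B", "b2")])
```

This is also why per-session (e.g. pNFS, one session per group) rather than per-packet throttling was suggested in the thread: with separate streams, holding A's stream back would not stall B.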
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <CAH2r5mtwQa0Uu=_Yd2JywVJXA=OMGV43X_OUfziC-yeVy9BGtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2012-04-04 18:56 ` Tejun Heo 0 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-04 18:56 UTC (permalink / raw) To: Steve French Cc: ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, Vivek Goyal On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote: > > How do you take care of throttling IO in the NFS case in this model? Current > > throttling logic is tied to the block device and in the case of NFS, there is no > > block device. > > Similarly smb2 gets congestion info (number of "credits") returned from > the server on every response - but not sure why congestion > control is tied to the block device when this would create > problems for network file systems I hope the previous replies answered this. It's about writeback getting pressure from the bdi and isn't restricted to block devices. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-04 15:36 ` Steve French @ 2012-04-04 18:49 ` Tejun Heo 2012-04-07 8:00 ` Jan Kara 2 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-04 18:49 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hey, Vivek. On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote: > On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > > IIUC, without cgroup, the current writeback code works more or less > > like this. Throwing in cgroup doesn't really change the fundamental > > design. Instead of a single pipe going down, we just have multiple > > pipes to the same device, each of which should be treated separately. > > Of course, a spinning disk can't be divided that easily and their > > performance characteristics will be inter-dependent, but the place to > > solve that problem is where the problem is, the block layer. > > How do you take care of throttling IO in the NFS case in this model? Current > throttling logic is tied to the block device and in the case of NFS, there is no > block device. On principle, I don't think it has to be any different. A filesystem's interface to the underlying device is through the bdi. If a fs is block backed, block pressure should be propagated through the bdi, which should be mostly trivial. If a fs is network backed, we can implement a mechanism for network backed bdis, so that they can relay the pressure from the server side to the local fs users. That said, network filesystems often show different behaviors and use different mechanisms for various reasons and it wouldn't be too surprising if something different would fit them better here or we might need something supplemental to the usual mechanism. > [..]
> > In the discussion, for such an implementation, the following obstacles > > were identified. > > > > * There are a lot of cases where IOs are issued by a task which isn't > > the originator. ie. Writeback issues IOs for pages which are > > dirtied by some other tasks. So, by the time an IO reaches the > > block layer, we don't know which cgroup the IO belongs to. > > > > Recently, the block layer has grown support to attach a task to a bio > > which causes the bio to be handled as if it were issued by the > > associated task regardless of the actual issuing task. It currently > > only allows attaching %current to a bio - bio_associate_current() - > > but changing it to support other tasks is trivial. > > > > We'll need to update the async issuers to tag the IOs they issue but > > the mechanism is already there. > > Most likely this tagging will take place in "struct page" and I am not > sure if we will be allowed to grow the size of "struct page" for this reason. With memcg enabled, we are already doing that and IIUC Jan and Fengguang think that using inode granularity should be good enough for writeback blaming. > > * There's a single request pool shared by all issuers per request > > queue. This can lead to priority inversion among cgroups. Note > > that the problem also exists without cgroups. A lower-ioprio issuer may > > be holding a request, holding back a high-prio issuer. > > > > We'll need to make request allocation cgroup (and hopefully ioprio) > > aware. Probably in the form of separate request pools. This will > > take some work but I don't think this will be too challenging. I'll > > work on it. > > This should be doable. I had implemented it long back with a single request > pool but internal limits for each group. That is, block the task in the > group if the group has enough pending requests allocated from the pool. But > a separate request pool should work equally well. > > Just that it conflicts a bit with the current definition of q->nr_requests.
> Which specifies the number of total outstanding requests on the queue. Once > you make the pool per queue, I guess this limit will have to be > transformed into a per-group upper limit. I'm not sure about the details yet. I *think* the suckiest part is the actual allocation part. We're deferring cgroup - request_queue association until actual usage and depending on atomic allocations to create those associations on the IO path. Doing the same for requests might not be too pleasant. Hmm.... allocation failure handling on that path is already broken BTW. Maybe we just need to get the fallback behavior properly working. Unsure. > > * cfq cgroup policy throws all async IOs, which all buffered writes > > are, into the shared cgroup regardless of the actual cgroup. This > > behavior is, I believe, mostly historical and changing it isn't > > difficult. Prolly only a few tens of lines of changes. This may > > cause significant changes to actual IO behavior with cgroups tho. I > > personally think the previous behavior was too wrong to keep (the > > weight was completely ignored for buffered writes) but we may want > > to introduce a switch to toggle between the two behaviors. > > I had kept all buffered writes in the same cgroup (root cgroup) for a few > reasons. > > - Because of the single request descriptor pool for writes, anyway one writer > gets backlogged behind the other. So creating separate async queues per > group is not going to help. > > - Writeback logic was not cgroup aware. So it might not send enough IO > from each writer to maintain parallelism. So creating separate async > queues did not make sense till that was fixed. Yeah, the above are why I find the "buffered writes need separate controls because cfq doesn't distinguish async writes" argument very ironic. We introduce one quirk to compensate for shortages in the other part and then later we work that around in that other part for that quirk? I mean, that's just twisted. > - As you said, it is historical also.
We prioritize READS at the expense > of writes. Now by putting buffered/async writes in a separate group, we > might end up prioritizing a group's async write over another group's > synchronous read. How many people really want that behavior? To me > keeping service differentiation among the sync IO matters most. Even > if all async IO is treated the same, I guess not many people might care. While segregation of async IOs may not matter in some cases, it does matter to many other use cases, so it seems to me that we hard coded that optimization decision without thinking too much about it. For a lot of container type use cases, the current implementation is nearly useless (I know of cases where people are explicitly patching for separate async queues). At the same time, switching the default behavior *may* disturb some of the current users and that's why I'm thinking about having a switch for the new behavior. > > Note that blk-throttle doesn't have this problem. > > I am not sure what you are trying to say here. But primarily blk-throttle > will throttle read and direct IO. Buffered writes will go to the root cgroup > which is typically unthrottled. Ooh, my bad then. Anyways, then the same applies to blk-throttle. Our current implementation essentially collapses in the face of a write-heavy workload. > > * Unlike dirty data pages, metadata tends to have strict ordering > > requirements and thus is susceptible to priority inversion. Two > > solutions were suggested - 1. allow overdraw for metadata writes so > > that low prio metadata writes don't block the whole FS, 2. provide > > an interface to query and wait for bdi-cgroup congestion which can > > be called from FS metadata paths to throttle metadata operations > > before they enter the stream of ordered operations. > > So that probably will mean changing the order of operations also.
IIUC, > in the case of fsync (ordered mode), we opened a metadata transaction first, > then tried to flush all the cached data and then flush metadata. So if > fsync is throttled, all the metadata operations behind it will get > serialized for ext3/ext4. > > So you seem to be suggesting that we change the design so that metadata > operations are not thrown into the ordered stream till we have finished > writing all the data back to disk? I am not a filesystem developer, so > I don't know how feasible this change is. Jan explained it to me and I don't think it requires extensive changes to the filesystem. IIUC, filesystems would just block tasks creating journal entries while their matching bdi is congested and that's the extent of the necessary change. > This is just one of the points. In the past while talking to Dave Chinner, > he mentioned that in XFS, if two cgroups fall into the same allocation group > then there were cases where the IO of one cgroup can get serialized behind > the other. > > In general, the core of the issue is that filesystems are not cgroup aware > and if you do throttling below filesystems, then invariably one or another > serialization issue will come up and I am concerned that we will be constantly > fixing those serialization issues. Or the design point could be so central > to filesystem design that it can't be changed. So, the idea is to avoid allowing any congested cgroup to enter the serialized journal. As there's a time gap until journal commit, the bdi might be congested by commit time. In that case, META writes get to overdraw cgroup limits to avoid causing priority inversion. I think we should be able to get most of this working with a bdi congestion check at the front and limit bypass for META for now. We can worry about overdrawing later. > In general, if you do throttling deeper in the stack and build back > pressure, then all the layers sitting above should be cgroup aware > to avoid problems. Two layers identified so far are writeback and > filesystems.
Is it really worth the complexity? How about doing > throttling in higher layers when IO is entering the kernel and > keeping the proportional IO logic at the lowest level, and the current mechanism > of building pressure continues to work? First, I just don't think it's the right design. It's a rather abstract statement but I want to emphasize that having the "right" design, in the sense that we look at the overall picture and put configs, controls and other logic where they belong to in the structure that their roles point to, tends to make long-term development and maintenance much easier in ways which may not be immediately foreseeable, for both technical and social reasons - logical structuring and layering keep us sane and make newcomers' lives at least bearable. Secondly, I don't think it'll be a lot of added complexity. We *need* to fix all the said shortcomings in the block layer for proper cgroup support anyway, right? Propagating that support upwards doesn't take too much code. Other than the metadata thing, it mostly just requires updates to the writeback code such that it deals with bdi-cgroup combinations instead of individual cgroups. They'll surely require some adjustments but we're not gonna be burdening the main paths with cgroup awareness. cgroup support would just make the existing implementation work on finer grained domains. Thirdly, I don't see how writeback can control all the IOs. I mean, what about reads or direct IOs? It's not like IO devices have separate channels for those different types of IOs. They interact heavily. Let's say we have iops/bps limitation applied on top of proportional IO distribution or a device holds two partitions and one of them is being used for direct IO w/o filesystems. How would that work? I think the question goes even deeper, what do the separate limits even mean? Does the IO sched have to calculate the allocation of IO resource to different types of IOs and then give a "number" to writeback which in turn enforces that limit?
How does the elevator know what number to give? Is the number iops or bps or weight? If the iosched doesn't know how much write workload exists, how does it distribute the surplus buffered writeback resource across different cgroups? If so, what makes the limit actually enforceable (due to inaccuracies in estimation, fluctuation in workload, delay in enforcement in different layers and whatnot) except for the block layer applying the limit *again* on the resulting stream of combined IOs? Fourthly, having clear layering usually means much more flexibility. The assumptions about IO characteristics that we have are still mostly based on devices with spindles, probably because they're still causing the most amount of pain. The assumptions keep changing and if we get the layering correct, we can mostly deal with changes at the layers concerning them - ie. in the block layer. Maybe we'll have a different iosched or cfq can be evolved to cover the new cases, but the required adaptation would be logical and while upper layers might need some adjustments they wouldn't need any major overhaul. They'll still be working from back pressure from IO. So, the above are the reasons why I don't like the idea of splitting IO control across multiple layers, well, the ones that I can think of right now anyway. I'm currently feeling rather strongly about them in the sense of "oh no, this is about to be messed up" but maybe I'm just not seeing what Fengguang is seeing. I'll keep discussing there. > So in general throttling at the block layer and building back pressure is > fine. I am concerned about two cases. > > - How to handle NFS. As said above, maybe through network based bdi pressure propagation, maybe some other special-case mechanism. Unsure, but I don't think this concern should dictate the whole design. > - Do filesystem developers agree with this approach and are they willing > to address any serialization issues arising due to this design. Jan, can you please fill in?
Did I understand it correctly? Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
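Tejun's "treat each cgroup-bdi pair as a separate device" model can be sketched as a toy simulation (the names, weights, and page accounting below are all illustrative; the real writeback code is far more involved): each pair gets its own dirty threshold derived from its weight, a dirtier is throttled once its own pair is over threshold, and the flusher drains pairs in proportion to weight, so back pressure from the device reaches each originator independently.

```python
DIRTY_LIMIT = 100   # global dirty-page budget for the whole bdi

class CgroupBdi:
    """One cgroup-bdi pair, treated as its own little 'device'."""
    def __init__(self, name, weight):
        self.name, self.weight, self.dirty = name, weight, 0

def threshold(pair, pairs):
    # Each pair's share of the global limit, by blkcg-style weight.
    return DIRTY_LIMIT * pair.weight // sum(p.weight for p in pairs)

def dirty_pages(pair, pairs, pages):
    """Dirty up to `pages`; the remainder is where the task would block."""
    room = max(0, threshold(pair, pairs) - pair.dirty)
    done = min(pages, room)
    pair.dirty += done
    return pages - done     # back pressure reaches the originator here

def flush(pairs, device_budget):
    # The block layer divides the device's bandwidth between pairs.
    total = sum(p.weight for p in pairs)
    for p in pairs:
        p.dirty -= min(p.dirty, device_budget * p.weight // total)

pairs = [CgroupBdi("heavy", weight=3), CgroupBdi("light", weight=1)]
```

Note what the model deliberately does not do: writeback never sees iops/bps knobs or elevator internals; it only sees how much room its own cgroup-bdi pair has, which is exactly the layering Tejun is arguing for.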
* Re: [RFC] writeback and cgroup @ 2012-04-04 18:49 ` Tejun Heo 0 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-04 18:49 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hey, Vivek. On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote: > On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > > IIUC, without cgroup, the current writeback code works more or less > > like this. Throwing in cgroup doesn't really change the fundamental > > design. Instead of a single pipe going down, we just have multiple > > pipes to the same device, each of which should be treated separately. > > Of course, a spinning disk can't be divided that easily and their > > performance characteristics will be inter-dependent, but the place to > > solve that problem is where the problem is, the block layer. > > How do you take care of throttling IO to the NFS case in this model? Current > throttling logic is tied to a block device and in the case of NFS, there is no > block device. In principle, I don't think it has to be any different. A filesystem's interface to the underlying device is through its bdi. If a fs is block backed, block pressure should be propagated through the bdi, which should be mostly trivial. If a fs is network backed, we can implement a mechanism for network backed bdis, so that they can relay the pressure from the server side to the local fs users.
That said, network filesystems often show different behaviors and use different mechanisms for various reasons and it wouldn't be too surprising if something different would fit them better here or we might need something supplemental to the usual mechanism. > [..] > > In the discussion, for such implementation, the following obstacles > > were identified. > > > > * There are a lot of cases where IOs are issued by a task which isn't > > the originator, ie. writeback issues IOs for pages which are > > dirtied by some other tasks. So, by the time an IO reaches the > > block layer, we don't know which cgroup the IO belongs to. > > > > Recently, the block layer has grown support to attach a task to a bio, > > which causes the bio to be handled as if it were issued by the > > associated task regardless of the actual issuing task. It currently > > only allows attaching %current to a bio - bio_associate_current() - > > but changing it to support other tasks is trivial. > > > > We'll need to update the async issuers to tag the IOs they issue but > > the mechanism is already there. > > Most likely this tagging will take place in "struct page" and I am not > sure if we will be allowed to grow the size of "struct page" for this reason. With memcg enabled, we are already doing that and IIUC Jan and Fengguang think that using inode granularity should be good enough for writeback blaming. > > * There's a single request pool shared by all issuers per request > > queue. This can lead to priority inversion among cgroups. Note > > that the problem also exists without cgroups. A lower-ioprio issuer may > > be holding a request, holding back a high-prio issuer. > > > > We'll need to make request allocation cgroup (and hopefully ioprio) > > aware. Probably in the form of separate request pools. This will > > take some work but I don't think this will be too challenging. I'll > > work on it. > > This should be doable.
I had implemented it long back with a single request > pool but internal limits for each group. That is, block the task in the > group if the group has enough pending requests allocated from the pool. But > a separate request pool should work equally well. > > Just that it conflicts a bit with the current definition of q->nr_requests, > which specifies the number of total outstanding requests on the queue. Once > you make the pool per queue, I guess this limit will have to be > transformed into a per-group upper limit. I'm not sure about the details yet. I *think* the suckiest part is the actual allocation part. We're deferring cgroup - request_queue association until actual usage and depending on atomic allocations to create those associations on the IO path. Doing the same for requests might not be too pleasant. Hmm.... allocation failure handling on that path is already broken BTW. Maybe we just need to get the fallback behavior properly working. Unsure. > > * cfq cgroup policy throws all async IOs, which all buffered writes > > are, into the shared cgroup regardless of the actual cgroup. This > > behavior is, I believe, mostly historical and changing it isn't > > difficult. Probably only a few tens of lines of changes. This may > > cause significant changes to actual IO behavior with cgroups, though. I > > personally think the previous behavior was too wrong to keep (the > > weight was completely ignored for buffered writes) but we may want > > to introduce a switch to toggle between the two behaviors. > > I had kept all buffered writes in the same cgroup (root cgroup) for a few > reasons. > > - Because of the single request descriptor pool for writes, one writer > gets backlogged behind another anyway. So creating separate async queues per > group is not going to help. > > - Writeback logic was not cgroup aware. So it might not send enough IO > from each writer to maintain parallelism. So creating separate async > queues did not make sense till that was fixed.
Yeah, the above are why I find the "buffered writes need separate controls because cfq doesn't distinguish async writes" argument very ironic. We introduce one quirk to compensate for shortages in the other part and then later we work that around in that other part for that quirk? I mean, that's just twisted. > - As you said, it is historical also. We prioritize READS at the expense > of writes. Now by putting buffered/async writes in a separate group, we > might end up prioritizing a group's async write over another group's > synchronous read. How many people really want that behavior? To me > keeping service differentiation among the sync IO matters most. Even > if all async IO is treated the same, I guess not many people might care. While segregation of async IOs may not matter in some cases, it does matter to many other use cases, so it seems to me that we hard-coded that optimization decision without thinking too much about it. For a lot of container-type use cases, the current implementation is nearly useless (I know of cases where people are explicitly patching for separate async queues). At the same time, switching the default behavior *may* disturb some of the current users and that's why I'm thinking about having a switch for the new behavior. > > Note that blk-throttle doesn't have this problem. > I am not sure what you are trying to say here. But primarily blk-throttle > will throttle read and direct IO. Buffered writes will go to the root cgroup > which is typically unthrottled. Ooh, my bad then. Anyways, then the same applies to blk-throttle. Our current implementation essentially collapses in the face of a write-heavy workload. > > * Unlike dirty data pages, metadata tends to have strict ordering > > requirements and thus is susceptible to priority inversion. Two > > solutions were suggested - 1. allow overdrawal for metadata writes so > > that low-prio metadata writes don't block the whole FS, 2.
provide > > an interface to query and wait for bdi-cgroup congestion which can > > be called from FS metadata paths to throttle metadata operations > > before they enter the stream of ordered operations. > > So that probably will mean changing the order of operations also. IIUC, > in the case of fsync (ordered mode), we opened a metadata transaction first, > then tried to flush all the cached data and then flush the metadata. So if > fsync is throttled, all the metadata operations behind it will get > serialized for ext3/ext4. > > So you seem to be suggesting that we change the design so that the metadata > operation is not thrown into the ordered stream till we have finished > writing all the data back to disk? I am not a filesystem developer, so > I don't know how feasible this change is. Jan explained it to me and I don't think it requires extensive changes to the filesystem. IIUC, filesystems would just block tasks creating a journal entry while their matching bdi is congested, and that's the extent of the necessary change. > This is just one of the points. In the past while talking to Dave Chinner, > he mentioned that in XFS, if two cgroups fall into the same allocation group > then there were cases where IO of one cgroup can get serialized behind the > other. > > In general, the core of the issue is that filesystems are not cgroup aware > and if you do throttling below filesystems, then invariably one or another > serialization issue will come up and I am concerned that we will be constantly > fixing those serialization issues. Or the design point could be so central > to filesystem design that it can't be changed. So, the idea is to avoid allowing any congested cgroup to enter the serialized journal. As there's a time gap until journal commit, the bdi might be congested by the commit time. In that case, META writes get to overdraw cgroup limits to avoid causing priority inversion.
I think we should be able to get most of this working with a bdi congestion check at the front and limit bypass for META for now. We can worry about overdrawing later. > In general, if you do throttling deeper in the stack and build back > pressure, then all the layers sitting above should be cgroup aware > to avoid problems. Two layers identified so far are writeback and > filesystems. Is it really worth the complexity? How about doing > throttling in higher layers when IO is entering the kernel and > keeping proportional IO logic at the lowest level, so that the current mechanism > of building pressure continues to work? First, I just don't think it's the right design. It's a rather abstract statement, but I want to emphasize that having the "right" design - in the sense that we look at the overall picture and put configs, controls and other logic where they belong in the structure - tends to make long-term development and maintenance much easier in ways which may not be immediately foreseeable, for both technical and social reasons; logical structuring and layering keep us sane and make newcomers' lives at least bearable. Secondly, I don't think it'll be a lot of added complexity. We *need* to fix all the said shortcomings in the block layer for proper cgroup support anyway, right? Propagating that support upwards doesn't take too much code. Other than the metadata thing, it mostly just requires updates to the writeback code such that it deals with bdi-cgroup combinations instead of individual bdis. They'll surely require some adjustments but we're not gonna be burdening the main paths with cgroup awareness. cgroup support would just make the existing implementation work on finer-grained domains. Thirdly, I don't see how writeback can control all the IOs. I mean, what about reads or direct IOs? It's not like IO devices have separate channels for those different types of IOs. They interact heavily.
Let's say we have iops/bps limitation applied on top of proportional IO distribution, or a device holds two partitions and one of them is being used for direct IO w/o a filesystem. How would that work? I think the question goes even deeper: what do the separate limits even mean? Does the IO sched have to calculate the allocation of IO resource to different types of IOs and then give a "number" to writeback which in turn enforces that limit? How does the elevator know what number to give? Is the number iops or bps or weight? If the iosched doesn't know how much write workload exists, how does it distribute the surplus buffered writeback resource across different cgroups? And what makes the limit actually enforceable (due to inaccuracies in estimation, fluctuation in workload, delay in enforcement in different layers and whatnot) except for the block layer applying the limit *again* on the resulting stream of combined IOs? Fourthly, having clear layering usually means much more flexibility. The assumptions about IO characteristics that we have are still mostly based on devices with spindles, probably because they're still causing the most amount of pain. The assumptions keep changing and if we get the layering correct, we can mostly deal with changes at the layers concerning them - ie. in the block layer. Maybe we'll have a different iosched or cfq can be evolved to cover the new cases, but the required adaptation would be logical, and while upper layers might need some adjustments they wouldn't need any major overhaul. They'll still be working from back pressure from IO. So, the above are the reasons why I don't like the idea of splitting IO control across multiple layers - well, the ones that I can think of right now anyway. I'm currently feeling rather strongly about them in the sense of "oh no, this is about to be messed up" but maybe I'm just not seeing what Fengguang is seeing. I'll keep discussing there.
> So in general throttling at block layer and building back pressure is > fine. I am concerned about two cases. > > - How to handle NFS. As said above, maybe through network-based bdi pressure propagation, maybe some other special-case mechanism. Unsure, but I don't think this concern should dictate the whole design. > - Do filesystem developers agree with this approach and are they willing > to address any serialization issues arising due to this design. Jan, can you please fill in? Did I understand it correctly? Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-04 18:49 ` Tejun Heo @ 2012-04-04 19:23 ` Steve French -1 siblings, 0 replies; 261+ messages in thread From: Steve French @ 2012-04-04 19:23 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, Apr 4, 2012 at 1:49 PM, Tejun Heo <tj@kernel.org> wrote: > Hey, Vivek. > > On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote: >> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: >> > IIUC, without cgroup, the current writeback code works more or less >> > like this. Throwing in cgroup doesn't really change the fundamental >> > design. Instead of a single pipe going down, we just have multiple >> > pipes to the same device, each of which should be treated separately. >> > Of course, a spinning disk can't be divided that easily and their >> > performance characteristics will be inter-dependent, but the place to >> > solve that problem is where the problem is, the block layer. >> >> How do you take care of throttling IO to the NFS case in this model? Current >> throttling logic is tied to a block device and in the case of NFS, there is no >> block device. > > In principle, I don't think it has to be any different. A filesystem's > interface to the underlying device is through bdi. If a fs is block > backed, block pressure should be propagated through the bdi, which should > be mostly trivial. If a fs is network backed, we can implement a > mechanism for network backed bdis, so that they can relay the pressure > from the server side to the local fs users. > > That said, network filesystems often show different behaviors and use > different mechanisms for various reasons and it wouldn't be too > surprising if something different would fit them better here or we > might need something supplemental to the usual mechanism.
For the network file system clients, we may be close already, but I don't know how to allow servers like Samba or Apache to query btrfs, xfs etc. for this information. superblock -> struct backing_dev_info is probably fine as long as we aren't making that structure more block device specific. Current use of bdi is a little hard to understand since there are 25+ fields in the structure. Is their use/purpose written up anywhere? I have a feeling we are under-utilizing what is already there. In any case bdi is "backing" info, not "block" specific info. Since a bdi can be assigned to a superblock and an inode, it seems reasonable for either network or local. Note that it isn't just traditional network file systems (nfs and cifs and smb2) but also virtualization (virtfs) and some special purpose file systems for which block device specific interfaces to higher layers (above the fs) are an awkward way to think about congestion. What about the case of a file system like btrfs that could back a volume with a pool of devices and distribute hot/cold data across multiple physical or logical devices? By the way, there may be less of a problem with current network file system clients due to small limits on simultaneous i/o. Until recently the NFS client had a low default slot count of 16 IIRC, and it was not much better for cifs. The typical cifs server defaulted to allowing a client to only send 50 simultaneous requests to that server at one time ... The cifs protocol allows more (up to 64K) and in 3.4 the client now can send more requests (up to 32K) if the server is so configured. With SMB2, since "credits" are returned on every response, fast servers (e.g. Samba running on a good clustered file system, or a good NAS box) may end up allowing thousands of simultaneous requests if they have the resources to handle this. Unfortunately, the Samba server developers do not know how to query superblock->bdi congestion information from user space.
I vaguely remember bdi debugging info being available in sysfs, but how would an application find out how congested the underlying volume it is exporting is? -- Thanks, Steve ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup @ 2012-04-14 12:15 ` Peter Zijlstra 0 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-14 12:15 UTC (permalink / raw) To: Steve French Cc: ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel On Wed, 2012-04-04 at 14:23 -0500, Steve French wrote: > Current use of bdi is a little hard to understand since > there are 25+ fields in the structure. Filesystems only need a small fraction of those. In particular, backing_dev_info::name -- string backing_dev_info::ra_pages -- number of read-ahead pages backing_dev_info::capability -- see BDI_CAP_* One should properly initialize/destroy the thing using: bdi_init()/bdi_destroy() Furthermore, it has hooks into the regular page-writeback stuff: test_{set,clear}_page_writeback()/bdi_writeout_inc() set_page_dirty()/account_page_dirtied() but also allows filesystems to do custom stuff, see FUSE for example. The only other bit is the pressure valve, aka. {set,clear}_bdi_congested(), which really is rather broken and of dubious value. ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-04 19:23 ` Steve French (?) @ 2012-04-14 12:15 ` Peter Zijlstra -1 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-14 12:15 UTC (permalink / raw) To: Steve French Cc: Tejun Heo, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, 2012-04-04 at 14:23 -0500, Steve French wrote: > Current use of bdi is a little hard to understand since > there are 25+ fields in the structure. Filesystems only need a small fraction of those. In particular, backing_dev_info::name -- string backing_dev_info::ra_pages -- number of read-ahead-pages backing_dev_info::capability -- see BDI_CAP_* One should properly initialize/destroy the thing using: bdi_init()/bdi_destroy() Furthermore, it has hooks into the regular page-writeback stuff: test_{set,clear}_page_writeback()/bdi_writeout_inc() set_page_dirty()/account_page_dirtied() but also allows filesystems to do custom stuff, see FUSE for example. The only other bit is the pressure valve, aka. {set,clear}_bdi_congested(). Which really is rather broken and of dubious value. ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup @ 2012-04-14 12:15 ` Peter Zijlstra 0 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-14 12:15 UTC (permalink / raw) To: Steve French Cc: Tejun Heo, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, 2012-04-04 at 14:23 -0500, Steve French wrote: > Current use of bdi is a little hard to understand since > there are 25+ fields in the structure. Filesystems only need a small fraction of those. In particular, backing_dev_info::name -- string backing_dev_info::ra_pages -- number of read-ahead-pages backing_dev_info::capability -- see BDI_CAP_* One should properly initialize/destroy the thing using: bdi_init()/bdi_destroy() Furthermore, it has hooks into the regular page-writeback stuff: test_{set,clear}_page_writeback()/bdi_writeout_inc() set_page_dirty()/account_page_dirtied() but also allows filesystems to do custom stuff, see FUSE for example. The only other bit is the pressure valve, aka. {set,clear}_bdi_congested(). Which really is rather broken and of dubious value. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup @ 2012-04-14 12:15 ` Peter Zijlstra 0 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-14 12:15 UTC (permalink / raw) To: Steve French Cc: Tejun Heo, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, 2012-04-04 at 14:23 -0500, Steve French wrote: > Current use of bdi is a little hard to understand since > there are 25+ fields in the structure. Filesystems only need a small fraction of those. In particular, backing_dev_info::name -- string backing_dev_info::ra_pages -- number of read-ahead-pages backing_dev_info::capability -- see BDI_CAP_* One should properly initialize/destroy the thing using: bdi_init()/bdi_destroy() Furthermore, it has hooks into the regular page-writeback stuff: test_{set,clear}_page_writeback()/bdi_writeout_inc() set_page_dirty()/account_page_dirtied() but also allows filesystems to do custom stuff, see FUSE for example. The only other bit is the pressure valve, aka. {set,clear}_bdi_congested(). Which really is rather broken and of dubious value. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
[parent not found: <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>]
* Re: [Lsf] [RFC] writeback and cgroup
From: Steve French @ 2012-04-04 19:23 UTC (permalink / raw)
To: Tejun Heo
Cc: ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, Vivek Goyal

On Wed, Apr 4, 2012 at 1:49 PM, Tejun Heo <tj@kernel.org> wrote:
> Hey, Vivek.
>
> On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote:
>> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
>> > IIUC, without cgroup, the current writeback code works more or less
>> > like this. Throwing in cgroup doesn't really change the fundamental
>> > design. Instead of a single pipe going down, we just have multiple
>> > pipes to the same device, each of which should be treated separately.
>> > Of course, a spinning disk can't be divided that easily and their
>> > performance characteristics will be inter-dependent, but the place to
>> > solve that problem is where the problem is, the block layer.
>>
>> How do you take care of throttling IO in the NFS case in this model?
>> Current throttling logic is tied to a block device, and in the case of
>> NFS there is no block device.
>
> On principle, I don't think it has to be any different. A filesystem's
> interface to the underlying device is through bdi. If a fs is block
> backed, block pressure should be propagated through bdi, which should
> be mostly trivial. If a fs is network backed, we can implement a
> mechanism for network backed bdis, so that they can relay the pressure
> from the server side to the local fs users.
>
> That said, network filesystems often show different behaviors and use
> different mechanisms for various reasons, and it wouldn't be too
> surprising if something different would fit them better here, or we
> might need something supplemental to the usual mechanism.

For the network file system clients, we may be close already, but I don't
know how to allow servers like Samba or Apache to query btrfs, xfs etc.
for this information.

superblock -> struct backing_dev_info is probably fine as long as we aren't
making that structure more block device specific. Current use of bdi is a
little hard to understand since there are 25+ fields in the structure. Is
their use/purpose written up anywhere? I have a feeling we are
under-utilizing what is already there. In any case bdi is "backing" info,
not "block" specific info. Since a bdi can be assigned to a superblock and
an inode, it seems reasonable for either network or local.

Note that it isn't just traditional network file systems (nfs, cifs and
smb2) but also virtualization (virtfs) and some special purpose file
systems for which block device specific interfaces to higher layers (above
the fs) are an awkward way to think about congestion. What about the case
of a file system like btrfs that can back a volume with a pool of devices
and distribute hot/cold data across multiple physical or logical devices?

By the way, there may be less of a problem with current network file system
clients due to small limits on simultaneous i/o. Until recently the NFS
client had a low default slot count of 16 IIRC, and it was not much better
for cifs. The typical cifs server defaulted to allowing a client to send
only 50 simultaneous requests at one time. The cifs protocol allows more
(up to 64K), and in 3.4 the client can now send more requests (up to 32K)
if the server is so configured. With SMB2, since "credits" are returned on
every response, fast servers (e.g. Samba running on a good clustered file
system, or a good NAS box) may end up allowing thousands of simultaneous
requests if they have the resources to handle this.

Unfortunately, the Samba server developers do not know how to query
superblock->bdi congestion information from user space. I vaguely remember
bdi debugging info being available in sysfs, but how would an application
find out how congested the underlying volume it is exporting is?

--
Thanks,
Steve

^ permalink raw reply [flat|nested] 261+ messages in thread
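On Steve's question about querying congestion from user space: the per-bdi writeback counters are exposed through debugfs rather than sysfs, if memory serves. A hedged sketch of what a server process could read today, assuming CONFIG_DEBUG_FS is enabled and the exported volume's bdi is registered as 8:0 (both assumptions, not something the thread confirms):

```shell
# Mount debugfs if it is not already mounted.
mount -t debugfs none /sys/kernel/debug 2>/dev/null

# Each registered bdi gets a directory named after its dev_t
# (e.g. 8:0 for /dev/sda); non-block bdis such as NFS mounts
# appear here as well.
ls /sys/kernel/debug/bdi/

# The stats file shows per-bdi writeback pressure; comparing
# BdiWriteback/BdiReclaimable against BdiDirtyThresh is a crude
# "how congested is this volume" signal.
cat /sys/kernel/debug/bdi/8:0/stats
```

This only covers inspection; there is no notification interface here, so an application would have to poll.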
* Re: [RFC] writeback and cgroup
From: Vivek Goyal @ 2012-04-04 20:32 UTC (permalink / raw)
To: Tejun Heo
Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, Fengguang Wu

On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:
[..]
> Thirdly, I don't see how writeback can control all the IOs. I mean,
> what about reads or direct IOs? It's not like IO devices have
> separate channels for those different types of IOs. They interact
> heavily.
>
> Let's say we have iops/bps limitation applied on top of proportional IO
> distribution

We already do that. First, IO is subjected to the throttling limit and only
then passed to the elevator for proportional IO. So throttling is already
stacked on top of proportional IO. The only question is whether it should
be pushed to even higher layers or not.

> or a device holds two partitions and one
> of them is being used for direct IO w/o filesystems. How would that
> work? I think the question goes even deeper, what do the separate
> limits even mean?

Separate limits for buffered writes are just filling the gap. Agreed it is
not a very neat solution.

> Does the IO sched have to calculate allocation of
> IO resource to different types of IOs and then give a "number" to
> writeback which in turn enforces that limit? How does the elevator
> know what number to give? Is the number iops or bps or weight?

If we push all the throttling up into some higher layer, say some kind of
per-bdi throttling interface, then the elevator just has to worry about
doing proportional IO. No interaction with higher layers regarding iops/bps
etc. (Not that the elevator worries about it today.)

> If
> the iosched doesn't know how much write workload exists, how does it
> distribute the surplus buffered writeback resource across different
> cgroups? If so, what makes the limit actually enforceable (due to
> inaccuracies in estimation, fluctuation in workload, delay in
> enforcement in different layers and whatnot) except for block layer
> applying the limit *again* on the resulting stream of combined IOs?

So the split model is definitely confusing. Anyway, the block layer will
not apply the limits again, as flusher IO goes in the root cgroup, which is
generally unthrottled. Or the flusher could mark the bios with a flag
saying "do not throttle these bios again", as they have been throttled
already. So throttling twice is probably not an issue.

In summary, agreed that the split is confusing and that it fills a gap
existing today.

Thanks
Vivek
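The stacking Vivek describes is already visible in the blkio cgroup interface: the throttling limit (enforced before IO reaches the elevator) and the proportional weight (enforced by cfq among contending groups) are configured independently on the same group. A sketch, assuming a v1 blkio hierarchy mounted at /sys/fs/cgroup/blkio and a device numbered 8:16 (both hypothetical for this example):

```shell
cd /sys/fs/cgroup/blkio
mkdir -p grp1

# Layer 1: absolute throttling, applied before IO reaches the elevator.
# Format is "major:minor bytes-per-second"; here, 10 MB/s of writes.
echo "8:16 10485760" > grp1/blkio.throttle.write_bps_device

# Layer 2: proportional weight, applied by cfq among contending
# groups (valid range 100-1000).
echo 500 > grp1/blkio.weight

# Move a task into the group; its reads and direct IO are now subject
# to both layers. Its buffered writes, however, are submitted later by
# the flusher thread in the root group, which is why they currently
# escape grp1's limits.
echo $$ > grp1/tasks
```

This is the gap the thread keeps returning to: the two layers compose for synchronous IO but not for writeback.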
* Re: [RFC] writeback and cgroup
From: Tejun Heo @ 2012-04-05 16:38 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, Fengguang Wu

Hey, Vivek.

On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:
> > I am not sure what you are trying to say here. But primarily blk-throttle
> > will throttle read and direct IO. Buffered writes will go to the root
> > cgroup, which is typically unthrottled.
>
> Ooh, my bad then. Anyways, then the same applies to blk-throttle.
> Our current implementation essentially collapses in the face of a
> write-heavy workload.

I went through the code and couldn't find where blk-throttle is
discriminating async IOs. Were you saying that blk-throttle currently
doesn't throttle them because those IOs aren't associated with the dirtying
task? If so, note that it's different from cfq, which explicitly assigns
all async IOs when choosing cfqq, even if we fix tagging.

Thanks.

-- tejun
* Re: [Lsf] [RFC] writeback and cgroup
From: Peter Zijlstra @ 2012-04-14 11:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Vivek Goyal, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups

On Wed, 2012-04-04 at 11:49 -0700, Tejun Heo wrote:
> > - How to handle NFS.
>
> As said above, maybe through network based bdi pressure propagation,
> maybe some other special case mechanism. Unsure, but I don't think
> this concern should dictate the whole design.

NFS has a custom bdi implementation and implements congestion control
based on the number of outstanding writeback pages. See
fs/nfs/write.c:nfs_{set,end}_page_writeback.

All !block based filesystems have their own bdi implementation; I'm not
sure about the congestion implementation of anything other than NFS,
though.
* Re: [RFC] writeback and cgroup
From: Tejun Heo @ 2012-04-04 23:02 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, Fengguang Wu

Hello, Vivek.

On Wed, Apr 04, 2012 at 04:32:39PM -0400, Vivek Goyal wrote:
> > Let's say we have iops/bps limitation applied on top of proportional IO
> > distribution
>
> We already do that. First IO is subjected to throttling limit and only
> then it is passed to the elevator to do the proportional IO. So throttling
> is already stacked on top of proportional IO. The only question is
> should it be pushed to even higher layers or not.

Yeah, I know we already can do that. I was trying to give an example of a
non-trivial IO limit configuration.

> So split model is definitely confusing. Anyway, block layer will not
> apply the limits again as flusher IO will go in root cgroup which
> generally is unthrottled. Or flusher could mark the bios with a flag
> saying "do not throttle" these bios again as they have been throttled
> already. So throttling again is probably not an issue.
>
> In summary, agreed that split is confusing and it fills a gap existing
> today.

It's not only confusing. It's broken. So, what you're saying is that
there's no provision to orchestrate between buffered writes and other
types of IOs. It would essentially work as if there were two separate
controls, each controlling one of two heavily interacting parts, with no
designed provision between them. What the....

-- tejun
* Re: [RFC] writeback and cgroup
From: Vivek Goyal @ 2012-04-05 17:13 UTC (permalink / raw)
To: Tejun Heo
Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, Fengguang Wu

On Thu, Apr 05, 2012 at 09:38:54AM -0700, Tejun Heo wrote:
> Hey, Vivek.
>
> On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:
> > > I am not sure what you are trying to say here. But primarily blk-throttle
> > > will throttle read and direct IO. Buffered writes will go to the root
> > > cgroup, which is typically unthrottled.
> >
> > Ooh, my bad then. Anyways, then the same applies to blk-throttle.
> > Our current implementation essentially collapses in the face of a
> > write-heavy workload.
>
> I went through the code and couldn't find where blk-throttle is
> discriminating async IOs. Were you saying that blk-throttle currently
> doesn't throttle because those IOs aren't associated with the dirtying
> task?

Yes, that's what I meant. Currently most of the async IO comes from the
flusher thread, which is in the root cgroup. So all the async IO will be
in the root group, and we typically keep the root group unthrottled. Sorry
for the confusion here.

> If so, note that it's different from cfq which explicitly
> assigns all async IOs when choosing cfqq even if we fix tagging.

Yes. So if we can properly account for the submitter, then for
blk-throttle async IO will go in the right cgroup. Unlike CFQ, there is no
hard-coded logic to keep async IO in a particular group. It is just a
matter of getting the right cgroup information.

Thanks
Vivek
* Re: [RFC] writeback and cgroup @ 2012-04-05 17:13 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-05 17:13 UTC (permalink / raw) To: Tejun Heo Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Thu, Apr 05, 2012 at 09:38:54AM -0700, Tejun Heo wrote: > Hey, Vivek. > > On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote: > > > I am not sure what are you trying to say here. But primarily blk-throttle > > > will throttle read and direct IO. Buffered writes will go to root cgroup > > > which is typically unthrottled. > > > > Ooh, my bad then. Anyways, then the same applies to blk-throttle. > > Our current implementation essentially collapses at the face of > > write-heavy workload. > > I went through the code and couldn't find where blk-throttle is > discriminating async IOs. Were you saying that blk-throttle currently > doesn't throttle because those IOs aren't associated with the dirtying > task? Yes that's what I meant. Currently most of the async IO will come from flusher thread which is in root cgroup. So all the async IO will be in root group and we typically keep root group unthrottled. Sorry for the confusion here. > If so, note that it's different from cfq which explicitly > assigns all async IOs when choosing cfqq even if we fix tagging. Yes. So if we can properly account for submitter, and for blk-throttle, async IO will go in right cgroup. Unlike CFQ, there is no hard coded logic to keep async IO in a particular group. It is just a matter of getting the right cgroup information. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-05 17:13 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-05 17:13 UTC (permalink / raw) To: Tejun Heo Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Thu, Apr 05, 2012 at 09:38:54AM -0700, Tejun Heo wrote: > Hey, Vivek. > > On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote: > > > I am not sure what are you trying to say here. But primarily blk-throttle > > > will throttle read and direct IO. Buffered writes will go to root cgroup > > > which is typically unthrottled. > > > > Ooh, my bad then. Anyways, then the same applies to blk-throttle. > > Our current implementation essentially collapses at the face of > > write-heavy workload. > > I went through the code and couldn't find where blk-throttle is > discriminating async IOs. Were you saying that blk-throttle currently > doesn't throttle because those IOs aren't associated with the dirtying > task? Yes that's what I meant. Currently most of the async IO will come from flusher thread which is in root cgroup. So all the async IO will be in root group and we typically keep root group unthrottled. Sorry for the confusion here. > If so, note that it's different from cfq which explicitly > assigns all async IOs when choosing cfqq even if we fix tagging. Yes. So if we can properly account for submitter, and for blk-throttle, async IO will go in right cgroup. Unlike CFQ, there is no hard coded logic to keep async IO in a particular group. It is just a matter of getting the right cgroup information. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org> 2012-04-04 19:23 ` Steve French @ 2012-04-14 11:53 ` Peter Zijlstra 2012-04-05 16:38 ` Tejun Heo 2012-04-14 11:53 ` [Lsf] " Peter Zijlstra 3 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-14 11:53 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Wed, 2012-04-04 at 11:49 -0700, Tejun Heo wrote: > > - How to handle NFS. > > As said above, maybe through network based bdi pressure propagation, > Maybe some other special case mechanism. Unsure but I don't think > this concern should dictate the whole design. NFS has a custom bdi implementation and implements congestion control based on the number of outstanding writeback pages. See fs/nfs/write.c:nfs_{set,end}_page_writeback All !block based filesystems have their own BDI implementation, I'm not sure on the congestion implementation of anything other than NFS though. ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-14 11:53 ` Peter Zijlstra @ 2012-04-16 1:25 ` Steve French 0 siblings, 0 replies; 261+ messages in thread From: Steve French @ 2012-04-16 1:25 UTC (permalink / raw) To: linux-cifs-u79uwXL29TY76Z2rM5mHXA This long thread on linux-mm and linux-fsdevel has been discussing writeback, throttling, cgroups etc. This post reminded me that we should look more carefully at the cifs bdi implementation, compare to nfs, and also check what needs to be improved in the bdi implementation to handle smb2 credits. It will be interesting to see if that will help writeback. On Sat, Apr 14, 2012 at 6:53 AM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > On Wed, 2012-04-04 at 11:49 -0700, Tejun Heo wrote: > > > - How to handle NFS. > > > > As said above, maybe through network based bdi pressure propagation, > > Maybe some other special case mechanism. Unsure but I don't think > > this concern should dictate the whole design. > > NFS has a custom bdi implementation and implements congestion control > based on the number of outstanding writeback pages. > > See fs/nfs/write.c:nfs_{set,end}_page_writeback > > All !block based filesystems have their own BDI implementation, I'm not > sure on the congestion implementation of anything other than NFS though. > _______________________________________________ > Lsf mailing list > Lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org > https://lists.linuxfoundation.org/mailman/listinfo/lsf -- Thanks, Steve ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-04 15:36 ` Steve French 2012-04-04 18:49 ` Tejun Heo 2012-04-07 8:00 ` Jan Kara 2 siblings, 0 replies; 261+ messages in thread From: Steve French @ 2012-04-04 15:36 UTC (permalink / raw) To: Vivek Goyal Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Wed, Apr 4, 2012 at 9:51 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > > Hi Tejun, > > Thanks for the RFC and looking into this issue. Few thoughts inline. > > [..] >> IIUC, without cgroup, the current writeback code works more or less >> like this. Throwing in cgroup doesn't really change the fundamental >> design. Instead of a single pipe going down, we just have multiple >> pipes to the same device, each of which should be treated separately. >> Of course, a spinning disk can't be divided that easily and their >> performance characteristics will be inter-dependent, but the place to >> solve that problem is where the problem is, the block layer. > > How do you take care of throttling IO in the NFS case in this model? Current > throttling logic is tied to the block device, and in the case of NFS there is no > block device. Similarly, smb2 gets congestion info (number of "credits") returned from the server on every response - but I am not sure why congestion control is tied to the block device, when this creates problems for network file systems. -- Thanks, Steve ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-04 15:36 ` Steve French @ 2012-04-04 18:49 ` Tejun Heo 2012-04-07 8:00 ` Jan Kara 2 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-04 18:49 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu Hey, Vivek. On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote: > On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > > IIUC, without cgroup, the current writeback code works more or less > > like this. Throwing in cgroup doesn't really change the fundamental > > design. Instead of a single pipe going down, we just have multiple > > pipes to the same device, each of which should be treated separately. > > Of course, a spinning disk can't be divided that easily and their > > performance characteristics will be inter-dependent, but the place to > > solve that problem is where the problem is, the block layer. > > How do you take care of throttling IO in the NFS case in this model? Current > throttling logic is tied to the block device, and in the case of NFS there is no > block device. In principle, I don't think it has to be any different. A filesystem's interface to the underlying device is through bdi. If a fs is block backed, block pressure should be propagated through bdi, which should be mostly trivial. If a fs is network backed, we can implement a mechanism for network backed bdis, so that they can relay the pressure from the server side to the local fs users.
That said, network filesystems often show different behaviors and use different mechanisms for various reasons, and it wouldn't be too surprising if something different would fit them better here or we might need something supplemental to the usual mechanism. > [..] > > In the discussion, for such implementation, the following obstacles > > were identified. > > > > * There are a lot of cases where IOs are issued by a task which isn't > > the originator. ie. Writeback issues IOs for pages which are > > dirtied by some other tasks. So, by the time an IO reaches the > > block layer, we don't know which cgroup the IO belongs to. > > > > Recently, the block layer has grown support to attach a task to a bio, > > which causes the bio to be handled as if it were issued by the > > associated task regardless of the actual issuing task. It currently > > only allows attaching %current to a bio - bio_associate_current() - > > but changing it to support other tasks is trivial. > > > > We'll need to update the async issuers to tag the IOs they issue but > > the mechanism is already there. > > Most likely this tagging will take place in "struct page" and I am not > sure if we will be allowed to grow the size of "struct page" for this reason. With memcg enabled, we are already doing that, and IIUC Jan and Fengguang think that using inode granularity should be good enough for writeback blaming. > > * There's a single request pool shared by all issuers per request > > queue. This can lead to priority inversion among cgroups. Note > > that the problem also exists without cgroups. A lower ioprio issuer may > > be holding a request, holding back a highprio issuer. > > > > We'll need to make request allocation cgroup (and hopefully ioprio) > > aware. Probably in the form of separate request pools. This will > > take some work but I don't think this will be too challenging. I'll > > work on it. > > This should be doable.
I had implemented it long back with a single request pool but internal limits for each group. That is, block the task in the group if the group has enough pending requests allocated from the pool. But separate request pools should work equally well. > > Just that it conflicts a bit with the current definition of q->nr_requests, > which specifies the number of total outstanding requests on the queue. Once > you make the pool per queue, I guess this limit will have to be > transformed into a per-group upper limit. I'm not sure about the details yet. I *think* the suckiest part is the actual allocation part. We're deferring cgroup - request_queue association until actual usage and depending on atomic allocations to create those associations on the IO path. Doing the same for requests might not be too pleasant. Hmm.... allocation failure handling on that path is already broken BTW. Maybe we just need to get the fallback behavior properly working. Unsure. > > * cfq cgroup policy throws all async IOs, which all buffered writes > > are, into the shared cgroup regardless of the actual cgroup. This > > behavior is, I believe, mostly historical and changing it isn't > > difficult. Prolly only few tens of lines of changes. This may > > cause significant changes to actual IO behavior with cgroups tho. I > > personally think the previous behavior was too wrong to keep (the > > weight was completely ignored for buffered writes) but we may want > > to introduce a switch to toggle between the two behaviors. > I had kept all buffered writes in the same cgroup (root cgroup) for a few > reasons. > > - Because of the single request descriptor pool for writes, anyway one writer > gets backlogged behind the other. So creating separate async queues per > group is not going to help. > > - Writeback logic was not cgroup aware. So it might not send enough IO > from each writer to maintain parallelism. So creating separate async > queues did not make sense till that was fixed.
Yeah, the above is why I find the "buffered writes need separate controls because cfq doesn't distinguish async writes" argument very ironic. We introduce one quirk to compensate for shortages in the other part and then later we work around that quirk in that other part? I mean, that's just twisted. > - As you said, it is historical also. We prioritize READS at the expense > of writes. Now by putting buffered/async writes in a separate group, we > might end up prioritizing a group's async write over another group's > synchronous read. How many people really want that behavior? To me, > keeping service differentiation among the sync IO matters most. Even > if all async IO is treated the same, I guess not many people might care. While segregation of async IOs may not matter in some cases, it does matter to many other use cases, so it seems to me that we hard-coded that optimization decision without thinking too much about it. For a lot of container-type use cases, the current implementation is nearly useless (I know of cases where people are explicitly patching for separate async queues). At the same time, switching the default behavior *may* disturb some of the current users, and that's why I'm thinking about having a switch for the new behavior. > > Note that blk-throttle doesn't have this problem. > I am not sure what you are trying to say here. But primarily blk-throttle > will throttle read and direct IO. Buffered writes will go to the root cgroup, > which is typically unthrottled. Ooh, my bad then. Anyway, then the same applies to blk-throttle. Our current implementation essentially collapses in the face of a write-heavy workload. > > * Unlike dirty data pages, metadata tends to have strict ordering > > requirements and thus is susceptible to priority inversion. Two > > solutions were suggested - 1. allow overdrawal for metadata writes so > > that low-prio metadata writes don't block the whole FS, 2.
provide > > an interface to query and wait for bdi-cgroup congestion which can > > be called from FS metadata paths to throttle metadata operations > > before they enter the stream of ordered operations. > > So that probably will mean changing the order of operations also. IIUC, > in case of fsync (ordered mode), we opened a meta data transaction first, > then tried to flush all the cached data and then flush metadata. So if > fsync is throttled, all the metadata operations behind it will get > serialized for ext3/ext4. > > So you seem to be suggesting that we change the design so that metadata > operation does not thrown into ordered stream till we have finished > writing all the data back to disk? I am not a filesystem developer, so > I don't know how feasible this change is. Jan explained it to me and I don't think it requires extensive changes to the filesystem. IIUC, filesystems would just block tasks creating journal entry while its matching bdi is congested and that's the extent of the necessary change. > This is just one of the points. In the past while talking to Dave Chinner, > he mentioned that in XFS, if two cgroups fall into same allocation group > then there were cases where IO of one cgroup can get serialized behind > other. > > In general, the core of the issue is that filesystems are not cgroup aware > and if you do throttling below filesystems, then invariably one or other > serialization issue will come up and I am concerned that we will be constantly > fixing those serialization issues. Or the desgin point could be so central > to filesystem design that it can't be changed. So, the idea is to avoid allowing any congested cgroup to enter serialized journal. As there's time gap until journal commit, the bdi might be congested by the commit time. In that case, META writes get to overdraw cgroup limits to avoid causing priority inversion. 
I think we should be able to get most of this working with a bdi congestion check at the front and a limit bypass for META for now. We can worry about overdrawing later. > In general, if you do throttling deeper in the stack and build back > pressure, then all the layers sitting above should be cgroup aware > to avoid problems. Two layers identified so far are writeback and > filesystems. Is it really worth the complexity? How about doing > throttling in higher layers when IO is entering the kernel and > keep proportional IO logic at the lowest level and current mechanism > of building pressure continues to work? First, I just don't think it's the right design. It's a rather abstract statement but I want to emphasize that having the "right" design, in the sense that we look at the overall picture and put configs, controls and other logics where they belong in the structure that their roles point to, tends to make long-term development and maintenance much easier in ways which may not be immediately foreseeable, for both technical and social reasons - logical structuring and layering keep us sane and make newcomers' lives at least bearable. Secondly, I don't think it'll be a lot of added complexity. We *need* to fix all the said shortcomings in the block layer for proper cgroup support anyway, right? Propagating that support upwards doesn't take too much code. Other than the metadata thing, it mostly just requires updates to the writeback code such that it deals with bdi-cgroup combinations instead of individual cgroups. They'll surely require some adjustments but we're not gonna be burdening the main paths with cgroup awareness. cgroup support would just make the existing implementation work on finer-grained domains. Thirdly, I don't see how writeback can control all the IOs. I mean, what about reads or direct IOs? It's not like IO devices have separate channels for those different types of IOs. They interact heavily.
Let's say we have iops/bps limitation applied on top of proportional IO distribution, or a device holds two partitions and one of them is being used for direct IO w/o filesystems. How would that work? I think the question goes even deeper: what do the separate limits even mean? Does the IO sched have to calculate allocation of IO resource to different types of IOs and then give a "number" to writeback which in turn enforces that limit? How does the elevator know what number to give? Is the number iops or bps or weight? If the iosched doesn't know how much write workload exists, how does it distribute the surplus buffered writeback resource across different cgroups? If so, what makes the limit actually enforceable (due to inaccuracies in estimation, fluctuation in workload, delay in enforcement in different layers and whatnot) except for the block layer applying the limit *again* on the resulting stream of combined IOs? Fourthly, having clear layering usually means much more flexibility. The assumptions about IO characteristics that we have are still mostly based on devices with spindles, probably because they're still causing the most amount of pain. The assumptions keep changing and if we get the layering correct, we can mostly deal with changes at the layers concerning them - ie. in the block layer. Maybe we'll have a different iosched or cfq can be evolved to cover the new cases, but the required adaptation would be logical and while upper layers might need some adjustments they wouldn't need any major overhaul. They'll still be working from back pressure from IO. So, the above are the reasons why I don't like the idea of splitting IO control across multiple layers, well the ones that I can think of right now anyway. I'm currently feeling rather strongly about them in the sense of "oh no, this is about to be messed up" but maybe I'm just not seeing what Fengguang is seeing. I'll keep discussing there.
> So in general throttling at block layer and building back pressure is > fine. I am concerned about two cases. > > - How to handle NFS. As said above, maybe through network based bdi pressure propagation, Maybe some other special case mechanism. Unsure but I don't think this concern should dictate the whole design. > - Do filesystem developers agree with this approach and are they willing > to address any serialization issues arising due to this design. Jan, can you please fill in? Did I understand it correctly? Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-04 15:36 ` Steve French 2012-04-04 18:49 ` Tejun Heo @ 2012-04-07 8:00 ` Jan Kara 2 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-07 8:00 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu Hi Vivek, On Wed 04-04-12 10:51:34, Vivek Goyal wrote: > On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > [..] > > IIUC, without cgroup, the current writeback code works more or less > > like this. Throwing in cgroup doesn't really change the fundamental > > design. Instead of a single pipe going down, we just have multiple > > pipes to the same device, each of which should be treated separately. > > Of course, a spinning disk can't be divided that easily and their > > performance characteristics will be inter-dependent, but the place to > > solve that problem is where the problem is, the block layer. > > How do you take care of throttling IO in the NFS case in this model? Current > throttling logic is tied to the block device, and in the case of NFS there is no > block device. Yeah, for throttling NFS or other network filesystems we'd have to come up with some throttling mechanism at some other level. The problem with throttling at higher levels is that you have to somehow extract information from lower levels about the amount of work, so I'm not completely certain now where the right place would be. Possibly it also depends on the intended use case - so far I don't know about any real user for this functionality...
> [..] > > In the discussion, for such implementation, the following obstacles > > were identified. > > > > * There are a lot of cases where IOs are issued by a task which isn't > > the originator. ie. Writeback issues IOs for pages which are > > dirtied by some other tasks. So, by the time an IO reaches the > > block layer, we don't know which cgroup the IO belongs to. > > > > Recently, the block layer has grown support to attach a task to a bio, > > which causes the bio to be handled as if it were issued by the > > associated task regardless of the actual issuing task. It currently > > only allows attaching %current to a bio - bio_associate_current() - > > but changing it to support other tasks is trivial. > > > > We'll need to update the async issuers to tag the IOs they issue but > > the mechanism is already there. > > Most likely this tagging will take place in "struct page" and I am not > sure if we will be allowed to grow the size of "struct page" for this reason. We can tag inodes and then bios so this should be fine. > > * Unlike dirty data pages, metadata tends to have strict ordering > > requirements and thus is susceptible to priority inversion. Two > > solutions were suggested - 1. allow overdrawal for metadata writes so > > that low-prio metadata writes don't block the whole FS, 2. provide > > an interface to query and wait for bdi-cgroup congestion which can > > be called from FS metadata paths to throttle metadata operations > > before they enter the stream of ordered operations. > > So that probably will mean changing the order of operations also. IIUC, > in the case of fsync (ordered mode), we opened a metadata transaction first, > then tried to flush all the cached data and then flush metadata. So if > fsync is throttled, all the metadata operations behind it will get > serialized for ext3/ext4.
> So you seem to be suggesting that we change the design so that a metadata > operation is not thrown into the ordered stream till we have finished > writing all the data back to disk? I am not a filesystem developer, so > I don't know how feasible this change is. > > This is just one of the points. In the past while talking to Dave Chinner, > he mentioned that in XFS, if two cgroups fall into the same allocation group > then there were cases where IO of one cgroup can get serialized behind the > other. > > In general, the core of the issue is that filesystems are not cgroup aware > and if you do throttling below filesystems, then invariably one or another > serialization issue will come up and I am concerned that we will be constantly > fixing those serialization issues. Or the design point could be so central > to filesystem design that it can't be changed. We talked about this at LSF and Dave Chinner had the idea that we could make processes wait at the time when a transaction is started. At that time we don't hold any global locks, so a process can be throttled without serializing other processes. This effectively builds some cgroup awareness into filesystems, but a pretty simple one, so it should be doable. > In general, if you do throttling deeper in the stack and build back > pressure, then all the layers sitting above should be cgroup aware > to avoid problems. Two layers identified so far are writeback and > filesystems. Is it really worth the complexity? How about doing > throttling in higher layers when IO is entering the kernel and > keep proportional IO logic at the lowest level and current mechanism > of building pressure continues to work? I would like to keep a single throttling mechanism for different limiting methods - i.e. handle proportional IO the same way as IO hard limits. So we cannot really rely on the fact that throttling is work preserving.
The advantage of throttling at IO layer is that we can keep all the details inside it and only export pretty minimal information (like is bdi congested for given cgroup) to upper layers. If we wanted to do throttling at upper layers (such as Fengguang's buffered write throttling), we need to export the internal details to allow effective throttling... Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-04 14:51 ` Vivek Goyal @ 2012-04-07 8:00 ` Jan Kara -1 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-07 8:00 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hi Vivek, On Wed 04-04-12 10:51:34, Vivek Goyal wrote: > On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > [..] > > IIUC, without cgroup, the current writeback code works more or less > > like this. Throwing in cgroup doesn't really change the fundamental > > design. Instead of a single pipe going down, we just have multiple > > pipes to the same device, each of which should be treated separately. > > Of course, a spinning disk can't be divided that easily and their > > performance characteristics will be inter-dependent, but the place to > > solve that problem is where the problem is, the block layer. > > How do you take care of thorottling IO to NFS case in this model? Current > throttling logic is tied to block device and in case of NFS, there is no > block device. Yeah, for throttling NFS or other network filesystems we'd have to come up with some throttling mechanism at some other level. The problem with throttling at higher levels is that you have to somehow extract information from lower levels about amount of work so I'm not completely certain now, where would be the right place. Possibly it also depends on the intended usecase - so far I don't know about any real user for this functionality... > [..] > > In the discussion, for such implementation, the following obstacles > > were identified. > > > > * There are a lot of cases where IOs are issued by a task which isn't > > the originiator. ie. Writeback issues IOs for pages which are > > dirtied by some other tasks. 
So, by the time an IO reaches the > > block layer, we don't know which cgroup the IO belongs to. > > > > Recently, block layer has grown support to attach a task to a bio > > which causes the bio to be handled as if it were issued by the > > associated task regardless of the actual issuing task. It currently > > only allows attaching %current to a bio - bio_associate_current() - > > but changing it to support other tasks is trivial. > > > > We'll need to update the async issuers to tag the IOs they issue but > > the mechanism is already there. > > Most likely this tagging will take place in "struct page" and I am not > sure if we will be allowed to grow size of "struct page" for this reason. We can tag inodes and then bios so this should be fine. > > * Unlike dirty data pages, metadata tends to have strict ordering > > requirements and thus is susceptible to priority inversion. Two > > solutions were suggested - 1. allow overdrawl for metadata writes so > > that low prio metadata writes don't block the whole FS, 2. provide > > an interface to query and wait for bdi-cgroup congestion which can > > be called from FS metadata paths to throttle metadata operations > > before they enter the stream of ordered operations. > > So that probably will mean changing the order of operations also. IIUC, > in case of fsync (ordered mode), we opened a meta data transaction first, > then tried to flush all the cached data and then flush metadata. So if > fsync is throttled, all the metadata operations behind it will get > serialized for ext3/ext4. > > So you seem to be suggesting that we change the design so that metadata > operation does not thrown into ordered stream till we have finished > writing all the data back to disk? I am not a filesystem developer, so > I don't know how feasible this change is. > > This is just one of the points. 
In the past while talking to Dave Chinner, > he mentioned that in XFS, if two cgroups fall into same allocation group > then there were cases where IO of one cgroup can get serialized behind > other. > > In general, the core of the issue is that filesystems are not cgroup aware > and if you do throttling below filesystems, then invariably one or other > serialization issue will come up and I am concerned that we will be constantly > fixing those serialization issues. Or the desgin point could be so central > to filesystem design that it can't be changed. We talked about this at LSF and Dave Chinner had the idea that we could make processes wait at the time when a transaction is started. At that time we don't hold any global locks so process can be throttled without serializing other processes. This effectively builds some cgroup awareness into filesystems but pretty simple one so it should be doable. > In general, if you do throttling deeper in the stakc and build back > pressure, then all the layers sitting above should be cgroup aware > to avoid problems. Two layers identified so far are writeback and > filesystems. Is it really worth the complexity. How about doing > throttling in higher layers when IO is entering the kernel and > keep proportional IO logic at the lowest level and current mechanism > of building pressure continues to work? I would like to keep single throttling mechanism for different limitting methods - i.e. handle proportional IO the same way as IO hard limits. So we cannot really rely on the fact that throttling is work preserving. The advantage of throttling at IO layer is that we can keep all the details inside it and only export pretty minimal information (like is bdi congested for given cgroup) to upper layers. If we wanted to do throttling at upper layers (such as Fengguang's buffered write throttling), we need to export the internal details to allow effective throttling... 
Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120407080027.GA2584-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> @ 2012-04-10 16:23 ` Steve French 2012-04-10 18:06 ` Vivek Goyal 1 sibling, 0 replies; 261+ messages in thread From: Steve French @ 2012-04-10 16:23 UTC (permalink / raw) To: Jan Kara Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal On Sat, Apr 7, 2012 at 3:00 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote: > Hi Vivek, > > On Wed 04-04-12 10:51:34, Vivek Goyal wrote: >> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: >> [..] >> > IIUC, without cgroup, the current writeback code works more or less >> > like this. Throwing in cgroup doesn't really change the fundamental >> > design. Instead of a single pipe going down, we just have multiple >> > pipes to the same device, each of which should be treated separately. >> > Of course, a spinning disk can't be divided that easily and their >> > performance characteristics will be inter-dependent, but the place to >> > solve that problem is where the problem is, the block layer. >> >> How do you take care of throttling IO in the NFS case in this model? Current >> throttling logic is tied to the block device and in the case of NFS, there is no >> block device. > Yeah, for throttling NFS or other network filesystems we'd have to come > up with some throttling mechanism at some other level. The problem with > throttling at higher levels is that you have to somehow extract information > from lower levels about the amount of work so I'm not completely certain now, > where would be the right place. 
Possibly it also depends on the intended > use case - so far I don't know about any real user for this functionality... Remember to distinguish between the two ends of the network file system. There are slightly different problems. The client has to be able to expose the number of requests (and size of writes, or equivalently the number of pages it can write at one time) so that writeback is not done too aggressively. File servers have to be able to dynamically discover the i/o limits of the underlying volume (not the block device, but potentially a pool of devices) so the server can tell the client how much i/o it can send. For an SMB2 server (Samba), and eventually for NFS, knowing how many simultaneous requests it can support will allow them to sanely set the number of "credits" on each response - ie tell the client how many requests are allowed in flight to a particular export. In the case of block device throttling - other than the file system internally using such APIs, who would use block-device-specific throttling? - only the file system knows where it wants to put hot data, and in the case of btrfs, doesn't the file system manage the storage pool? The block device should be transparent to the user in the long run, and only the volume visible. -- Thanks, Steve ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120407080027.GA2584-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> 2012-04-10 16:23 ` [Lsf] " Steve French @ 2012-04-10 18:06 ` Vivek Goyal 1 sibling, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-10 18:06 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote: Hi Jan, [..] > > In general, the core of the issue is that filesystems are not cgroup aware > > and if you do throttling below filesystems, then invariably one or other > > serialization issue will come up and I am concerned that we will be constantly > > fixing those serialization issues. Or the design point could be so central > > to filesystem design that it can't be changed. > We talked about this at LSF and Dave Chinner had the idea that we could > make processes wait at the time when a transaction is started. At that time > we don't hold any global locks so a process can be throttled without > serializing other processes. This effectively builds some cgroup awareness > into filesystems but a pretty simple one so it should be doable. Ok. So what is the meaning of "make process wait" here? What will it be dependent on? I am thinking of a case where a process has 100MB of dirty data, has a 10MB/s write limit and issues fsync. So before that process is able to open a transaction, one needs to wait at least 10 seconds (assuming other processes are not doing IO in the same cgroup). 
If this wait is based on making sure all dirty data has been written back before opening the transaction, then it will work without any interaction with the block layer and sounds more feasible. > > > In general, if you do throttling deeper in the stack and build back > pressure, then all the layers sitting above should be cgroup aware > to avoid problems. Two layers identified so far are writeback and > filesystems. Is it really worth the complexity? How about doing > throttling in higher layers when IO is entering the kernel and > keeping proportional IO logic at the lowest level so the current mechanism > of building pressure continues to work? > I would like to keep a single throttling mechanism for different limiting > methods - i.e. handle proportional IO the same way as IO hard limits. So we > cannot really rely on the fact that throttling is work preserving. > > The advantage of throttling at the IO layer is that we can keep all the details > inside it and only export pretty minimal information (like whether the bdi is congested > for a given cgroup) to upper layers. If we wanted to do throttling at upper > layers (such as Fengguang's buffered write throttling), we would need to export > the internal details to allow effective throttling... For absolute throttling we really don't have to expose any details. In fact, in my implementation of throttling buffered writes, I just exported a single function to be called in the bdi dirty rate limiting code. The caller will simply sleep long enough depending on the size of IO it is doing and how many other processes are doing IO in the same cgroup. So the implementation was still in the block layer and only a single function was exposed to higher layers. One more factor makes absolute throttling interesting, and that is global throttling as opposed to per-device throttling. For example, in the case of btrfs, there is no single stacked device on which to put total throttling limits. 
So if filesystems can handle the serialization issue, then the back-pressure method looks cleaner (though complex). Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <CAH2r5mvLVnM3Se5vBBsYzwaz5Ckp3i6SVnGp2T0XaGe9_u8YYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2012-04-10 18:16 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-10 18:16 UTC (permalink / raw) To: Steve French Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA On Tue, Apr 10, 2012 at 11:23:16AM -0500, Steve French wrote: [..] > In the case of block device throttling - other than the file system > internally using such APIs who would use block device specific > throttling - only the file system knows where it wants to put hot data, > and in the case of btrfs, doesn't the file system manage the > storage pool. The block device should be transparent to the > user in the long run, and only the volume visible. This is a good point. I guess this goes back to Jan's question of what's the intended use case of absolute throttling. Having a dependency on per-device limits has the drawback that the user must know the details of the storage stack exactly, and it assumes that there is a single aggregation point of block devices (which is not true in the case of btrfs). If the user is simply looking for something like "I don't want a backup process to be writing at more than 50MB/s" (so that other processes doing IO to the same filesystem are affected less), then it is a case of global throttling, and per-device throttling really does not gel well. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-10 18:06 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-10 18:06 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote: Hi Jan, [..] > > In general, the core of the issue is that filesystems are not cgroup aware > > and if you do throttling below filesystems, then invariably one or other > > serialization issue will come up and I am concerned that we will be constantly > > fixing those serialization issues. Or the desgin point could be so central > > to filesystem design that it can't be changed. > We talked about this at LSF and Dave Chinner had the idea that we could > make processes wait at the time when a transaction is started. At that time > we don't hold any global locks so process can be throttled without > serializing other processes. This effectively builds some cgroup awareness > into filesystems but pretty simple one so it should be doable. Ok. So what is the meaning of "make process wait" here? What it will be dependent on? I am thinking of a case where a process has 100MB of dirty data, has 10MB/s write limit and it issues fsync. So before that process is able to open a transaction, one needs to wait atleast 10seconds (assuming other processes are not doing IO in same cgroup). If this wait is based on making sure all dirty data has been written back before opening transaction, then it will work without any interaction with block layer and sounds more feasible. > > > In general, if you do throttling deeper in the stakc and build back > > pressure, then all the layers sitting above should be cgroup aware > > to avoid problems. Two layers identified so far are writeback and > > filesystems. Is it really worth the complexity. 
How about doing > > throttling in higher layers when IO is entering the kernel and > > keep proportional IO logic at the lowest level and the current mechanism > > of building pressure continues to work? > I would like to keep a single throttling mechanism for different limiting > methods - i.e. handle proportional IO the same way as IO hard limits. So we > cannot really rely on the fact that throttling is work preserving. > > The advantage of throttling at the IO layer is that we can keep all the details > inside it and only export pretty minimal information (like whether the bdi is congested > for a given cgroup) to upper layers. If we wanted to do throttling at upper > layers (such as Fengguang's buffered write throttling), we need to export > the internal details to allow effective throttling... For absolute throttling we really don't have to expose any details. In fact, in my implementation of throttling buffered writes, I just exported a single function to be called in the bdi dirty rate limit path. The caller will simply sleep long enough depending on the size of IO it is doing and how many other processes are doing IO in the same cgroup. So the implementation was still in the block layer and only a single function was exposed to higher layers. One more factor makes absolute throttling interesting and that is global throttling and not per-device throttling. For example, in the case of btrfs, there is no single stacked device on which to put total throttling limits. So if filesystems can handle the serialization issue, then the back pressure method looks more clean (though complex). Thanks Vivek
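The "single exported function" Vivek describes can be pictured with a small userspace model. This is only a sketch of the shape of the hook, with hypothetical names; the real implementation lives in blk-throttle and would sleep inside the kernel rather than return a value:

```c
/*
 * Userspace model of the one hook described above: given the size of
 * the IO a process is dirtying, the cgroup's absolute write limit, and
 * how many processes in that cgroup are dirtying concurrently, return
 * how long the caller should sleep.  All names are hypothetical.
 */
static double throttle_sleep_seconds(unsigned long io_bytes,
                                     unsigned long cgroup_limit_bps,
                                     unsigned int nr_dirtiers)
{
    if (cgroup_limit_bps == 0 || nr_dirtiers == 0)
        return 0.0;               /* no absolute limit, or nothing sharing it */

    /* The cgroup's bandwidth is shared by every dirtier, so each one
     * effectively sees limit / nr_dirtiers and must sleep until its
     * own IO fits under that share. */
    double per_task_bps = (double)cgroup_limit_bps / nr_dirtiers;
    return (double)io_bytes / per_task_bps;
}
```

With the numbers from the example above (100MB of dirty data, a 10MB/s limit, one dirtier) this gives the 10 second wait; doubling the number of dirtiers doubles each process's wait, which is how a shared absolute limit would be enforced without exposing any block-layer internals.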
* Re: [RFC] writeback and cgroup 2012-04-10 18:06 ` Vivek Goyal @ 2012-04-10 21:05 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-10 21:05 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel, Fengguang Wu Hi Vivek, On Tue 10-04-12 14:06:53, Vivek Goyal wrote: > On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote: > > > In general, the core of the issue is that filesystems are not cgroup aware > > > and if you do throttling below filesystems, then invariably one or other > > > serialization issue will come up and I am concerned that we will be constantly > > > fixing those serialization issues. Or the design point could be so central > > > to filesystem design that it can't be changed. > > We talked about this at LSF and Dave Chinner had the idea that we could > > make processes wait at the time when a transaction is started. At that time > > we don't hold any global locks so a process can be throttled without > > serializing other processes. This effectively builds some cgroup awareness > > into filesystems but a pretty simple one, so it should be doable. > > Ok. So what is the meaning of "make process wait" here? What will it depend on? I am thinking of a case where a process has 100MB of dirty > data, has a 10MB/s write limit, and it issues fsync. So before that process > is able to open a transaction, one needs to wait at least 10 seconds > (assuming other processes are not doing IO in the same cgroup).
The original idea was that we'd have a "bdi-congested-for-cgroup" flag and the process starting a transaction will wait for this flag to get cleared before starting a new transaction. This will be easy to implement in filesystems and won't have serialization issues. But my knowledge of blk-throttle is lacking so there might be some problems with this approach. > If this wait is based on making sure all dirty data has been written back > before opening the transaction, then it will work without any interaction with > the block layer and sounds more feasible. > > > > > > In general, if you do throttling deeper in the stack and build back > > > pressure, then all the layers sitting above should be cgroup aware > > > to avoid problems. Two layers identified so far are writeback and > > > filesystems. Is it really worth the complexity? How about doing > > > throttling in higher layers when IO is entering the kernel and > > > keep proportional IO logic at the lowest level and the current mechanism > > > of building pressure continues to work? > > I would like to keep a single throttling mechanism for different limiting > > methods - i.e. handle proportional IO the same way as IO hard limits. So we > > cannot really rely on the fact that throttling is work preserving. > > > > The advantage of throttling at the IO layer is that we can keep all the details > > inside it and only export pretty minimal information (like whether the bdi is congested > > for a given cgroup) to upper layers. If we wanted to do throttling at upper > > layers (such as Fengguang's buffered write throttling), we need to export > > the internal details to allow effective throttling... > > For absolute throttling we really don't have to expose any details. In > fact, in my implementation of throttling buffered writes, I just exported > a single function to be called in the bdi dirty rate limit path. The caller will > simply sleep long enough depending on the size of IO it is doing and > how many other processes are doing IO in the same cgroup.
> > So the implementation was still in the block layer and only a single function > was exposed to higher layers. OK, I see. > One more factor makes absolute throttling interesting and that is global > throttling and not per-device throttling. For example, in the case of btrfs, > there is no single stacked device on which to put total throttling > limits. Yes. My intended interface for the throttling is the bdi. But you are right that it does not exactly match the fact that the throttling happens per device, so it might get tricky. Which brings up a question - shouldn't the throttling blk-throttle does rather happen at the bdi layer? Because the uses of the functionality I have in mind would match that better. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR
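Jan's "bdi-congested-for-cgroup" flag can be sketched as a toy userspace model; every type and name below is a hypothetical illustration of the shape of the hook, not kernel code. The sketch also makes the starvation worry concrete: if the flag never clears, the transaction never starts.

```c
#include <stdbool.h>

#define MAX_CGROUPS 64

/* Toy model of the per-(bdi, cgroup) congestion state blk-throttle
 * would maintain.  Hypothetical names throughout. */
struct bdi_model {
    bool congested[MAX_CGROUPS];   /* the "bdi-congested-for-cgroup" flags */
};

static void model_set_congested(struct bdi_model *bdi, int cg, bool on)
{
    bdi->congested[cg] = on;
}

/* What a filesystem could do before opening a transaction: back off
 * while this cgroup's view of the bdi is congested.  Returns the
 * number of backoff rounds taken; with a permanently congested group
 * the bound is the only thing preventing starvation, which is exactly
 * the problem discussed in this thread. */
static int wait_for_transaction_slot(struct bdi_model *bdi, int cg,
                                     int max_backoffs)
{
    int rounds = 0;
    while (bdi->congested[cg] && rounds < max_backoffs)
        rounds++;                  /* real code would sleep on a waitqueue */
    return rounds;
}
```

Because the check happens before the transaction is opened, no global filesystem locks are held while waiting, which is what makes this form of throttling free of the serialization issues.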
* Re: [RFC] writeback and cgroup 2012-04-10 21:05 ` Jan Kara @ 2012-04-10 21:20 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-10 21:20 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel, Fengguang Wu On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote: [..] > > Ok. So what is the meaning of "make process wait" here? What will it depend on? I am thinking of a case where a process has 100MB of dirty > > data, has a 10MB/s write limit, and it issues fsync. So before that process > > is able to open a transaction, one needs to wait at least 10 seconds > > (assuming other processes are not doing IO in the same cgroup). > The original idea was that we'd have a "bdi-congested-for-cgroup" flag > and the process starting a transaction will wait for this flag to get > cleared before starting a new transaction. This will be easy to implement > in filesystems and won't have serialization issues. But my knowledge of > blk-throttle is lacking so there might be some problems with this approach. I have implemented and posted patches for a per-bdi, per-cgroup congestion flag. The only problem I see with that is that a group might be congested for a long time because of lots of other IO happening (say direct IO), and if you keep on backing off and never submit the metadata IO (transaction), you get starved. And if you go ahead and submit IO in a congested group, we are back to the serialization issue. [..]
> > One more factor makes absolute throttling interesting and that is global > > throttling and not per-device throttling. For example, in the case of btrfs, > > there is no single stacked device on which to put total throttling > > limits. > Yes. My intended interface for the throttling is the bdi. But you are right that > it does not exactly match the fact that the throttling happens per device, > so it might get tricky. Which brings up a question - shouldn't the > throttling blk-throttle does rather happen at the bdi layer? Because the > uses of the functionality I have in mind would match that better. I guess throttling at the bdi layer will take care of the network filesystem case too? But isn't the notion of a "bdi" internal to the kernel? Users do not really configure things in terms of bdis. Also, a per-bdi limit mechanism will not solve the issue of global throttling where, in the case of btrfs, an IO might go to multiple bdis. So throttling limits would not be total but per bdi. Thanks Vivek
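The "global vs per-device" distinction Vivek raises can be illustrated with a token-bucket sketch: an absolute limit for a multi-device filesystem like btrfs behaves like one bucket charged by IO to any member device, which a set of independent per-device limits cannot express. A hypothetical userspace model, not kernel code:

```c
#include <stdbool.h>

/* One bucket for the whole filesystem: IO to any member device is
 * charged against the same budget.  All names are hypothetical. */
struct global_throttle {
    double tokens;       /* bytes that may be submitted right now */
    double rate_bps;     /* refill rate == the total absolute limit */
    double capacity;     /* maximum burst, in bytes */
};

/* Refill the shared budget for elapsed wall-clock time, clamped to
 * the burst capacity. */
static void gt_refill(struct global_throttle *gt, double elapsed_sec)
{
    gt->tokens += gt->rate_bps * elapsed_sec;
    if (gt->tokens > gt->capacity)
        gt->tokens = gt->capacity;
}

/* Charge an IO no matter which member device (bdi) it targets; a
 * false return means the submitter must wait for a refill. */
static bool gt_try_charge(struct global_throttle *gt, double bytes)
{
    if (gt->tokens < bytes)
        return false;
    gt->tokens -= bytes;
    return true;
}
```

Splitting `rate_bps` statically across the member devices would either under-utilize the limit (when IO is unbalanced) or over-admit (when all devices are busy), which is why a per-bdi or filesystem-global attachment point keeps coming up in this discussion.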
* Re: [RFC] writeback and cgroup 2012-04-10 21:20 ` Vivek Goyal @ 2012-04-10 22:24 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-10 22:24 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel, Fengguang Wu On Tue 10-04-12 17:20:41, Vivek Goyal wrote: > On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote: > > [..] > > > Ok. So what is the meaning of "make process wait" here? What will it depend on? I am thinking of a case where a process has 100MB of dirty > > > data, has a 10MB/s write limit, and it issues fsync. So before that process > > > is able to open a transaction, one needs to wait at least 10 seconds > > > (assuming other processes are not doing IO in the same cgroup). > > The original idea was that we'd have a "bdi-congested-for-cgroup" flag > > and the process starting a transaction will wait for this flag to get > > cleared before starting a new transaction. This will be easy to implement > > in filesystems and won't have serialization issues. But my knowledge of > > blk-throttle is lacking so there might be some problems with this approach. > > I have implemented and posted patches for a per-bdi, per-cgroup congestion > flag. The only problem I see with that is that a group might be congested > for a long time because of lots of other IO happening (say direct IO), and > if you keep on backing off and never submit the metadata IO (transaction), > you get starved. And if you go ahead and submit IO in a congested group, > we are back to the serialization issue.
Clearly, we mustn't throttle metadata IO once it gets to the block layer. That's why we discuss throttling of processes at transaction start after all. But I agree starvation is an issue - I originally thought blk-throttle throttles synchronously, which wouldn't have starvation issues. But when that's not the case things are a bit more tricky. We could treat the transaction start as an IO of some size (since we already have some estimate of how large a transaction will be when we are starting it) and let the transaction start only when our "virtual" IO would be submitted, but I feel that gets maybe too complicated... Maybe we could just delay the transaction start by the amount reported from the blk-throttle layer? Something along the lines of the callback for throttling you implemented? > [..] > > > One more factor makes absolute throttling interesting and that is global > > > throttling and not per-device throttling. For example, in the case of btrfs, > > > there is no single stacked device on which to put total throttling > > > limits. > > Yes. My intended interface for the throttling is the bdi. But you are right that > > it does not exactly match the fact that the throttling happens per device, > > so it might get tricky. Which brings up a question - shouldn't the > > throttling blk-throttle does rather happen at the bdi layer? Because the > > uses of the functionality I have in mind would match that better. > > I guess throttling at the bdi layer will take care of the network filesystem > case too? Yes. At least for the client side. On the server side Steve wants the server to have insight into how much IO we could push in the future, so that it can limit the number of outstanding requests, if I understand him right. I'm not sure we really want / are able to provide this amount of knowledge to filesystems, even less to userspace... > But isn't the notion of a "bdi" internal to the kernel? Users do > not really configure things in terms of bdis. Well, it is. But we already have per-bdi tunables (e.g.
readahead) that are exported in /sys/block/<device>/queue/, so we have some precedent. > Also, a per-bdi limit mechanism will not solve the issue of global throttling > where, in the case of btrfs, an IO might go to multiple bdis. So throttling limits > would not be total but per bdi. Well, btrfs plays tricks with bdis, but there is a special bdi called "btrfs" which backs the whole filesystem, and that is what's put in sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a global bdi to work with. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR
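Jan's two alternatives above, treating the transaction start as a "virtual" IO of the estimated transaction size, or simply delaying the starter by whatever the throttling layer reports, amount to the same computation in the simplest case. A hypothetical userspace sketch (the real size estimate would come from the journaling layer's existing transaction reservation, and none of these names are actual kernel interfaces):

```c
/* Treat a transaction start as a virtual IO of est_tx_bytes: if the
 * cgroup still has budget for it, start immediately; otherwise report
 * how long the caller should be delayed at its configured rate.
 * Hypothetical model, not the blk-throttle interface. */
static double transaction_start_delay(double est_tx_bytes,
                                      double cgroup_limit_bps,
                                      double remaining_budget_bytes)
{
    if (cgroup_limit_bps <= 0.0)
        return 0.0;                          /* no absolute limit set */
    if (est_tx_bytes <= remaining_budget_bytes)
        return 0.0;                          /* fits in the current budget */
    return (est_tx_bytes - remaining_budget_bytes) / cgroup_limit_bps;
}
```

Because the delay is bounded by the transaction's own estimated size, a group that is congested purely by other IO still makes progress, which sidesteps the indefinite-backoff starvation Vivek describes for the congestion-flag approach.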
* Re: [RFC] writeback and cgroup
From: Jan Kara @ 2012-04-10 22:24 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf

On Tue 10-04-12 17:20:41, Vivek Goyal wrote:
> On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote:
> > [..]
> > > Ok. So what is the meaning of "make process wait" here? What will it
> > > be dependent on? I am thinking of a case where a process has 100MB of
> > > dirty data, has a 10MB/s write limit, and it issues fsync. So before
> > > that process is able to open a transaction, one needs to wait at
> > > least 10 seconds (assuming other processes are not doing IO in the
> > > same cgroup).
> > The original idea was that we'd have a "bdi-congested-for-cgroup" flag
> > and the process starting a transaction would wait for this flag to get
> > cleared before starting a new transaction. This would be easy to
> > implement in filesystems and won't have serialization issues. But my
> > knowledge of blk-throttle is lacking, so there might be some problems
> > with this approach.
> I have implemented and posted patches for a per-bdi, per-cgroup
> congestion flag. The only problem I see with that is that a group might
> be congested for a long time because of lots of other IO happening (say
> direct IO) and if you keep on backing off and never submit the metadata
> IO (transaction), you get starved. And if you go ahead and submit IO in
> a congested group, we are back to the serialization issue.
Clearly, we mustn't throttle metadata IO once it gets to the block layer.
That's why we discuss throttling of processes at transaction start after
all. But I agree starvation is an issue - I originally thought blk-throttle
throttles synchronously, which wouldn't have starvation issues. But when
that's not the case, things are a bit more tricky. We could treat
transaction start as an IO of some size (since we already have some
estimate of how large a transaction will be when we are starting it) and
let the transaction start only when our "virtual" IO would be submitted,
but I feel that maybe gets too complicated... Maybe we could just delay the
transaction start by the amount reported from the blk-throttle layer?
Something along the lines of the throttling callback you implemented?

> [..]
> > > One more factor makes absolute throttling interesting and that is
> > > global throttling and not per-device throttling. For example, in the
> > > case of btrfs, there is no single stacked device on which to put
> > > total throttling limits.
> > Yes. My intended interface for the throttling is the bdi. But you are
> > right that it does not exactly match the fact that the throttling
> > happens per device, so it might get tricky. Which brings up a question
> > - shouldn't the throttling blk-throttle does rather happen at the bdi
> > layer? Because the uses of the functionality I have in mind would
> > match that better.
> I guess throttling at the bdi layer will take care of the network
> filesystem case too?
Yes. At least for the client side. On the server side, Steve wants the
server to have insight into how much IO we could push in the future so that
it can limit the number of outstanding requests, if I understand him right.
I'm not sure we really want / are able to provide this amount of knowledge
to filesystems, even less to userspace...

> But isn't the notion of "bdi" internal to the kernel? The user does not
> really program things in terms of bdi.
Well, it is. But we already have per-bdi tunables (e.g. readahead) that are
exported in /sys/block/<device>/queue/, so we have some precedent.

> Also, a per-bdi limit mechanism will not solve the issue of global
> throttling where, in the case of btrfs, an IO might go to multiple
> bdi's. So throttling limits are not total but per bdi.
Well, btrfs plays tricks with bdi's, but there is a special bdi called
"btrfs" which backs the whole filesystem, and that is what's put in
sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
global bdi to work with.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
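Jan's "virtual IO" idea - charge the estimated transaction size against the cgroup's limit and delay the transaction start by whatever the throttling layer reports - can be sketched as a token bucket. The Python below is purely illustrative (all names are invented here; the real blk-throttle code works differently):

```python
class CgroupThrottle:
    """Toy token bucket standing in for a per-cgroup write limit."""

    def __init__(self, rate_bps, now=0.0):
        self.rate = rate_bps      # configured bytes per second
        self.tokens = 0.0         # accumulated IO credit, in bytes
        self.last = now

    def delay_for(self, nbytes, now):
        """Seconds to delay before `nbytes` of credit exist.

        The caller charges a "virtual" IO of the estimated transaction
        size and sleeps here, instead of submitting real metadata IO
        that could no longer be throttled once at the block layer.
        """
        self.tokens += (now - self.last) * self.rate
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return 0.0
        missing = nbytes - self.tokens
        self.tokens = 0.0
        return missing / self.rate


t = CgroupThrottle(rate_bps=10 * 2**20)   # 10MB/s cgroup limit
est_tx_size = 2**20                       # 1MB transaction estimate
print(t.delay_for(est_tx_size, now=0.0))  # -> 0.1 (sleep 100ms, then start)
```

Unlike backing off on a congestion flag, the delay here is bounded by nbytes / rate, so the transaction start cannot be starved indefinitely by other IO in the group.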
* Re: [RFC] writeback and cgroup
From: Vivek Goyal @ 2012-04-11 15:40 UTC (permalink / raw)
To: Jan Kara
Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
[..]
> > I have implemented and posted patches for a per-bdi, per-cgroup
> > congestion flag. The only problem I see with that is that a group
> > might be congested for a long time because of lots of other IO
> > happening (say direct IO) and if you keep on backing off and never
> > submit the metadata IO (transaction), you get starved. And if you go
> > ahead and submit IO in a congested group, we are back to the
> > serialization issue.
> Clearly, we mustn't throttle metadata IO once it gets to the block
> layer. That's why we discuss throttling of processes at transaction
> start after all. But I agree starvation is an issue - I originally
> thought blk-throttle throttles synchronously, which wouldn't have
> starvation issues. But when that's not the case, things are a bit more
> tricky. We could treat transaction start as an IO of some size (since
> we already have some estimate of how large a transaction will be when
> we are starting it) and let the transaction start only when our
> "virtual" IO would be submitted, but I feel that maybe gets too
> complicated... Maybe we could just delay the transaction start by the
> amount reported from the blk-throttle layer? Something along the lines
> of the throttling callback you implemented?

I think now I have lost you. It probably stems from the fact that I don't
know much about transactions and filesystems.

So all the metadata IO will happen through the journaling thread, and that
will be in the root group, which should remain unthrottled. So any journal
IO going to disk should remain unthrottled.

Now, IIRC, the fsync problem with throttling was that we had opened a
transaction but could not write it back to disk because we had to wait for
all the cached data to go to disk (which is throttled). So my question is,
can't we first wait for all the data to be flushed to disk and then open a
transaction for the metadata? Metadata will be unthrottled, so the
filesystem will not have to do any tricks like checking whether the bdi is
congested or not. IOW, can't we first wait for the dependent operation to
finish before we throw anything into the metadata stream?

[..]
> > I guess throttling at the bdi layer will take care of the network
> > filesystem case too?
> Yes. At least for the client side. On the server side, Steve wants the
> server to have insight into how much IO we could push in the future so
> that it can limit the number of outstanding requests, if I understand
> him right. I'm not sure we really want / are able to provide this
> amount of knowledge to filesystems, even less to userspace...

I am not sure what that means, but couldn't the server simply query the
bdi and read the configured rate? Then it knows at what rate IO will go to
disk and can make predictions about the future.

> > But isn't the notion of "bdi" internal to the kernel? The user does
> > not really program things in terms of bdi.
> Well, it is. But we already have per-bdi tunables (e.g. readahead) that
> are exported in /sys/block/<device>/queue/, so we have some precedent.

Ok, so they are exposed as if they were queue/device tunables but are
internally stored in the bdi and work accordingly.

> > Also, a per-bdi limit mechanism will not solve the issue of global
> > throttling where, in the case of btrfs, an IO might go to multiple
> > bdi's. So throttling limits are not total but per bdi.
> Well, btrfs plays tricks with bdi's, but there is a special bdi called
> "btrfs" which backs the whole filesystem, and that is what's put in
> sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
> global bdi to work with.

Ok, that's good to know. How would we configure this special bdi? I am
assuming there is no backing device visible in /sys/block/<device>/queue/?
The same is true for network filesystems.

Thanks
Vivek
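Vivek's ordering suggestion - flush the throttled data first, and only then open the unthrottled transaction for the metadata - can be sketched as follows. The function and numbers are hypothetical; the sketch only illustrates why the journal is never held hostage by the cgroup limit:

```python
def fsync_with_ordering(dirty_bytes, cgroup_rate_bps, commit_seconds=0.01):
    """Model of an fsync that waits for data before touching the journal.

    Phase 1: data writeback paced by the cgroup's configured rate.
    Phase 2: the metadata/journal commit, which runs in the unthrottled
    root group, so its cost is independent of the cgroup limit and no
    transaction is held open while throttled data is still in flight.
    """
    data_seconds = dirty_bytes / cgroup_rate_bps
    return data_seconds + commit_seconds


# The 100MB / 10MB/s example from upthread: ~10s of throttled data
# flushing, then a short unthrottled commit.
print(fsync_with_ordering(100 * 2**20, 10 * 2**20))  # -> 10.01
```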
* Re: [RFC] writeback and cgroup
From: Vivek Goyal @ 2012-04-11 15:45 UTC (permalink / raw)
To: Jan Kara
Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
>
> [..]
> > > I have implemented and posted patches for a per-bdi, per-cgroup
> > > congestion flag. The only problem I see with that is that a group
> > > might be congested for a long time because of lots of other IO
> > > happening (say direct IO) and if you keep on backing off and never
> > > submit the metadata IO (transaction), you get starved. And if you go
> > > ahead and submit IO in a congested group, we are back to the
> > > serialization issue.
> > Clearly, we mustn't throttle metadata IO once it gets to the block
> > layer. That's why we discuss throttling of processes at transaction
> > start after all. But I agree starvation is an issue - I originally
> > thought blk-throttle throttles synchronously, which wouldn't have
> > starvation issues.

Current bio throttling is asynchronous. A process can submit the bio and
go back and wait for the bio to finish. That bio will be queued at the
device in a per-cgroup queue and will be dispatched to the device
according to the configured IO rate for the cgroup.

The additional feature for buffered throttling (which never went upstream)
was synchronous in nature. That is, we were actively putting the writer to
sleep on a per-cgroup wait queue in the request queue and waking it up
when it could do further IO based on the cgroup limits.

Thanks
Vivek
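The asynchronous scheme Vivek describes - bios parked in a per-cgroup queue and released at the configured rate while the submitter waits elsewhere for completion - can be modeled with a simple pacing queue (illustrative Python; the names are invented):

```python
from collections import deque


class CgroupBioQueue:
    """Per-cgroup queue whose bios are dispatched at a configured rate."""

    def __init__(self, rate_bps):
        self.rate = rate_bps
        self.queue = deque()
        self.next_dispatch = 0.0  # earliest time the next bio may go out

    def submit(self, nbytes):
        """Submitter only queues and returns (asynchronous throttling);
        it sleeps later waiting for completion, not here."""
        self.queue.append(nbytes)

    def dispatch(self, now):
        """Release every bio whose pacing slot has arrived by `now`;
        return the dispatch timestamps."""
        times = []
        while self.queue and self.next_dispatch <= now:
            nbytes = self.queue.popleft()
            times.append(self.next_dispatch)
            self.next_dispatch += nbytes / self.rate
        return times


q = CgroupBioQueue(rate_bps=10 * 2**20)  # 10MB/s cgroup limit
for _ in range(3):
    q.submit(10 * 2**20)                 # three 10MB bios queued at once
print(q.dispatch(now=100.0))             # -> [0.0, 1.0, 2.0]
```

Note that submit() never sleeps, which is exactly why other IO in the group (say direct IO) can keep the queue full and starve a later metadata submission - the starvation concern raised above.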
* Re: [RFC] writeback and cgroup
From: Jan Kara @ 2012-04-11 17:05 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, containers, cgroups, ctalbott, rni, lsf

On Wed 11-04-12 11:45:31, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> > [..]
>
> Current bio throttling is asynchronous. A process can submit the bio and
> go back and wait for the bio to finish. That bio will be queued at the
> device in a per-cgroup queue and will be dispatched to the device
> according to the configured IO rate for the cgroup.
>
> The additional feature for buffered throttling (which never went
> upstream) was synchronous in nature. That is, we were actively putting
> the writer to sleep on a per-cgroup wait queue in the request queue and
> waking it up when it could do further IO based on the cgroup limits.

Hmm, but then there would be similar starvation issues as with my simple
scheme, because async IO could always use the whole available bandwidth.
Mixing sync & async throttling is really problematic... I'm wondering how
useful the async throttling is, because we will block on request
allocation once there are more than nr_requests pending requests, so at
that point throttling becomes sync anyway.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
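Jan's observation that async throttling degrades to sync can be illustrated with a toy producer/device count: once `pending` reaches nr_requests, request allocation itself blocks the submitter. The model below is illustrative only (the numbers are not kernel defaults):

```python
def count_blocking_submissions(n_bios, nr_requests, submits_per_completion):
    """Count submissions that block on request allocation when a
    producer outruns the throttled device."""
    pending = 0   # requests sitting in the queue
    blocked = 0   # submissions that had to wait for a free request
    credit = 0.0  # fractional completions carried between submissions
    for _ in range(n_bios):
        credit += 1.0 / submits_per_completion
        done = int(credit)
        credit -= done
        pending = max(0, pending - done)
        if pending >= nr_requests:
            blocked += 1   # allocation blocks: throttling is now sync
            pending -= 1   # ...the caller waited for a slot to free up
        pending += 1
    return blocked


# Producer slower than the device: the queue never fills, stays async.
print(count_blocking_submissions(1000, 128, 0.5))  # -> 0
# Producer 4x faster: the queue fills and most later submissions block,
# so the nominally async throttling has become synchronous in practice.
print(count_blocking_submissions(1000, 128, 4.0) > 500)  # -> True
```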
* Re: [RFC] writeback and cgroup
From: Vivek Goyal @ 2012-04-11 17:23 UTC (permalink / raw)
To: Jan Kara
Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> On Wed 11-04-12 11:45:31, Vivek Goyal wrote:
> > On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> > > [..]
> >
> > Current bio throttling is asynchronous. A process can submit the bio
> > and go back and wait for the bio to finish. That bio will be queued at
> > the device in a per-cgroup queue and will be dispatched to the device
> > according to the configured IO rate for the cgroup.
> >
> > The additional feature for buffered throttling (which never went
> > upstream) was synchronous in nature. That is, we were actively putting
> > the writer to sleep on a per-cgroup wait queue in the request queue
> > and waking it up when it could do further IO based on the cgroup
> > limits.
>
> Hmm, but then there would be similar starvation issues as with my simple
> scheme, because async IO could always use the whole available bandwidth.

It depends on how the throttling logic decides to divide bandwidth between
sync and async. I had chosen a round-robin policy of dispatching some bios
and then allowing some async IO, etc. So async IO was not consuming the
whole available bandwidth. We could easily tilt it in favor of sync IO
with a tunable knob.

> Mixing sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is.

If sync throttling is useful, then async throttling has to be useful too,
especially given the fact that async IO often consumes all the bandwidth,
impacting sync latencies.

> Because we will block on request allocation once there are more than
> nr_requests pending requests, so at that point throttling becomes sync
> anyway.

First of all, the flushers will block on nr_requests, not the actual
writers. And secondly, we thought of having per-group request descriptors
so that the writes of one group don't impact others. So once the writes of
a group are backlogged, the flusher can query the congestion status of the
group and not submit any more writes to it. As some writes are already
queued in that group, writes will not be starved.

Well, in the case of deadline, even direct writes go into the write queue,
so theoretically we can hit the starvation issue (the flusher not being
able to submit writes without risking blocking) there too. To avoid this
starvation, ideally we need a per-bdi, per-cgroup flusher, so that the
flusher can simply block if there are not enough request descriptors in
the cgroup.

So trying to throttle buffered writes synchronously in
balance_dirty_pages() at least simplifies the implementation. I like my
implementation better than Fengguang's throttling approach for the simple
reason that buffered writes and direct writes can be subjected to the same
throttling limits instead of separate limits for buffered writes.

Thanks
Vivek
* Re: [RFC] writeback and cgroup 2012-04-11 17:23 ` Vivek Goyal @ 2012-04-11 19:44 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-11 19:44 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Wed 11-04-12 13:23:11, Vivek Goyal wrote: > On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote: > > On Wed 11-04-12 11:45:31, Vivek Goyal wrote: > > > On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote: > > > > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote: > > > > > > > > [..] > > > > > > I have implemented and posted patches for per bdi per cgroup congestion > > > > > > flag. The only problem I see with that is that a group might be congested > > > > > > for a long time because of lots of other IO happening (say direct IO) and > > > > > > if you keep on backing off and never submit the metadata IO (transaction), > > > > > > you get starved. And if you go ahead and submit IO in a congested group, > > > > > > we are back to serialization issue. > > > > > Clearly, we mustn't throttle metadata IO once it gets to the block layer. > > > > > That's why we discuss throttling of processes at transaction start after > > > > > all. But I agree starvation is an issue - I originally thought blk-throttle > > > > > throttles synchronously which wouldn't have starvation issues. > > > > > > Current bio throttling is asynchronous. Process can submit the bio > > > and go back and wait for bio to finish. 
That bio will be queued at device > > > queue in a per cgroup queue and will be dispatched to device according > > > to configured IO rate for cgroup. > > > > > > The additional feature for buffered throttle (which never went upstream), > > > was synchronous in nature. That is we were actively putting writer to > > > sleep on a per cgroup wait queue in the request queue and wake it up when > > > it can do further IO based on cgroup limits. > > Hmm, but then there would be similar starvation issues as with my simple > > scheme because async IO could always use the whole available bandwidth. > > It depends on how the throttling logic decides to divide bandwidth between > sync and async. I had chosen a round robin policy of dispatching some > bios and then allowing some async IO etc. So async IO was not consuming > the whole available bandwidth. We could easily tilt it in favor of sync IO > with a tunable knob. Ah, OK. > > Mixing of sync & async throttling is really problematic... I'm wondering > > how useful the async throttling is. > > If sync throttling is useful, then async throttling has to be useful too? > Especially given the fact that often async IO consumes all bandwidth > impacting sync latencies. I wasn't clear enough I guess. I meant to ask if async throttling brings some serious advantage over the sync one. And I think your answer is that we want to have at least some IO prepared to be submitted to maintain reasonable device utilization. > > Because we will block on request > > allocation once there are more than nr_requests pending requests so at that > > point throttling becomes sync anyway. > > First of all flushers will block on nr_requests and not actual writers. Well, but as soon as you are going to do real IO (not just use the cache), you can block - i.e. direct IO writers, or fsync, or readers can block. > And secondly we thought of having per group request descriptors so that > writes of one group don't impact others. 
So once the writes of a group > are backlogged, then flusher can query the congestion status of group > and not submit any more writes to that group. As some writes are already > queued in that group, writes will not be starved. Well, in case of > deadline, even direct writes go in write queue so theoretically we can > hit starvation issue (flush not being able to submit writes without > risking blocking) there too. > > To avoid this starvation, ideally we need per bdi per cgroup flusher. so > that flusher can simply block if there are not enough request descriptors > in the cgroup. Yeah, on one hand this would simplify some things, but on the other hand you would possibly create a performance issue with interleaving IO from different flusher threads (although that shouldn't be a big problem because they would work on disjoint sets of inodes and should submit large enough chunks) and also fs-wide operations such as sync(2) would need some thinking. Actually handling of sync(2) is interesting on its own because if it should obey throttling limits for each cgroup whose inode is written, it may take *really* long time to complete it... > So trying to throttle buffered writes synchronously in balance_dirty_pages(), > at least simplifies the implementation. I like my implementation better > over Fengguang's approach of throttling for simple reason that buffered > writes and direct writes can be subjected to same throttling limits > instead of separate limits for buffered writes. I guess we all agree (including Fengguang) that this is desirable. -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-11 17:23 ` Vivek Goyal @ 2012-04-17 21:48 ` Tejun Heo 1 sibling, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hello, On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote: > > The additional feature for buffered throttle (which never went upstream), > > was synchronous in nature. That is we were actively putting writer to > > sleep on a per cgroup wait queue in the request queue and wake it up when > > it can do further IO based on cgroup limits. > > Hmm, but then there would be similar starvation issues as with my simple > scheme because async IO could always use the whole available bandwidth. > Mixing of sync & async throttling is really problematic... I'm wondering > how useful the async throttling is. Because we will block on request > allocation once there are more than nr_requests pending requests so at that > point throttling becomes sync anyway. I haven't thought about the interface too much yet but, with the synchronous wait at transaction start, we have information both ways - ie. the lower layer also knows that there are synchronous waiters. At the simplest, not allowing any more async IOs when sync writers exist should solve the starvation issue. As for priority inversion through the shared request pool, it is a problem which needs to be solved regardless of how async IOs are throttled. I'm not determined to which extent yet tho. 
Different cgroups definitely need to be on separate pools, but do we also want to distinguish sync and async, and what about ioprio? Maybe we need a hybrid approach with a larger common pool and reserved ones for each class? Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120411170542.GB16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> 2012-04-11 17:23 ` Vivek Goyal @ 2012-04-17 21:48 ` Tejun Heo 1 sibling, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hello, On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote: > > The additional feature for buffered throttle (which never went upstream), > > was synchronous in nature. That is we were actively putting writer to > > sleep on a per cgroup wait queue in the request queue and wake it up when > > it can do further IO based on cgroup limits. > > Hmm, but then there would be similar starvation issues as with my simple > scheme because async IO could always use the whole available bandwidth. > Mixing of sync & async throttling is really problematic... I'm wondering > how useful the async throttling is. Because we will block on request > allocation once there are more than nr_requests pending requests so at that > point throttling becomes sync anyway. I haven't thought about the interface too much yet but, with the synchronous wait at transaction start, we have information both ways - ie. lower layer also knows that there are synchrnous waiters. At the simplest, not allowing any more async IOs when sync writers exist should solve the starvation issue. As for priority inversion through shared request pool, it is a problem which needs to be solved regardless of how async IOs are throttled. I'm not determined to which extent yet tho. Different cgroups definitely need to be on separate pools but do we also want distinguish sync and async and what about ioprio? Maybe we need a bybrid approach with larger common pool and reserved ones for each class? Thanks. 
-- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-17 21:48 ` Tejun Heo 0 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hello, On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote: > > The additional feature for buffered throttle (which never went upstream), > > was synchronous in nature. That is we were actively putting writer to > > sleep on a per cgroup wait queue in the request queue and wake it up when > > it can do further IO based on cgroup limits. > > Hmm, but then there would be similar starvation issues as with my simple > scheme because async IO could always use the whole available bandwidth. > Mixing of sync & async throttling is really problematic... I'm wondering > how useful the async throttling is. Because we will block on request > allocation once there are more than nr_requests pending requests so at that > point throttling becomes sync anyway. I haven't thought about the interface too much yet but, with the synchronous wait at transaction start, we have information both ways - ie. lower layer also knows that there are synchrnous waiters. At the simplest, not allowing any more async IOs when sync writers exist should solve the starvation issue. As for priority inversion through shared request pool, it is a problem which needs to be solved regardless of how async IOs are throttled. I'm not determined to which extent yet tho. Different cgroups definitely need to be on separate pools but do we also want distinguish sync and async and what about ioprio? Maybe we need a bybrid approach with larger common pool and reserved ones for each class? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-17 21:48 ` Tejun Heo 0 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k, andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hello, On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote: > > The additional feature for buffered throttle (which never went upstream), > > was synchronous in nature. That is we were actively putting writer to > > sleep on a per cgroup wait queue in the request queue and wake it up when > > it can do further IO based on cgroup limits. > > Hmm, but then there would be similar starvation issues as with my simple > scheme because async IO could always use the whole available bandwidth. > Mixing of sync & async throttling is really problematic... I'm wondering > how useful the async throttling is. Because we will block on request > allocation once there are more than nr_requests pending requests so at that > point throttling becomes sync anyway. I haven't thought about the interface too much yet but, with the synchronous wait at transaction start, we have information both ways - ie. lower layer also knows that there are synchrnous waiters. At the simplest, not allowing any more async IOs when sync writers exist should solve the starvation issue. As for priority inversion through shared request pool, it is a problem which needs to be solved regardless of how async IOs are throttled. I'm not determined to which extent yet tho. 
Different cgroups definitely need to be on separate pools but do we also want to distinguish sync and async and what about ioprio? Maybe we need a hybrid approach with a larger common pool and reserved ones for each class? Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-17 21:48 ` Tejun Heo @ 2012-04-18 18:18 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-18 18:18 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Tue, Apr 17, 2012 at 02:48:31PM -0700, Tejun Heo wrote: [..] > As for priority inversion through shared request pool, it is a problem > which needs to be solved regardless of how async IOs are throttled. > I'm not determined to which extent yet tho. Different cgroups > definitely need to be on separate pools but do we also want > distinguish sync and async and what about ioprio? Maybe we need a > bybrid approach with larger common pool and reserved ones for each > class? Currently we have a global pool with separate limits for sync and async, and there is no consideration of ioprio. I think to keep it simple we can just extend the same notion to keep a per cgroup pool with internal limits on sync/async requests to make sure sync IO does not get serialized behind async IO. Personally I am not too worried about async IO prio. It has never worked. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-11 15:45 ` Vivek Goyal 2012-04-11 19:22 ` Jan Kara 2012-04-14 12:25 ` [Lsf] " Peter Zijlstra 2 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-11 15:45 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote: > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote: > > [..] > > > I have implemented and posted patches for per bdi per cgroup congestion > > > flag. The only problem I see with that is that a group might be congested > > > for a long time because of lots of other IO happening (say direct IO) and > > > if you keep on backing off and never submit the metadata IO (transaction), > > > you get starved. And if you go ahead and submit IO in a congested group, > > > we are back to serialization issue. > > Clearly, we mustn't throttle metadata IO once it gets to the block layer. > > That's why we discuss throttling of processes at transaction start after > > all. But I agree starvation is an issue - I originally thought blk-throttle > > throttles synchronously which wouldn't have starvation issues. Current bio throttling is asynchronous. Process can submit the bio and go back and wait for bio to finish. That bio will be queued at device queue in a per cgroup queue and will be dispatched to device according to configured IO rate for cgroup. The additional feature for buffered throttle (which never went upstream), was synchronous in nature. 
That is, we were actively putting the writer to sleep on a per cgroup wait queue in the request queue and waking it up when it could do further IO based on cgroup limits. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-11 15:45 ` Vivek Goyal @ 2012-04-11 19:22 ` Jan Kara 2012-04-14 12:25 ` [Lsf] " Peter Zijlstra 2 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-11 19:22 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu On Wed 11-04-12 11:40:05, Vivek Goyal wrote: > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote: > > [..] > > > I have implemented and posted patches for per bdi per cgroup congestion > > > flag. The only problem I see with that is that a group might be congested > > > for a long time because of lots of other IO happening (say direct IO) and > > > if you keep on backing off and never submit the metadata IO (transaction), > > > you get starved. And if you go ahead and submit IO in a congested group, > > > we are back to serialization issue. > > Clearly, we mustn't throttle metadata IO once it gets to the block layer. > > That's why we discuss throttling of processes at transaction start after > > all. But I agree starvation is an issue - I originally thought blk-throttle > > throttles synchronously which wouldn't have starvation issues. But when > > that's not the case things are a bit more tricky. We could treat > > transaction start as an IO of some size (since we already have some > > estimation how large a transaction will be when we are starting it) and let > > the transaction start only when our "virtual" IO would be submitted but > > I feel that gets maybe too complicated... 
Maybe we could just delay the > > transaction start by the amount reported from blk-throttle layer? Something > > along your callback for throttling you implemented? > > I think now I have lost you. It probably stems from the fact that I don't > know much about transactions and filesystem. > > So all the metadata IO will happen thorough journaling thread and that > will be in root group which should remain unthrottled. So any journal > IO going to disk should remain unthrottled. Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't have to have the journal thread (as is the case of reiserfs where random writer may end up doing commit) but let's not complicate things unnecessarily. > Now, IIRC, fsync problem with throttling was that we had opened a > transaction but could not write it back to disk because we had to > wait for all the cached data to go to disk (which is throttled). So > my question is, can't we first wait for all the data to be flushed > to disk and then open a transaction for metadata. metadata will be > unthrottled so filesystem will not have to do any tricks like bdi is > congested or not. Actually that's what's happening. We first do filemap_write_and_wait() which syncs all the data and then we go and force transaction commit to make sure all metadata got to stable storage. The problem is that writeout of data may need to allocate new blocks and that starts a transaction and while the transaction is started we may need to do some reads (e.g. of bitmaps etc.) which may be throttled and at that moment the whole filesystem is blocked. I don't remember the stack traces you showed me so I'm not sure if this is what you observed but it's certainly one possible scenario. The reason why fsync triggers problems is simply that it's the only place where a process normally does a significant amount of writing. In most cases flusher thread / journal thread do it so this effect is not visible. 
And to pre-empt your question, it would be rather hard to avoid IO while the transaction is started due to locking. > [..] > > > I guess throttling at bdi layer will take care of network filesystem > > > case too? > > Yes. At least for client side. On sever side Steve wants server to have > > insight into how much IO we could push in future so that it can limit > > number of outstanding requests if I understand him right. I'm not sure we > > really want / are able to provide this amount of knowledge to filesystems > > even less userspace... > > I am not sure what does it mean but server could simply query the bdi > and read configured rate and then it knows at what rate IO will go to > disk and make predictions about future? Yeah, that would work if we had the current bandwidth for current cgroup exposed in bdi. > > > Also per bdi limit mechanism will not solve the issue of global throttling > > > where in case of btrfs an IO might go to multiple bdi's. So throttling limits > > > are not total but per bdi. > > Well, btrfs plays tricks with bdi's but there is a special bdi called > > "btrfs" which backs the whole filesystem and that is what's put in > > sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a > > global bdi to work with. > > Ok, that's good to know. How would we configure this special bdi? I am > assuming there is no backing device visible in /sys/block/<device>/queue/? > Same is true for network file systems. Where should be the backing device visible? Now it's me who is lost :) Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup [not found] ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-11 15:45 ` Vivek Goyal 2012-04-11 19:22 ` Jan Kara @ 2012-04-14 12:25 ` Peter Zijlstra 2 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-14 12:25 UTC (permalink / raw) To: Vivek Goyal Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote: > > Ok, that's good to know. How would we configure this special bdi? I am > assuming there is no backing device visible in /sys/block/<device>/queue/? > Same is true for network file systems.

root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
ls: cannot access /sys/class/bdi/0:20/: No such file or directory
total 0
drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
-rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
-rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
-rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
-rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent

^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-11 19:22 ` Jan Kara (?) @ 2012-04-12 20:37 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-12 20:37 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote: [..] > > > Well, btrfs plays tricks with bdi's but there is a special bdi called > > > "btrfs" which backs the whole filesystem and that is what's put in > > > sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a > > > global bdi to work with. > > > > Ok, that's good to know. How would we configure this special bdi? I am > > assuming there is no backing device visible in /sys/block/<device>/queue/? > > Same is true for network file systems. > Where should be the backing device visible? Now it's me who is lost :) I mean how are we supposed to put cgroup throttling rules using cgroup interface for network filesystems and for btrfs global bdi. Using "dev_t" associated with bdi? I see that all the bdi's are showing up in /sys/class/bdi, but how do I know which one I am interested in or which one belongs to the filesystem I am interested in putting a throttling rule on. For block devices, we simply use "major:min limit" format to write to a cgroup file and this configuration will sit in one of the per queue per cgroup data structures. I am assuming that when you say throttling should happen at bdi, you are thinking of maintaining per cgroup per bdi data structures and user is somehow supposed to pass "bdi_maj:bdi_min limit" through cgroup files? 
If yes, how does one map a filesystem's bdi we want to put rules on? Also, at request queue level we have bios and we throttle bios. At bdi level, I think there are no bios yet. So somehow we got to deal with pages. Not sure how exactly will throttling happen. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-12 20:37 ` Vivek Goyal @ 2012-04-12 20:51 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-12 20:51 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hello, Vivek. On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote: > I mean how are we supposed to put cgroup throttling rules using cgroup > interface for network filesystems and for btrfs global bdi. Using "dev_t" > associated with bdi? I see that all the bdi's are showing up in > /sys/class/bdi, but how do I know which one I am intereste in or which > one belongs to filesystem I am interestd in putting throttling rule on. > > For block devices, we simply use "major:min limit" format to write to > a cgroup file and this configuration will sit in one of the per queue > per cgroup data structure. > > I am assuming that when you say throttling should happen at bdi, you > are thinking of maintaining per cgroup per bdi data structures and user > is somehow supposed to pass "bdi_maj:bdi_min limit" through cgroup files? > If yes, how does one map a filesystem's bdi we want to put rules on? I think you're worrying way too much. One of the biggest reasons we have layers and abstractions is to avoid worrying about everything from everywhere. Let block device implement per-device limits. Let writeback work from the backpressure it gets from the relevant IO channel, bdi-cgroup combination in this case. For stacked or combined devices, let the combining layer deal with piping the congestion information. If it's per-file split, the combined bdi can simply forward information from the matching underlying device. If the file is striped / duplicated somehow, the *only* layer which knows what to do is and should be the layer performing the striping and duplication. 
There's no need to worry about it from blkcg, and if you get the layering correct it isn't difficult to slice such logic in between. In fact, most of it (backpressure propagation) would just happen as part of the usual buffering between layers. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
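Tejun's backpressure model, in which each layer is a bounded buffer whose fullness throttles the submitter above it and whose completions from below relieve the pressure, can be sketched as a toy model. This is purely illustrative; the names are invented and it is not kernel code.

```python
from collections import deque

class BoundedBuffer:
    """Toy model of one layer in the writeback chain: a bounded
    buffer whose fullness throttles the layer above it."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def congested(self):
        # The upper layer checks this before submitting more work,
        # loosely analogous to a bdi's congestion state.
        return len(self.items) >= self.capacity

    def submit(self, item):
        if self.congested():
            return False  # backpressure: the submitter must back off
        self.items.append(item)
        return True

    def complete_one(self):
        # Device-side completion drains the buffer, relieving pressure.
        return self.items.popleft() if self.items else None

# A slow device bounds how fast the originator can push work down:
buf = BoundedBuffer(capacity=2)
accepted = [buf.submit(i) for i in range(4)]  # only the first 2 fit
buf.complete_one()                            # a completion frees a slot
accepted.append(buf.submit(99))               # now accepted again
```

The same shape repeats at every layer of the chain described in the thread: the dirty-page pool throttles dirtiers via balance_dirty_pages(), the request queue throttles the flusher, and the device drains the request queue.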
* Re: [RFC] writeback and cgroup 2012-04-12 20:51 ` Tejun Heo @ 2012-04-14 14:36 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-14 14:36 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal [-- Attachment #1: Type: text/plain, Size: 4887 bytes --] On Thu, Apr 12, 2012 at 01:51:48PM -0700, Tejun Heo wrote: > Hello, Vivek. > > On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote: > > I mean how are we supposed to put cgroup throttling rules using cgroup > > interface for network filesystems and for btrfs global bdi. Using "dev_t" > > associated with bdi? I see that all the bdi's are showing up in > > /sys/class/bdi, but how do I know which one I am intereste in or which > > one belongs to filesystem I am interestd in putting throttling rule on. > > > > For block devices, we simply use "major:min limit" format to write to > > a cgroup file and this configuration will sit in one of the per queue > > per cgroup data structure. > > > > I am assuming that when you say throttling should happen at bdi, you > > are thinking of maintaining per cgroup per bdi data structures and user > > is somehow supposed to pass "bdi_maj:bdi_min limit" through cgroup files? > > If yes, how does one map a filesystem's bdi we want to put rules on? > > I think you're worrying way too much. One of the biggest reasons we > have layers and abstractions is to avoid worrying about everything > from everywhere. Let block device implement per-device limits. 
Let > writeback work from the backpressure it gets from the relevant IO > channel, bdi-cgroup combination in this case. > > For stacked or combined devices, let the combining layer deal with > piping the congestion information. If it's per-file split, the > combined bdi can simply forward information from the matching > underlying device. If the file is striped / duplicated somehow, the > *only* layer which knows what to do is and should be the layer > performing the striping and duplication. There's no need to worry > about it from blkcg and if you get the layering correct it isn't > difficult to slice such logic inbetween. In fact, most of it > (backpressure propagation) would just happen as part of the usual > buffering between layers. Yeah, the backpressure idea would work nicely with all possible intermediate stacking between the bdi and the leaf devices. In my attempt to do combined IO bandwidth control for - buffered writes, in balance_dirty_pages() - direct IO, in the cfq IO scheduler I have had to look into the cfq code over the past days to get an idea of how the two throttling layers can cooperate (and suffer the pains arising from the layering violations). It's also rather tricky to get two previously independent throttling mechanisms to work seamlessly with each other to provide the desired _unified_ user interface. It took a lot of reasoning and experiments to work the basic scheme out... But here is the first result. The attached graph shows the progress of 4 tasks: - cgroup A: 1 direct dd + 1 buffered dd - cgroup B: 1 direct dd + 1 buffered dd The 4 tasks are mostly progressing at the same pace. The top 2 smoother lines are for the buffered dirtiers. The bottom 2 lines are for the direct writers. As you may notice, the two direct writers stall once or twice, which increases the gaps between the lines. Otherwise, the algorithm is working as expected to distribute the bandwidth to each task.
The current code's target is to satisfy the more realistic user demand of distributing bandwidth equally to each cgroup and, inside each cgroup, distributing bandwidth equally between buffered/direct writes. On top of that, weights can be specified to change the default distribution. The implementation involves adding a "weight for direct IO" to the cfq groups and a "weight for buffered writes" to the root cgroup. Note that the current cfq proportional IO controller does not offer explicit control over the direct:buffered ratio. When there are both direct/buffered writers in the cgroup, balance_dirty_pages() will kick in and adjust the weights for cfq to execute. Note that cfq will continue to send all flusher IOs to the root cgroup. balance_dirty_pages() will compute the overall async weight for it, so that in the above test case the computed weights will be - 1000 async weight for the root cgroup (2 buffered dds) - 500 dio weight for cgroup A (1 direct dd) - 500 dio weight for cgroup B (1 direct dd) The second graph shows the result for another test case: - cgroup A, weight 300: 1 buffered cp - cgroup B, weight 600: 1 buffered dd + 1 direct dd - cgroup C, weight 300: 1 direct dd which is also working as expected. Once the cfq properly grants the total async IO share to the flusher, balance_dirty_pages() will then do its original job of distributing the buffered write bandwidth among the buffered dd tasks. It will have to assume that the devices under the same bdi are "symmetric". It also needs further stats feedback on IOPS or disk time in order to do IOPS/time based IO distribution. Anyway, it would be interesting to see how far this scheme can go. I'll clean up the code and post it soon.
Thanks, Fengguang [-- Attachment #2: balance_dirty_pages-task-bw.png --] [-- Type: image/png, Size: 72619 bytes --] [-- Attachment #3: balance_dirty_pages-task-bw.png --] [-- Type: image/png, Size: 69646 bytes --] ^ permalink raw reply [flat|nested] 261+ messages in thread
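The weight numbers in the message above can be reproduced with a small sketch. This is an illustrative model of the described split, not the actual patch; `compute_weights` is an invented helper that divides each cgroup's weight equally among its writers and pools all buffered shares into a single async weight on the root cgroup, where the flusher's IO is issued.

```python
def compute_weights(cgroups, group_weight=1000):
    """cgroups maps name -> (n_direct, n_buffered) writer counts.
    Returns per-cgroup dio weights and the root cgroup's async weight."""
    dio = {}
    root_async = 0
    for name, (n_direct, n_buffered) in cgroups.items():
        writers = n_direct + n_buffered
        share = group_weight // writers if writers else 0
        dio[name] = share * n_direct        # stays with the cgroup
        root_async += share * n_buffered    # delegated to the flusher
    return dio, root_async

# First test case above: each cgroup runs 1 direct dd + 1 buffered dd.
dio, root_async = compute_weights({"A": (1, 1), "B": (1, 1)})
# dio == {"A": 500, "B": 500}, root_async == 1000
```

Each cgroup's buffered half is credited to the root cgroup because cfq sends all flusher IO there; balance_dirty_pages() then splits that pooled bandwidth back among the buffered dirtiers.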
* Re: [RFC] writeback and cgroup 2012-04-14 14:36 ` Fengguang Wu @ 2012-04-16 14:57 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-16 14:57 UTC (permalink / raw) To: Fengguang Wu Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: [..] > Yeah the backpressure idea would work nicely with all possible > intermediate stacking between the bdi and leaf devices. In my attempt > to do combined IO bandwidth control for > > - buffered writes, in balance_dirty_pages() > - direct IO, in the cfq IO scheduler > > I have to look into the cfq code in the past days to get an idea how > the two throttling layers can cooperate (and suffer from the pains > arise from the violations of layers). It's also rather tricky to get > two previously independent throttling mechanisms to work seamlessly > with each other for providing the desired _unified_ user interface. It > took a lot of reasoning and experiments to work the basic scheme out... > > But here is the first result. The attached graph shows progress of 4 > tasks: > - cgroup A: 1 direct dd + 1 buffered dd > - cgroup B: 1 direct dd + 1 buffered dd > > The 4 tasks are mostly progressing at the same pace. The top 2 > smoother lines are for the buffered dirtiers. The bottom 2 lines are > for the direct writers. As you may notice, the two direct writers are > somehow stalled for 1-2 times, which increases the gaps between the > lines. Otherwise, the algorithm is working as expected to distribute > the bandwidth to each task. > > The current code's target is to satisfy the more realistic user demand > of distributing bandwidth equally to each cgroup, and inside each > cgroup, distribute bandwidth equally to buffered/direct writes. On top > of which, weights can be specified to change the default distribution. 
> > The implementation involves adding "weight for direct IO" to the cfq > groups and "weight for buffered writes" to the root cgroup. Note that > current cfq proportional IO conroller does not offer explicit control > over the direct:buffered ratio. > > When there are both direct/buffered writers in the cgroup, > balance_dirty_pages() will kick in and adjust the weights for cfq to > execute. Note that cfq will continue to send all flusher IOs to the > root cgroup. balance_dirty_pages() will compute the overall async > weight for it so that in the above test case, the computed weights > will be I think having separate weights for sync IO groups and async IO is not very appealing. There should be one notion of group weight, with bandwidth distributed among groups according to their weight. Now, one can argue that within a group there might be one knob in CFQ which allows changing the share of sync/async IO. Also, Tejun and Jan have expressed the desire that once we have figured out a way to communicate the submitter's context for async IO, we would like to account that IO in the associated cgroup instead of the root cgroup (as we do today). > > - 1000 async weight for the root cgroup (2 buffered dds) > - 500 dio weight for cgroup A (1 direct dd) > - 500 dio weight for cgroup B (1 direct dd) > > The second graph shows result for another test case: > - cgroup A, weight 300: 1 buffered cp > - cgroup B, weight 600: 1 buffered dd + 1 direct dd > - cgroup C, weight 300: 1 direct dd > which is also working as expected. > > Once the cfq properly grants total async IO share to the flusher, > balance_dirty_pages() will then do its original job of distributing > the buffered write bandwidth among the buffered dd tasks. > > It will have to assume that the devices under the same bdi are > "symmetry". It also needs further stats feedback on IOPS or disk time > in order to do IOPS/time based IO distribution. Anyway it would be > interesting to see how far this scheme can go.
I'll cleanup the code > and post it soon. Your proposal relies on a few things: - Bandwidth needs to be divided equally among sync and async IO. - Flusher thread async IO will always go to the root cgroup. - I am not sure how this scheme is going to work when we introduce hierarchical blkio cgroups. - cgroup weights for sync IO seem to be controlled by the user, while somehow the root cgroup weight is controlled silently by this async IO logic. Overall this sounds like a very odd design to me. I am not sure what we are achieving by this. In the current scheme one should be able to just adjust the weight of the root cgroup using the cgroup interface and achieve the same results you are showing. So where is the need to change it dynamically inside the kernel? Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-16 14:57 ` Vivek Goyal @ 2012-04-24 11:33 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-24 11:33 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hi Vivek, On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote: > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: > > [..] > > Yeah the backpressure idea would work nicely with all possible > > intermediate stacking between the bdi and leaf devices. In my attempt > > to do combined IO bandwidth control for > > > > - buffered writes, in balance_dirty_pages() > > - direct IO, in the cfq IO scheduler > > > > I have to look into the cfq code in the past days to get an idea how > > the two throttling layers can cooperate (and suffer from the pains > > arise from the violations of layers). It's also rather tricky to get > > two previously independent throttling mechanisms to work seamlessly > > with each other for providing the desired _unified_ user interface. It > > took a lot of reasoning and experiments to work the basic scheme out... > > > > But here is the first result. The attached graph shows progress of 4 > > tasks: > > - cgroup A: 1 direct dd + 1 buffered dd > > - cgroup B: 1 direct dd + 1 buffered dd > > > > The 4 tasks are mostly progressing at the same pace. The top 2 > > smoother lines are for the buffered dirtiers. The bottom 2 lines are > > for the direct writers. As you may notice, the two direct writers are > > somehow stalled for 1-2 times, which increases the gaps between the > > lines. Otherwise, the algorithm is working as expected to distribute > > the bandwidth to each task. 
> > > > The current code's target is to satisfy the more realistic user demand > > of distributing bandwidth equally to each cgroup, and inside each > > cgroup, distribute bandwidth equally to buffered/direct writes. On top > > of which, weights can be specified to change the default distribution. > > > > The implementation involves adding "weight for direct IO" to the cfq > > groups and "weight for buffered writes" to the root cgroup. Note that > > current cfq proportional IO conroller does not offer explicit control > > over the direct:buffered ratio. > > > > When there are both direct/buffered writers in the cgroup, > > balance_dirty_pages() will kick in and adjust the weights for cfq to > > execute. Note that cfq will continue to send all flusher IOs to the > > root cgroup. balance_dirty_pages() will compute the overall async > > weight for it so that in the above test case, the computed weights > > will be > > I think having separate weigths for sync IO groups and async IO is not > very appealing. There should be one notion of group weight and bandwidth > distrubuted among groups according to their weight. There has to be some scheme, either explicit or implicit. Maybe you are bearing in mind some "equal split among queues" policy? For example, if the cgroup has 9 active sync queues and 1 async queue, split the weight equally among the 10 queues, so the sync IOs get a 90% share and the async writes get a 10% share. For dirty throttling w/o cgroup awareness, balance_dirty_pages() splits the writeout bandwidth equally among all dirtier tasks. Since cfq works with queues, it seems most natural for it to do an equal split among all queues (inside the cgroup). I'm not sure whether, when there are N dd tasks doing direct IO, cfq will continuously run N sync queues for them (without many dynamic queue deletions and recreations). If that is the case, it should be trivial to support the queue-based fair split in the global async queue scheme.
Otherwise I'll have some trouble detecting the N value when trying to do the N:1 sync:async weight split. > Now one can argue that with-in a group, there might be one knob in CFQ > which allows to change the share or sync/async IO. Yeah. I suspect typical users don't care about the split policy or fairness inside the cgroup; otherwise there may be complaints about any existing policy: "I want it split this way", "I want it that way"... ;-) Anyway, I'm not sure about the possible use cases. > Also Tejun and Jan have expressed the desire that once we have figured > out a way to communicate the submitter's context for async IO, we would > like to account that IO in associated cgroup instead of root cgroup (as > we do today). Understood. Accounting should always be attributed to the corresponding cgroup. I'll also need this to send feedback information to the async IO submitters' cgroups. > > - 1000 async weight for the root cgroup (2 buffered dds) > > - 500 dio weight for cgroup A (1 direct dd) > > - 500 dio weight for cgroup B (1 direct dd) > > > > The second graph shows result for another test case: > > - cgroup A, weight 300: 1 buffered cp > > - cgroup B, weight 600: 1 buffered dd + 1 direct dd > > - cgroup C, weight 300: 1 direct dd > > which is also working as expected. > > > > Once the cfq properly grants total async IO share to the flusher, > > balance_dirty_pages() will then do its original job of distributing > > the buffered write bandwidth among the buffered dd tasks. > > > > It will have to assume that the devices under the same bdi are > > "symmetry". It also needs further stats feedback on IOPS or disk time > > in order to do IOPS/time based IO distribution. Anyway it would be > > interesting to see how far this scheme can go. I'll cleanup the code > > and post it soon. > > Your proposal relies on few things. > > - Bandwidth needs to be divided eually among sync and async IO. Yeah, balance_dirty_pages() always works on the basis of bandwidth.
The plan is that once we get the feedback information on each stream's bandwidth:disk_time (or IOPS) ratio, the bandwidth target can be adjusted to achieve disk-time or IOPS based fair shares among the buffered dirtiers. For the sync:async split, it's operating on the cfqg->weight, so it's automatically disk time based. Look at this graph: the 4 dd tasks are granted the same weight (2 of them are buffered writes). I guess the 2 buffered dd tasks managed to progress much faster than the 2 direct dd tasks simply because the async IOs are much more efficient than the bs=64k direct IOs. https://github.com/fengguang/io-controller-tests/raw/master/log/bay/xfs/mixed-write-2.2012-04-19-10-42/balance_dirty_pages-task-bw.png > - Flusher thread async IO will always to go to root cgroup. Right. This is actually my main target: to avoid splitting up the async streams throughout the IO path, for the sake of performance. > - I am not sure how this scheme is going to work when we introduce > hierarchical blkio cgroups. I think it's still viable. balance_dirty_pages() works by estimating the N (number of dd tasks) value and splitting the writeout bandwidth equally among the tasks: task_ratelimit = write_bandwidth / N It becomes a proportional weight IO controller if we change the formula to task_ratelimit = weight * write_bandwidth / N_w Here lies the beauty of the bdi_update_dirty_ratelimit() algorithm: it can automatically adapt N to the proper "weighted" N_w to keep things in balance, given whatever weights are applied to each task. If we further use blkcg_ratelimit = weight * write_bandwidth / N_w task_ratelimit = weight * blkcg_ratelimit / M_w it's turned into a cgroup IO controller. This change further makes it a hierarchical IO controller: blkcg_ratelimit = weight * parent_blkcg_ratelimit / M_w We'll also need to hierarchically decompose the async weights from inner cgroup levels to outer levels, and finally add them to the root cgroup that holds the async queue.
This looks feasible, too.

> - cgroup weights for sync IO seem to be controlled by the user, and
>   somehow the root cgroup weight seems to be controlled by this async
>   IO logic silently.

In the current state I do assume no IO tasks in the root cgroup except
for the flusher. However in general the root cgroup can be treated the
same as the other cgroups: its weight can also be split up into a dio
weight and an async weight. The general idea is

- cfqg->weight is given by the user
- cfqg->dio_weight is used for sync slices in the vdisktime calculation
- total_async_weight collects all async IO weights from each cgroup,
  including the root cgroup. They are the "credits" for the flusher for
  doing the async IOs on behalf of all the cgroups.

> Overall this sounds like a very odd design to me. I am not sure what
> we are achieving by this. In the current scheme one should be able to
> just adjust the weight of the root cgroup using the cgroup interface
> and achieve the same results which you are showing. So where is the
> need of dynamically changing it inside the kernel.

The "dynamically changing weights" are for the in-cgroup equal split
between sync/async IOs. It does feel like an arbitrarily added
policy..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 261+ messages in thread
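This weight decomposition can be sketched roughly as follows (plain
Python with dicts standing in for the cfq data structures; the
equal-per-task in-cgroup sync/async split is the policy discussed
above, and all names are illustrative, not the proposed kernel code):

```python
# Sketch of the proposed weight decomposition: each cgroup's
# user-given weight is split between its direct writers (kept as the
# cgroup's dio weight) and its buffered writers (credited to the
# flusher's total async weight in the root cgroup).

def decompose_weights(cgroups):
    """cgroups: {name: {"weight": user_weight,
                        "direct_tasks": n, "buffered_tasks": m}}"""
    dio_weight = {}
    total_async_weight = 0
    for name, cg in cgroups.items():
        tasks = cg["direct_tasks"] + cg["buffered_tasks"]
        if tasks == 0:
            dio_weight[name] = cg["weight"]
            continue
        # In-cgroup equal split among the cgroup's IO tasks.
        async_share = cg["weight"] * cg["buffered_tasks"] // tasks
        dio_weight[name] = cg["weight"] - async_share
        total_async_weight += async_share
    return dio_weight, total_async_weight

# The first test case from the thread: cgroups A and B each run
# 1 direct dd + 1 buffered dd, with equal user weights of 1000.
dio, total_async = decompose_weights({
    "A": {"weight": 1000, "direct_tasks": 1, "buffered_tasks": 1},
    "B": {"weight": 1000, "direct_tasks": 1, "buffered_tasks": 1},
})
print(dio, total_async)   # {'A': 500, 'B': 500} 1000
```

This reproduces the computed weights quoted earlier in the thread:
1000 async weight for the root cgroup's flusher, and 500 dio weight
each for cgroups A and B.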
* Re: [RFC] writeback and cgroup
@ 2012-04-24 14:56 ` Jan Kara
  0 siblings, 0 replies; 261+ messages in thread
From: Jan Kara @ 2012-04-24 14:56 UTC (permalink / raw)
To: Fengguang Wu
Cc: Vivek Goyal, Tejun Heo, Jan Kara, Jens Axboe, linux-mm,
    sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
    kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf

On Tue 24-04-12 19:33:40, Wu Fengguang wrote:
> On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> > [..]
> > > Yeah the backpressure idea would work nicely with all possible
> > > intermediate stacking between the bdi and leaf devices. In my
> > > attempt to do combined IO bandwidth control for
> > >
> > > - buffered writes, in balance_dirty_pages()
> > > - direct IO, in the cfq IO scheduler
> > >
> > > I have had to look into the cfq code in the past days to get an
> > > idea of how the two throttling layers can cooperate (and suffer
> > > from the pains arising from the violations of layers). It's also
> > > rather tricky to get two previously independent throttling
> > > mechanisms to work seamlessly with each other while providing the
> > > desired _unified_ user interface. It took a lot of reasoning and
> > > experiments to work the basic scheme out...
> > >
> > > But here is the first result. The attached graph shows the
> > > progress of 4 tasks:
> > > - cgroup A: 1 direct dd + 1 buffered dd
> > > - cgroup B: 1 direct dd + 1 buffered dd
> > >
> > > The 4 tasks are mostly progressing at the same pace. The top 2
> > > smoother lines are for the buffered dirtiers. The bottom 2 lines
> > > are for the direct writers.
> > > As you may notice, the two direct writers are somehow stalled 1-2
> > > times, which increases the gaps between the lines. Otherwise, the
> > > algorithm is working as expected to distribute the bandwidth to
> > > each task.
> > >
> > > The current code's target is to satisfy the more realistic user
> > > demand of distributing bandwidth equally to each cgroup, and
> > > inside each cgroup, distributing bandwidth equally to
> > > buffered/direct writes. On top of which, weights can be specified
> > > to change the default distribution.
> > >
> > > The implementation involves adding a "weight for direct IO" to the
> > > cfq groups and a "weight for buffered writes" to the root cgroup.
> > > Note that the current cfq proportional IO controller does not
> > > offer explicit control over the direct:buffered ratio.
> > >
> > > When there are both direct/buffered writers in the cgroup,
> > > balance_dirty_pages() will kick in and adjust the weights for cfq
> > > to execute. Note that cfq will continue to send all flusher IOs to
> > > the root cgroup. balance_dirty_pages() will compute the overall
> > > async weight for it so that in the above test case, the computed
> > > weights will be
> >
> > I think having separate weights for sync IO groups and async IO is
> > not very appealing. There should be one notion of group weight and
> > bandwidth distributed among groups according to their weight.
>
> There has to be some scheme, either explicit or implicit. Maybe you
> are bearing in mind some "equal split among queues" policy? For
> example, if the cgroup has 9 active sync queues and 1 async queue,
> split the weight equally among the 10 queues? So the sync IOs get a
> 90% share, and the async writes get a 10% share.

Maybe I misunderstand, but there doesn't have to be (and in fact isn't)
any split among sync / async IO in CFQ. At each moment, we choose the
queue with the highest score and dispatch a couple of requests from it.
Then we go and choose again.
The score of a queue depends on several factors (like the age of its
requests, whether the queue is sync or async, IO priority, etc.).
Practically, over a longer period the system will stabilize on some
ratio, but that's dependent on the load, so your system should not
impose some artificial direct/buffered split but rather somehow deal
with the reality of how the IO scheduler decides to dispatch
requests...

> For dirty throttling w/o cgroup awareness, balance_dirty_pages()
> splits the writeout bandwidth equally among all dirtier tasks. Since
> cfq works with queues, it seems most natural for it to do an equal
> split among all queues (inside the cgroup).

Well, but we also have IO priorities which change which queue should
get preference.

> I'm not sure that when there are N dd tasks doing direct IO, cfq will
> continuously run N sync queues for them (without many dynamic queue
> deletions and recreations). If that is the case, it should be trivial
> to support the queue based fair split in the global async queue
> scheme. Otherwise I'll have some trouble detecting the N value when
> trying to do the N:1 sync:async weight split.

Also, sync queues for several processes can get merged when CFQ
observes these processes cooperating on one area of the disk, and get
split again when the processes stop cooperating. I don't think you
really want to second-guess what CFQ does inside...

> Look at this graph: the 4 dd tasks are granted the same weight (2 of
> them are buffered writers). I guess the 2 buffered dd tasks managed
> to progress much faster than the 2 direct dd tasks just because the
> async IOs are much more efficient than the bs=64k direct IOs.

Likely because 64k is too low to get good bandwidth with direct IO. If
it were 4M, I believe you would get similar throughput for buffered and
direct IO. So essentially you are right: small IO benefits from caching
effects, since they allow you to submit larger requests to the device,
which is more efficient.
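Jan's description of CFQ's dispatch can be caricatured in a few lines
of plain Python. The scoring function below is entirely invented for
illustration (real CFQ uses vdisktime, slices, and many heuristics);
only the structure mirrors his point: there is no fixed sync/async
split, just a repeated pick-the-highest-score choice, and the ratio
that emerges depends on the load:

```python
# Toy caricature of score-based dispatch: each round, pick the queue
# with the highest score and dispatch a small batch from it. Scores
# grow with request age, so a starved async queue eventually wins even
# against a constantly-backlogged sync queue.

def score(queue, now):
    s = now - queue["oldest_request"]    # older requests score higher
    s += 100 if queue["sync"] else 0     # sync queues get a boost
    s += 10 * (7 - queue["ioprio"])      # higher priority, higher score
    return s

def dispatch_rounds(queues, rounds, batch=2):
    dispatched = {q["name"]: 0 for q in queues}
    for now in range(rounds):
        best = max(queues, key=lambda q: score(q, now))
        dispatched[best["name"]] += batch
        best["oldest_request"] = now     # its backlog is now fresher
    return dispatched

queues = [
    {"name": "sync_dd", "sync": True,  "ioprio": 4, "oldest_request": 0},
    {"name": "flusher", "sync": False, "ioprio": 4, "oldest_request": 0},
]
print(dispatch_rounds(queues, 300))   # {'sync_dd': 596, 'flusher': 4}
```

Even in this crude model the sync:async ratio is an emergent property
of the scoring and the offered load, not a configured split, which is
exactly why imposing a fixed direct/buffered ratio from above fights
the scheduler.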
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-24 11:33 ` Fengguang Wu (?) @ 2012-04-24 14:56 ` Jan Kara -1 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-24 14:56 UTC (permalink / raw) To: Fengguang Wu Cc: Vivek Goyal, Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Tue 24-04-12 19:33:40, Wu Fengguang wrote: > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote: > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: > > > > [..] > > > Yeah the backpressure idea would work nicely with all possible > > > intermediate stacking between the bdi and leaf devices. In my attempt > > > to do combined IO bandwidth control for > > > > > > - buffered writes, in balance_dirty_pages() > > > - direct IO, in the cfq IO scheduler > > > > > > I have to look into the cfq code in the past days to get an idea how > > > the two throttling layers can cooperate (and suffer from the pains > > > arise from the violations of layers). It's also rather tricky to get > > > two previously independent throttling mechanisms to work seamlessly > > > with each other for providing the desired _unified_ user interface. It > > > took a lot of reasoning and experiments to work the basic scheme out... > > > > > > But here is the first result. The attached graph shows progress of 4 > > > tasks: > > > - cgroup A: 1 direct dd + 1 buffered dd > > > - cgroup B: 1 direct dd + 1 buffered dd > > > > > > The 4 tasks are mostly progressing at the same pace. The top 2 > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are > > > for the direct writers. As you may notice, the two direct writers are > > > somehow stalled for 1-2 times, which increases the gaps between the > > > lines. Otherwise, the algorithm is working as expected to distribute > > > the bandwidth to each task. 
> > > > > > The current code's target is to satisfy the more realistic user demand > > > of distributing bandwidth equally to each cgroup, and inside each > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top > > > of which, weights can be specified to change the default distribution. > > > > > > The implementation involves adding "weight for direct IO" to the cfq > > > groups and "weight for buffered writes" to the root cgroup. Note that > > > current cfq proportional IO conroller does not offer explicit control > > > over the direct:buffered ratio. > > > > > > When there are both direct/buffered writers in the cgroup, > > > balance_dirty_pages() will kick in and adjust the weights for cfq to > > > execute. Note that cfq will continue to send all flusher IOs to the > > > root cgroup. balance_dirty_pages() will compute the overall async > > > weight for it so that in the above test case, the computed weights > > > will be > > > > I think having separate weigths for sync IO groups and async IO is not > > very appealing. There should be one notion of group weight and bandwidth > > distrubuted among groups according to their weight. > > There have to be some scheme, either explicitly or implicitly. Maybe > you are baring in mind some "equal split among queues" policy? For > example, if the cgroup has 9 active sync queues and 1 async queue, > split the weight equally to the 10 queues? So the sync IOs get 90% > share, and the async writes get 10% share. Maybe I misunderstand but there doesn't have to be (and in fact isn't) any split among sync / async IO in CFQ. At each moment, we choose a queue with the highest score and dispatch a couple of requests from it. Then we go and choose again. The score of the queue depends on several factors (like age of requests, whether the queue is sync or async, IO priority, etc.). 
Practically, over a longer period system will stabilize on some ratio but that's dependent on the load so your system should not impose some artificial direct/buffered split but rather somehow deal with the reality how IO scheduler decides to dispatch requests... > For dirty throttling w/o cgroup awareness, balance_dirty_pages() > splits the writeout bandwidth equally among all dirtier tasks. Since > cfq works with queues, it seems most natural for it to do equal split > among all queues (inside the cgroup). Well, but we also have IO priorities which change which queue should get preference. > I'm not sure when there are N dd tasks doing direct IO, cfq will > continuously run N sync queues for them (without many dynamic queue > deletion and recreations). If that is the case, it should be trivial > to support the queue based fair split in the global async queue > scheme. Otherwise I'll have some trouble detecting the N value when > trying to do the N:1 sync:async weight split. And also sync queues for several processes can get merged when CFQ observes these processes cooperate together on one area of disk and get split again when processes stop cooperating. I don't think you really want to second-guess what CFQ does inside... > Look at this graph, the 4 dd tasks are granted the same weight (2 of > them are buffered writes). I guess the 2 buffered dd tasks managed to > progress much faster than the 2 direct dd tasks just because the async > IOs are much more efficient than the bs=64k direct IOs. Likely because 64k is too low to get good bandwidth with direct IO. If it was 4M, I believe you would get similar throughput for buffered and direct IO. So essentially you are right, small IO benefits from caching effects since they allow you to submit larger requests to the device which is more efficient. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-24 14:56 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-24 14:56 UTC (permalink / raw) To: Fengguang Wu Cc: Vivek Goyal, Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Tue 24-04-12 19:33:40, Wu Fengguang wrote: > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote: > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: > > > > [..] > > > Yeah the backpressure idea would work nicely with all possible > > > intermediate stacking between the bdi and leaf devices. In my attempt > > > to do combined IO bandwidth control for > > > > > > - buffered writes, in balance_dirty_pages() > > > - direct IO, in the cfq IO scheduler > > > > > > I have to look into the cfq code in the past days to get an idea how > > > the two throttling layers can cooperate (and suffer from the pains > > > arise from the violations of layers). It's also rather tricky to get > > > two previously independent throttling mechanisms to work seamlessly > > > with each other for providing the desired _unified_ user interface. It > > > took a lot of reasoning and experiments to work the basic scheme out... > > > > > > But here is the first result. The attached graph shows progress of 4 > > > tasks: > > > - cgroup A: 1 direct dd + 1 buffered dd > > > - cgroup B: 1 direct dd + 1 buffered dd > > > > > > The 4 tasks are mostly progressing at the same pace. The top 2 > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are > > > for the direct writers. As you may notice, the two direct writers are > > > somehow stalled for 1-2 times, which increases the gaps between the > > > lines. Otherwise, the algorithm is working as expected to distribute > > > the bandwidth to each task. 
> > > > > > The current code's target is to satisfy the more realistic user demand > > > of distributing bandwidth equally to each cgroup, and inside each > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top > > > of which, weights can be specified to change the default distribution. > > > > > > The implementation involves adding "weight for direct IO" to the cfq > > > groups and "weight for buffered writes" to the root cgroup. Note that > > > current cfq proportional IO conroller does not offer explicit control > > > over the direct:buffered ratio. > > > > > > When there are both direct/buffered writers in the cgroup, > > > balance_dirty_pages() will kick in and adjust the weights for cfq to > > > execute. Note that cfq will continue to send all flusher IOs to the > > > root cgroup. balance_dirty_pages() will compute the overall async > > > weight for it so that in the above test case, the computed weights > > > will be > > > > I think having separate weigths for sync IO groups and async IO is not > > very appealing. There should be one notion of group weight and bandwidth > > distrubuted among groups according to their weight. > > There have to be some scheme, either explicitly or implicitly. Maybe > you are baring in mind some "equal split among queues" policy? For > example, if the cgroup has 9 active sync queues and 1 async queue, > split the weight equally to the 10 queues? So the sync IOs get 90% > share, and the async writes get 10% share. Maybe I misunderstand but there doesn't have to be (and in fact isn't) any split among sync / async IO in CFQ. At each moment, we choose a queue with the highest score and dispatch a couple of requests from it. Then we go and choose again. The score of the queue depends on several factors (like age of requests, whether the queue is sync or async, IO priority, etc.). 
Practically, over a longer period system will stabilize on some ratio but that's dependent on the load so your system should not impose some artificial direct/buffered split but rather somehow deal with the reality how IO scheduler decides to dispatch requests... > For dirty throttling w/o cgroup awareness, balance_dirty_pages() > splits the writeout bandwidth equally among all dirtier tasks. Since > cfq works with queues, it seems most natural for it to do equal split > among all queues (inside the cgroup). Well, but we also have IO priorities which change which queue should get preference. > I'm not sure when there are N dd tasks doing direct IO, cfq will > continuously run N sync queues for them (without many dynamic queue > deletion and recreations). If that is the case, it should be trivial > to support the queue based fair split in the global async queue > scheme. Otherwise I'll have some trouble detecting the N value when > trying to do the N:1 sync:async weight split. And also sync queues for several processes can get merged when CFQ observes these processes cooperate together on one area of disk and get split again when processes stop cooperating. I don't think you really want to second-guess what CFQ does inside... > Look at this graph, the 4 dd tasks are granted the same weight (2 of > them are buffered writes). I guess the 2 buffered dd tasks managed to > progress much faster than the 2 direct dd tasks just because the async > IOs are much more efficient than the bs=64k direct IOs. Likely because 64k is too low to get good bandwidth with direct IO. If it was 4M, I believe you would get similar throughput for buffered and direct IO. So essentially you are right, small IO benefits from caching effects since they allow you to submit larger requests to the device which is more efficient. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-24 14:56 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-24 14:56 UTC (permalink / raw) To: Fengguang Wu Cc: Vivek Goyal, Tejun Heo, Jan Kara, Jens Axboe, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k, andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On Tue 24-04-12 19:33:40, Wu Fengguang wrote: > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote: > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: > > > > [..] > > > Yeah the backpressure idea would work nicely with all possible > > > intermediate stacking between the bdi and leaf devices. In my attempt > > > to do combined IO bandwidth control for > > > > > > - buffered writes, in balance_dirty_pages() > > > - direct IO, in the cfq IO scheduler > > > > > > I have to look into the cfq code in the past days to get an idea how > > > the two throttling layers can cooperate (and suffer from the pains > > > arise from the violations of layers). It's also rather tricky to get > > > two previously independent throttling mechanisms to work seamlessly > > > with each other for providing the desired _unified_ user interface. It > > > took a lot of reasoning and experiments to work the basic scheme out... > > > > > > But here is the first result. The attached graph shows progress of 4 > > > tasks: > > > - cgroup A: 1 direct dd + 1 buffered dd > > > - cgroup B: 1 direct dd + 1 buffered dd > > > > > > The 4 tasks are mostly progressing at the same pace. The top 2 > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are > > > for the direct writers. 
As you may notice, the two direct writers are > > > somehow stalled for 1-2 times, which increases the gaps between the > > > lines. Otherwise, the algorithm is working as expected to distribute > > > the bandwidth to each task. > > > > > > The current code's target is to satisfy the more realistic user demand > > > of distributing bandwidth equally to each cgroup, and inside each > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top > > > of which, weights can be specified to change the default distribution. > > > > > > The implementation involves adding "weight for direct IO" to the cfq > > > groups and "weight for buffered writes" to the root cgroup. Note that > > > current cfq proportional IO conroller does not offer explicit control > > > over the direct:buffered ratio. > > > > > > When there are both direct/buffered writers in the cgroup, > > > balance_dirty_pages() will kick in and adjust the weights for cfq to > > > execute. Note that cfq will continue to send all flusher IOs to the > > > root cgroup. balance_dirty_pages() will compute the overall async > > > weight for it so that in the above test case, the computed weights > > > will be > > > > I think having separate weigths for sync IO groups and async IO is not > > very appealing. There should be one notion of group weight and bandwidth > > distrubuted among groups according to their weight. > > There have to be some scheme, either explicitly or implicitly. Maybe > you are baring in mind some "equal split among queues" policy? For > example, if the cgroup has 9 active sync queues and 1 async queue, > split the weight equally to the 10 queues? So the sync IOs get 90% > share, and the async writes get 10% share. Maybe I misunderstand but there doesn't have to be (and in fact isn't) any split among sync / async IO in CFQ. At each moment, we choose a queue with the highest score and dispatch a couple of requests from it. Then we go and choose again. 
The score of the queue depends on several factors (like age of requests, whether the queue is sync or async, IO priority, etc.). Practically, over a longer period the system will stabilize on some ratio, but that's dependent on the load, so your system should not impose some artificial direct/buffered split but rather somehow deal with the reality of how the IO scheduler decides to dispatch requests... > For dirty throttling w/o cgroup awareness, balance_dirty_pages() > splits the writeout bandwidth equally among all dirtier tasks. Since > cfq works with queues, it seems most natural for it to do equal split > among all queues (inside the cgroup). Well, but we also have IO priorities which change which queue should get preference. > I'm not sure when there are N dd tasks doing direct IO, cfq will > continuously run N sync queues for them (without many dynamic queue > deletion and recreations). If that is the case, it should be trivial > to support the queue based fair split in the global async queue > scheme. Otherwise I'll have some trouble detecting the N value when > trying to do the N:1 sync:async weight split. And also the sync queues for several processes can get merged when CFQ observes these processes cooperating on one area of the disk, and get split again when the processes stop cooperating. I don't think you really want to second-guess what CFQ does inside... > Look at this graph, the 4 dd tasks are granted the same weight (2 of > them are buffered writes). I guess the 2 buffered dd tasks managed to > progress much faster than the 2 direct dd tasks just because the async > IOs are much more efficient than the bs=64k direct IOs. Likely because 64k is too low to get good bandwidth with direct IO. If it were 4M, I believe you would get similar throughput for buffered and direct IO. So essentially you are right: small IOs benefit from caching effects, since caching allows you to submit larger requests to the device, which is more efficient. 
Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-24 14:56 ` Jan Kara @ 2012-04-24 15:58 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-24 15:58 UTC (permalink / raw) To: Jan Kara Cc: Fengguang Wu, Tejun Heo, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote: [..] > > > I think having separate weigths for sync IO groups and async IO is not > > > very appealing. There should be one notion of group weight and bandwidth > > > distrubuted among groups according to their weight. > > > > There have to be some scheme, either explicitly or implicitly. Maybe > > you are baring in mind some "equal split among queues" policy? For > > example, if the cgroup has 9 active sync queues and 1 async queue, > > split the weight equally to the 10 queues? So the sync IOs get 90% > > share, and the async writes get 10% share. > Maybe I misunderstand but there doesn't have to be (and in fact isn't) > any split among sync / async IO in CFQ. At each moment, we choose a queue > with the highest score and dispatch a couple of requests from it. Then we > go and choose again. The score of the queue depends on several factors > (like age of requests, whether the queue is sync or async, IO priority, > etc.). > > Practically, over a longer period system will stabilize on some ratio > but that's dependent on the load so your system should not impose some > artificial direct/buffered split but rather somehow deal with the reality > how IO scheduler decides to dispatch requests... Yes. CFQ does not have the notion of giving a fixed share to async requests. In fact right now it is so biased in favor of sync requests that in some cases it can starve async writes or introduce long delays resulting in "task hung for 120 seconds" warnings. 
So if there are issues w.r.t. how the disk is shared between sync/async IO within a cgroup, that should be handled at the IO scheduler level. Writeback code trying to dictate that ratio sounds odd. > > > For dirty throttling w/o cgroup awareness, balance_dirty_pages() > > splits the writeout bandwidth equally among all dirtier tasks. Since > > cfq works with queues, it seems most natural for it to do equal split > > among all queues (inside the cgroup). > Well, but we also have IO priorities which change which queue should get > preference. > > > I'm not sure when there are N dd tasks doing direct IO, cfq will > > continuously run N sync queues for them (without many dynamic queue > > deletion and recreations). If that is the case, it should be trivial > > to support the queue based fair split in the global async queue > > scheme. Otherwise I'll have some trouble detecting the N value when > > trying to do the N:1 sync:async weight split. > And also sync queues for several processes can get merged when CFQ > observes these processes cooperate together on one area of disk and get > split again when processes stop cooperating. I don't think you really want > to second-guess what CFQ does inside... Agreed. Trying to predict what CFQ will do and then trying to influence the sync/async ratio based on the root cgroup weight does not seem to be the right way. Especially since that will also mean either assuming that everything in the root group is sync, or we shall have to split the sync/async weight notion. The sync/async ratio is an IO scheduler thing and is not fixed. So the writeback layer making assumptions and changing weights sounds very awkward to me. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120424155843.GG26708-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-25 2:42 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-25 2:42 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Tue, Apr 24, 2012 at 11:58:43AM -0400, Vivek Goyal wrote: > On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote: > > [..] > > > > I think having separate weigths for sync IO groups and async IO is not > > > > very appealing. There should be one notion of group weight and bandwidth > > > > distrubuted among groups according to their weight. > > > > > > There have to be some scheme, either explicitly or implicitly. Maybe > > > you are baring in mind some "equal split among queues" policy? For > > > example, if the cgroup has 9 active sync queues and 1 async queue, > > > split the weight equally to the 10 queues? So the sync IOs get 90% > > > share, and the async writes get 10% share. > > Maybe I misunderstand but there doesn't have to be (and in fact isn't) > > any split among sync / async IO in CFQ. At each moment, we choose a queue > > with the highest score and dispatch a couple of requests from it. Then we > > go and choose again. The score of the queue depends on several factors > > (like age of requests, whether the queue is sync or async, IO priority, > > etc.). 
> > > > Practically, over a longer period system will stabilize on some ratio > > but that's dependent on the load so your system should not impose some > > artificial direct/buffered split but rather somehow deal with the reality > > how IO scheduler decides to dispatch requests... > > Yes. CFQ does not have the notion of giving a fixed share to async > requests. In fact right now it is so biased in favor of sync requests, > that in some cases it can starve async writes or introduce long delays > resulting in "task hung for 120 second" warnings. > > So if there are issues w.r.t how disk is shared between sync/async IO > with in a cgroup, that should be handled at IO scheduler level. Writeback > code trying to dictate that ratio, sounds odd. Indeed it sounds odd. However, it does look like there needs to be some sync/async ratio to avoid livelock issues, say 80:20 or whatever. What's your original plan to deal with this in the IO scheduler? > > > For dirty throttling w/o cgroup awareness, balance_dirty_pages() > > > splits the writeout bandwidth equally among all dirtier tasks. Since > > > cfq works with queues, it seems most natural for it to do equal split > > > among all queues (inside the cgroup). > > Well, but we also have IO priorities which change which queue should get > > preference. > > > > > I'm not sure when there are N dd tasks doing direct IO, cfq will > > > continuously run N sync queues for them (without many dynamic queue > > > deletion and recreations). If that is the case, it should be trivial > > > to support the queue based fair split in the global async queue > > > scheme. Otherwise I'll have some trouble detecting the N value when > > > trying to do the N:1 sync:async weight split. > > And also sync queues for several processes can get merged when CFQ > > observes these processes cooperate together on one area of disk and get > > split again when processes stop cooperating. I don't think you really want > > to second-guess what CFQ does inside... 
> > Agreed. Trying to predict what CFQ will do and then trying to influence > sync/async ration based on root cgroup weight does not seem to be the > right way. Especially that will also mean either assuming that everything > in root group is sync or we shall have to split sync/async weight notion. It seems there is some misunderstanding about the sync/async split. No, root cgroup tasks won't be special w.r.t. the weight split, although in the current patch I do make the assumption that no IO is happening in the root cgroup. To make it look easier, we may as well move the flusher thread to a standalone cgroup. Then, if the root cgroup has both aggressive sync and async IOs, the split will be carried out the same way as for other cgroups: rootcg->dio_weight = rootcg->weight / 2 flushercg->async_weight += rootcg->weight / 2 > sync/async ratio is a IO scheduler thing and is not fixed. So writeback > layer making assumptions and changing weigths sounds very awkward to me. OK, the ratio is not fixed, so I'm not going to do the guess work. However, there is still the question: how are we going to fix the sync-starves-async IO problem without some guaranteed ratio? Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup
  2012-04-24 14:56 ` Jan Kara
@ 2012-04-25  3:16 ` Fengguang Wu
  0 siblings, 0 replies; 261+ messages in thread
From: Fengguang Wu @ 2012-04-25 3:16 UTC (permalink / raw)
To: Jan Kara
Cc: Jens Axboe, ctalbott, rni, andrea, containers, linux-kernel,
    sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo,
    linux-fsdevel, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 5979 bytes --]

On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:
> On Tue 24-04-12 19:33:40, Wu Fengguang wrote:
> > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> > >
> > > [..]
> > > > Yeah, the backpressure idea would work nicely with all possible
> > > > intermediate stacking between the bdi and leaf devices. In my attempt
> > > > to do combined IO bandwidth control for
> > > >
> > > > - buffered writes, in balance_dirty_pages()
> > > > - direct IO, in the cfq IO scheduler
> > > >
> > > > I have had to look into the cfq code in the past days to get an idea
> > > > how the two throttling layers can cooperate (and suffer from the pains
> > > > arising from the violations of layers). It's also rather tricky to get
> > > > two previously independent throttling mechanisms to work seamlessly
> > > > with each other to provide the desired _unified_ user interface. It
> > > > took a lot of reasoning and experiments to work the basic scheme out...
> > > >
> > > > But here is the first result. The attached graph shows the progress
> > > > of 4 tasks:
> > > > - cgroup A: 1 direct dd + 1 buffered dd
> > > > - cgroup B: 1 direct dd + 1 buffered dd
> > > >
> > > > The 4 tasks are mostly progressing at the same pace. The top 2,
> > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are
> > > > for the direct writers. As you may notice, the two direct writers are
> > > > somehow stalled 1-2 times, which increases the gaps between the
> > > > lines. Otherwise, the algorithm is working as expected to distribute
> > > > the bandwidth to each task.
> > > >
> > > > The current code's target is to satisfy the more realistic user demand
> > > > of distributing bandwidth equally to each cgroup, and inside each
> > > > cgroup, distributing bandwidth equally to buffered/direct writes. On
> > > > top of that, weights can be specified to change the default
> > > > distribution.
> > > >
> > > > The implementation involves adding a "weight for direct IO" to the cfq
> > > > groups and a "weight for buffered writes" to the root cgroup. Note
> > > > that the current cfq proportional IO controller does not offer
> > > > explicit control over the direct:buffered ratio.
> > > >
> > > > When there are both direct and buffered writers in the cgroup,
> > > > balance_dirty_pages() will kick in and adjust the weights for cfq to
> > > > execute. Note that cfq will continue to send all flusher IOs to the
> > > > root cgroup. balance_dirty_pages() will compute the overall async
> > > > weight for it so that in the above test case, the computed weights
> > > > will be
> > >
> > > I think having separate weights for sync IO groups and async IO is not
> > > very appealing. There should be one notion of group weight and bandwidth
> > > distributed among groups according to their weight.
> >
> > There has to be some scheme, either explicit or implicit. Maybe
> > you are bearing in mind some "equal split among queues" policy? For
> > example, if the cgroup has 9 active sync queues and 1 async queue,
> > split the weight equally among the 10 queues? So the sync IOs get a 90%
> > share, and the async writes get a 10% share.
>   Maybe I misunderstand, but there doesn't have to be (and in fact isn't)
> any split between sync and async IO in CFQ. At each moment, we choose the
> queue with the highest score and dispatch a couple of requests from it.
> Then we go and choose again. The score of a queue depends on several
> factors (like the age of its requests, whether the queue is sync or async,
> IO priority, etc.).
>
>   Practically, over a longer period the system will stabilize on some
> ratio, but that's dependent on the load, so your system should not impose
> some artificial direct/buffered split but rather somehow deal with the
> reality of how the IO scheduler decides to dispatch requests...

>   Well, but we also have IO priorities which change which queue should get
> preference.

>   And also sync queues for several processes can get merged when CFQ
> observes these processes cooperating on one area of the disk, and get
> split again when the processes stop cooperating. I don't think you really
> want to second-guess what CFQ does inside...

Good points, thank you!

So the cfq behavior is pretty nondeterministic. I more or less realized
this from the experiments. For example, when starting 2+ "dd oflag=direct"
tasks in a single cgroup, they _sometimes_ progress at different rates.
See the attached graphs for two such examples on XFS. ext4 is fine.

The 2-dd test case is:

	mkdir /cgroup/dd
	echo $$ > /cgroup/dd/tasks

	dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct &
	dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct &

The 6-dd test case is similar.

> > Look at this graph: the 4 dd tasks are granted the same weight (2 of
> > them are buffered writes). I guess the 2 buffered dd tasks managed to
> > progress much faster than the 2 direct dd tasks just because the async
> > IOs are much more efficient than the bs=64k direct IOs.
>   Likely because 64k is too low to get good bandwidth with direct IO. If
> it was 4M, I believe you would get similar throughput for buffered and
> direct IO. So essentially you are right, small IO benefits from caching
> effects since it allows you to submit larger requests to the device,
> which is more efficient.

I didn't directly compare the effects, however here is an example of
doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has
a marginal benefit over 64k, assuming cfq is behaving well.

https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png

The test case is:

	# cgroup 1
	echo 500 > /cgroup/cp/blkio.weight

	dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct &

	# cgroup 2
	echo 1000 > /cgroup/dd/blkio.weight

	dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct &
	dd if=/dev/zero of=/fs/zero-4k bs=4k oflag=direct &

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 55134 bytes --]

[-- Attachment #3: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 61243 bytes --]
* Re: [RFC] writeback and cgroup
  2012-04-25  3:16 ` Fengguang Wu
@ 2012-04-25  9:01 ` Jan Kara
  0 siblings, 0 replies; 261+ messages in thread
From: Jan Kara @ 2012-04-25 9:01 UTC (permalink / raw)
To: Fengguang Wu
Cc: Jan Kara, Vivek Goyal, Tejun Heo, Jens Axboe, linux-mm,
    sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
    kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf

On Wed 25-04-12 11:16:35, Wu Fengguang wrote:
> On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:
> > On Tue 24-04-12 19:33:40, Wu Fengguang wrote:
> > > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> > > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> > > >
> > > > [..]
> > > > > Yeah, the backpressure idea would work nicely with all possible
> > > > > intermediate stacking between the bdi and leaf devices. In my
> > > > > attempt to do combined IO bandwidth control for
> > > > >
> > > > > - buffered writes, in balance_dirty_pages()
> > > > > - direct IO, in the cfq IO scheduler
> > > > >
> > > > > I have had to look into the cfq code in the past days to get an
> > > > > idea how the two throttling layers can cooperate (and suffer from
> > > > > the pains arising from the violations of layers). It's also rather
> > > > > tricky to get two previously independent throttling mechanisms to
> > > > > work seamlessly with each other to provide the desired _unified_
> > > > > user interface. It took a lot of reasoning and experiments to work
> > > > > the basic scheme out...
> > > > >
> > > > > But here is the first result. The attached graph shows the
> > > > > progress of 4 tasks:
> > > > > - cgroup A: 1 direct dd + 1 buffered dd
> > > > > - cgroup B: 1 direct dd + 1 buffered dd
> > > > >
> > > > > The 4 tasks are mostly progressing at the same pace. The top 2,
> > > > > smoother lines are for the buffered dirtiers. The bottom 2 lines
> > > > > are for the direct writers. As you may notice, the two direct
> > > > > writers are somehow stalled 1-2 times, which increases the gaps
> > > > > between the lines. Otherwise, the algorithm is working as expected
> > > > > to distribute the bandwidth to each task.
> > > > >
> > > > > The current code's target is to satisfy the more realistic user
> > > > > demand of distributing bandwidth equally to each cgroup, and inside
> > > > > each cgroup, distributing bandwidth equally to buffered/direct
> > > > > writes. On top of that, weights can be specified to change the
> > > > > default distribution.
> > > > >
> > > > > The implementation involves adding a "weight for direct IO" to the
> > > > > cfq groups and a "weight for buffered writes" to the root cgroup.
> > > > > Note that the current cfq proportional IO controller does not offer
> > > > > explicit control over the direct:buffered ratio.
> > > > >
> > > > > When there are both direct and buffered writers in the cgroup,
> > > > > balance_dirty_pages() will kick in and adjust the weights for cfq
> > > > > to execute. Note that cfq will continue to send all flusher IOs to
> > > > > the root cgroup. balance_dirty_pages() will compute the overall
> > > > > async weight for it so that in the above test case, the computed
> > > > > weights will be
> > > >
> > > > I think having separate weights for sync IO groups and async IO is
> > > > not very appealing. There should be one notion of group weight and
> > > > bandwidth distributed among groups according to their weight.
> > >
> > > There has to be some scheme, either explicit or implicit. Maybe
> > > you are bearing in mind some "equal split among queues" policy? For
> > > example, if the cgroup has 9 active sync queues and 1 async queue,
> > > split the weight equally among the 10 queues? So the sync IOs get a 90%
> > > share, and the async writes get a 10% share.
> >   Maybe I misunderstand, but there doesn't have to be (and in fact isn't)
> > any split between sync and async IO in CFQ. At each moment, we choose the
> > queue with the highest score and dispatch a couple of requests from it.
> > Then we go and choose again. The score of a queue depends on several
> > factors (like the age of its requests, whether the queue is sync or
> > async, IO priority, etc.).
> >
> >   Practically, over a longer period the system will stabilize on some
> > ratio, but that's dependent on the load, so your system should not impose
> > some artificial direct/buffered split but rather somehow deal with the
> > reality of how the IO scheduler decides to dispatch requests...
>
> >   Well, but we also have IO priorities which change which queue should
> > get preference.
>
> >   And also sync queues for several processes can get merged when CFQ
> > observes these processes cooperating on one area of the disk, and get
> > split again when the processes stop cooperating. I don't think you really
> > want to second-guess what CFQ does inside...
>
> Good points, thank you!
>
> So the cfq behavior is pretty nondeterministic. I more or less realized
> this from the experiments. For example, when starting 2+ "dd oflag=direct"
> tasks in a single cgroup, they _sometimes_ progress at different rates.
> See the attached graphs for two such examples on XFS. ext4 is fine.
>
> The 2-dd test case is:
>
> 	mkdir /cgroup/dd
> 	echo $$ > /cgroup/dd/tasks
>
> 	dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct &
> 	dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct &
>
> The 6-dd test case is similar.

  Hum, interesting. I would not expect that. Maybe it's because the files
are allocated in different areas of the disk. But even then the difference
should not be *that* big.

> > > Look at this graph: the 4 dd tasks are granted the same weight (2 of
> > > them are buffered writes). I guess the 2 buffered dd tasks managed to
> > > progress much faster than the 2 direct dd tasks just because the async
> > > IOs are much more efficient than the bs=64k direct IOs.
> >   Likely because 64k is too low to get good bandwidth with direct IO. If
> > it was 4M, I believe you would get similar throughput for buffered and
> > direct IO. So essentially you are right, small IO benefits from caching
> > effects since it allows you to submit larger requests to the device,
> > which is more efficient.
>
> I didn't directly compare the effects, however here is an example of
> doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has
> a marginal benefit over 64k, assuming cfq is behaving well.
>
> https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png
>
> The test case is:
>
> 	# cgroup 1
> 	echo 500 > /cgroup/cp/blkio.weight
>
> 	dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct &
>
> 	# cgroup 2
> 	echo 1000 > /cgroup/dd/blkio.weight
>
> 	dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct &
> 	dd if=/dev/zero of=/fs/zero-4k bs=4k oflag=direct &

  Um, I'm not completely sure what you tried to test above. What I wanted
to point out is that direct IO is not necessarily less efficient than
buffered IO. Look:

xen-node0:~ # uname -a
Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s

So both direct and buffered IO are about the same. Note that I used the
conv=fsync flag to remove the effect of part of the buffered write still
sitting in the cache when dd is done writing, which would be unfair to the
direct writer...

And actually 64k vs 1M makes a big difference on my machine:

xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s

								Honza
* Re: [RFC] writeback and cgroup @ 2012-04-25 9:01 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-25 9:01 UTC (permalink / raw) To: Fengguang Wu Cc: Jan Kara, Vivek Goyal, Tejun Heo, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Wed 25-04-12 11:16:35, Wu Fengguang wrote: > On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote: > > On Tue 24-04-12 19:33:40, Wu Fengguang wrote: > > > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote: > > > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: > > > > > > > > [..] > > > > > Yeah the backpressure idea would work nicely with all possible > > > > > intermediate stacking between the bdi and leaf devices. In my attempt > > > > > to do combined IO bandwidth control for > > > > > > > > > > - buffered writes, in balance_dirty_pages() > > > > > - direct IO, in the cfq IO scheduler > > > > > > > > > > I have to look into the cfq code in the past days to get an idea how > > > > > the two throttling layers can cooperate (and suffer from the pains > > > > > arise from the violations of layers). It's also rather tricky to get > > > > > two previously independent throttling mechanisms to work seamlessly > > > > > with each other for providing the desired _unified_ user interface. It > > > > > took a lot of reasoning and experiments to work the basic scheme out... > > > > > > > > > > But here is the first result. The attached graph shows progress of 4 > > > > > tasks: > > > > > - cgroup A: 1 direct dd + 1 buffered dd > > > > > - cgroup B: 1 direct dd + 1 buffered dd > > > > > > > > > > The 4 tasks are mostly progressing at the same pace. The top 2 > > > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are > > > > > for the direct writers. 
As you may notice, the two direct writers are > > > > > somehow stalled for 1-2 times, which increases the gaps between the > > > > > lines. Otherwise, the algorithm is working as expected to distribute > > > > > the bandwidth to each task. > > > > > > > > > > The current code's target is to satisfy the more realistic user demand > > > > > of distributing bandwidth equally to each cgroup, and inside each > > > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top > > > > > of which, weights can be specified to change the default distribution. > > > > > > > > > > The implementation involves adding "weight for direct IO" to the cfq > > > > > groups and "weight for buffered writes" to the root cgroup. Note that > > > > > current cfq proportional IO conroller does not offer explicit control > > > > > over the direct:buffered ratio. > > > > > > > > > > When there are both direct/buffered writers in the cgroup, > > > > > balance_dirty_pages() will kick in and adjust the weights for cfq to > > > > > execute. Note that cfq will continue to send all flusher IOs to the > > > > > root cgroup. balance_dirty_pages() will compute the overall async > > > > > weight for it so that in the above test case, the computed weights > > > > > will be > > > > > > > > I think having separate weigths for sync IO groups and async IO is not > > > > very appealing. There should be one notion of group weight and bandwidth > > > > distrubuted among groups according to their weight. > > > > > > There have to be some scheme, either explicitly or implicitly. Maybe > > > you are baring in mind some "equal split among queues" policy? For > > > example, if the cgroup has 9 active sync queues and 1 async queue, > > > split the weight equally to the 10 queues? So the sync IOs get 90% > > > share, and the async writes get 10% share. > > Maybe I misunderstand but there doesn't have to be (and in fact isn't) > > any split among sync / async IO in CFQ. 
At each moment, we choose a queue > > with the highest score and dispatch a couple of requests from it. Then we > > go and choose again. The score of the queue depends on several factors > > (like age of requests, whether the queue is sync or async, IO priority, > > etc.). > > > > Practically, over a longer period system will stabilize on some ratio > > but that's dependent on the load so your system should not impose some > > artificial direct/buffered split but rather somehow deal with the reality > > how IO scheduler decides to dispatch requests... > > > Well, but we also have IO priorities which change which queue should get > > preference. > > > And also sync queues for several processes can get merged when CFQ > > observes these processes cooperate together on one area of disk and get > > split again when processes stop cooperating. I don't think you really want > > to second-guess what CFQ does inside... > > Good points, thank you! > > So the cfq behavior is pretty undetermined. I more or less realize > this from the experiments. For example, when starting 2+ "dd oflag=direct" > tasks in one single cgroup, they _sometimes_ progress at different rates. > See the attached graphs for two such examples on XFS. ext4 is fine. > > The 2-dd test case is: > > mkdir /cgroup/dd > echo $$ > /cgroup/dd/tasks > > dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct & > dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct & > > The 6-dd test case is similar. Hum, interesting. I would not expect that. Maybe it's because files are allocated at the different area of the disk. But even then the difference should not be *that* big. > > > Look at this graph, the 4 dd tasks are granted the same weight (2 of > > > them are buffered writes). I guess the 2 buffered dd tasks managed to > > > progress much faster than the 2 direct dd tasks just because the async > > > IOs are much more efficient than the bs=64k direct IOs. > > Likely because 64k is too low to get good bandwidth with direct IO. 
If > > it was 4M, I believe you would get similar throughput for buffered and > > direct IO. So essentially you are right, small IO benefits from caching > > effects since they allow you to submit larger requests to the device which > > is more efficient. > > I didn't direct compare the effects, however here is an example of > doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has > marginal benefits of 64k, assuming cfq is behaving well. > > https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png > > The test case is: > > # cgroup 1 > echo 500 > /cgroup/cp/blkio.weight > > dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct & > > # cgroup 2 > echo 1000 > /cgroup/dd/blkio.weight > > dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct & > dd if=/dev/zero of=/fs/zero-4k bs=4k oflag=direct & Um, I'm not completely sure what you tried to test in the above test. What I wanted to point out is that direct IO is not necessarily less efficient than buffered IO. Look: xen-node0:~ # uname -a Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s So both direct and buffered IO are about the same. Note that I used conv=fsync flag to erase the effect that part of buffered write still remains in the cache when dd is done writing which is unfair to direct writer... 
And actually 64k vs 1M makes a big difference on my machine: xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync 16384+0 records in 16384+0 records out 1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s Honza -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-25 9:01 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-25 9:01 UTC (permalink / raw) To: Fengguang Wu Cc: Jan Kara, Vivek Goyal, Tejun Heo, Jens Axboe, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k, andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On Wed 25-04-12 11:16:35, Wu Fengguang wrote: > On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote: > > On Tue 24-04-12 19:33:40, Wu Fengguang wrote: > > > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote: > > > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: > > > > > > > > [..] > > > > > Yeah the backpressure idea would work nicely with all possible > > > > > intermediate stacking between the bdi and leaf devices. In my attempt > > > > > to do combined IO bandwidth control for > > > > > > > > > > - buffered writes, in balance_dirty_pages() > > > > > - direct IO, in the cfq IO scheduler > > > > > > > > > > I have to look into the cfq code in the past days to get an idea how > > > > > the two throttling layers can cooperate (and suffer from the pains > > > > > arise from the violations of layers). It's also rather tricky to get > > > > > two previously independent throttling mechanisms to work seamlessly > > > > > with each other for providing the desired _unified_ user interface. It > > > > > took a lot of reasoning and experiments to work the basic scheme out... > > > > > > > > > > But here is the first result. 
The attached graph shows progress of 4 > > > > > tasks: > > > > > - cgroup A: 1 direct dd + 1 buffered dd > > > > > - cgroup B: 1 direct dd + 1 buffered dd > > > > > > > > > > The 4 tasks are mostly progressing at the same pace. The top 2 > > > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are > > > > > for the direct writers. As you may notice, the two direct writers are > > > > > somehow stalled for 1-2 times, which increases the gaps between the > > > > > lines. Otherwise, the algorithm is working as expected to distribute > > > > > the bandwidth to each task. > > > > > > > > > > The current code's target is to satisfy the more realistic user demand > > > > > of distributing bandwidth equally to each cgroup, and inside each > > > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top > > > > > of which, weights can be specified to change the default distribution. > > > > > > > > > > The implementation involves adding "weight for direct IO" to the cfq > > > > > groups and "weight for buffered writes" to the root cgroup. Note that > > > > > current cfq proportional IO conroller does not offer explicit control > > > > > over the direct:buffered ratio. > > > > > > > > > > When there are both direct/buffered writers in the cgroup, > > > > > balance_dirty_pages() will kick in and adjust the weights for cfq to > > > > > execute. Note that cfq will continue to send all flusher IOs to the > > > > > root cgroup. balance_dirty_pages() will compute the overall async > > > > > weight for it so that in the above test case, the computed weights > > > > > will be > > > > > > > > I think having separate weigths for sync IO groups and async IO is not > > > > very appealing. There should be one notion of group weight and bandwidth > > > > distrubuted among groups according to their weight. > > > > > > There have to be some scheme, either explicitly or implicitly. Maybe > > > you are baring in mind some "equal split among queues" policy? 
For > > > example, if the cgroup has 9 active sync queues and 1 async queue, > > > split the weight equally among the 10 queues? So the sync IOs get 90% > > > share, and the async writes get 10% share. > > Maybe I misunderstand, but there doesn't have to be (and in fact isn't) > > any split among sync / async IO in CFQ. At each moment, we choose a queue > > with the highest score and dispatch a couple of requests from it. Then we > > go and choose again. The score of the queue depends on several factors > > (like age of requests, whether the queue is sync or async, IO priority, > > etc.). > > > > Practically, over a longer period the system will stabilize on some ratio, > > but that's dependent on the load, so your system should not impose some > > artificial direct/buffered split but rather somehow deal with the reality of > > how the IO scheduler decides to dispatch requests... > > > Well, but we also have IO priorities which change which queue should get > > preference. > > > And also sync queues for several processes can get merged when CFQ > > observes these processes cooperating on one area of the disk, and get > > split again when the processes stop cooperating. I don't think you really want > > to second-guess what CFQ does inside... > > Good points, thank you! > > So the cfq behavior is pretty nondeterministic. I more or less realized > this from the experiments. For example, when starting 2+ "dd oflag=direct" > tasks in one single cgroup, they _sometimes_ progress at different rates. > See the attached graphs for two such examples on XFS. ext4 is fine. > > The 2-dd test case is: > > mkdir /cgroup/dd > echo $$ > /cgroup/dd/tasks > > dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct & > dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct & > > The 6-dd test case is similar. Hum, interesting. I would not expect that. Maybe it's because the files are allocated in different areas of the disk. But even then the difference should not be *that* big.
> > > Look at this graph, the 4 dd tasks are granted the same weight (2 of > > > them are buffered writes). I guess the 2 buffered dd tasks managed to > > > progress much faster than the 2 direct dd tasks just because the async > > > IOs are much more efficient than the bs=64k direct IOs. > > Likely because 64k is too low to get good bandwidth with direct IO. If > > it were 4M, I believe you would get similar throughput for buffered and > > direct IO. So essentially you are right, small IO benefits from caching > > effects since they allow you to submit larger requests to the device, which > > is more efficient. > > I didn't directly compare the effects; however, here is an example of > doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has > marginal benefits over 64k, assuming cfq is behaving well. > > https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png > > The test case is: > > # cgroup 1 > echo 500 > /cgroup/cp/blkio.weight > > dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct & > > # cgroup 2 > echo 1000 > /cgroup/dd/blkio.weight > > dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct & > dd if=/dev/zero of=/fs/zero-4k bs=4k oflag=direct & Um, I'm not completely sure what you tried to test in the above test. What I wanted to point out is that direct IO is not necessarily less efficient than buffered IO. Look: xen-node0:~ # uname -a Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s So both direct and buffered IO are about the same.
Note that I used conv=fsync flag to erase the effect that part of buffered write still remains in the cache when dd is done writing which is unfair to direct writer... And actually 64k vs 1M makes a big difference on my machine: xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync 16384+0 records in 16384+0 records out 1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s Honza ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-25 9:01 ` Jan Kara @ 2012-04-25 12:05 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-25 12:05 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel, Vivek Goyal [-- Attachment #1: Type: text/plain, Size: 4319 bytes --] > > So the cfq behavior is pretty nondeterministic. I more or less realized > > this from the experiments. For example, when starting 2+ "dd oflag=direct" > > tasks in one single cgroup, they _sometimes_ progress at different rates. > > See the attached graphs for two such examples on XFS. ext4 is fine. > > > > The 2-dd test case is: > > > > mkdir /cgroup/dd > > echo $$ > /cgroup/dd/tasks > > > > dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct & > > dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct & > > > > The 6-dd test case is similar. > Hum, interesting. I would not expect that. Maybe it's because the files are > allocated in different areas of the disk. But even then the difference > should not be *that* big. Agreed. > > > > Look at this graph, the 4 dd tasks are granted the same weight (2 of > > > > them are buffered writes). I guess the 2 buffered dd tasks managed to > > > > progress much faster than the 2 direct dd tasks just because the async > > > > IOs are much more efficient than the bs=64k direct IOs. > > > Likely because 64k is too low to get good bandwidth with direct IO. If > > > it were 4M, I believe you would get similar throughput for buffered and > > > direct IO.
So essentially you are right, small IO benefits from caching > > > effects since they allow you to submit larger requests to the device, which > > > is more efficient. > > > > I didn't directly compare the effects; however, here is an example of > > doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has > > marginal benefits over 64k, assuming cfq is behaving well. > > > > https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png > > > > The test case is: > > > > # cgroup 1 > > echo 500 > /cgroup/cp/blkio.weight > > > > dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct & > > > > # cgroup 2 > > echo 1000 > /cgroup/dd/blkio.weight > > > > dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct & > > dd if=/dev/zero of=/fs/zero-4k bs=4k oflag=direct & > Um, I'm not completely sure what you tried to test in the above test. Yeah, it's not a good test case. I've changed it to run the 3 dd tasks in 3 cgroups with equal weights. I've attached the new results (they look the same as the original). > What I wanted to point out is that direct IO is not necessarily less > efficient than buffered IO. Look: > xen-node0:~ # uname -a > Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012 > x86_64 x86_64 x86_64 GNU/Linux > xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync > 1024+0 records in > 1024+0 records out > 1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s > xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync > 1024+0 records in > 1024+0 records out > 1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s > > So both direct and buffered IO are about the same. Note that I used > conv=fsync flag to erase the effect that part of buffered write still > remains in the cache when dd is done writing which is unfair to direct > writer...
OK, I also find direct write being a bit faster than buffered write: root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync 1073741824 bytes (1.1 GB) copied, 10.4039 s, 103 MB/s 1073741824 bytes (1.1 GB) copied, 10.4143 s, 103 MB/s root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync 1073741824 bytes (1.1 GB) copied, 9.9006 s, 108 MB/s 1073741824 bytes (1.1 GB) copied, 9.55173 s, 112 MB/s root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync 1073741824 bytes (1.1 GB) copied, 9.83902 s, 109 MB/s 1073741824 bytes (1.1 GB) copied, 9.61725 s, 112 MB/s > And actually 64k vs 1M makes a big difference on my machine: > xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync > 16384+0 records in > 16384+0 records out > 1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s Interestingly, my 64k direct writes are as fast as 1M direct writes... and 4k writes run at ~1/4 speed: root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=4k count=$((256<<10)) oflag=direct conv=fsync 1073741824 bytes (1.1 GB) copied, 42.0726 s, 25.5 MB/s Thanks, Fengguang [-- Attachment #2: balance_dirty_pages-task-bw.png --] [-- Type: image/png, Size: 61279 bytes --] ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120416145744.GA15437-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-24 11:33 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-24 11:33 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel Hi Vivek, On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote: > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: > > [..] > > Yeah the backpressure idea would work nicely with all possible > > intermediate stacking between the bdi and leaf devices. In my attempt > > to do combined IO bandwidth control for > > > > - buffered writes, in balance_dirty_pages() > > - direct IO, in the cfq IO scheduler > > > > I had to look into the cfq code over the past days to get an idea of how > > the two throttling layers can cooperate (and suffer from the pains > > arising from the violations of layering). It's also rather tricky to get > > two previously independent throttling mechanisms to work seamlessly > > with each other to provide the desired _unified_ user interface. It > > took a lot of reasoning and experiments to work the basic scheme out... > > > > But here is the first result. The attached graph shows progress of 4 > > tasks: > > - cgroup A: 1 direct dd + 1 buffered dd > > - cgroup B: 1 direct dd + 1 buffered dd > > > > The 4 tasks are mostly progressing at the same pace. The top 2 > > smoother lines are for the buffered dirtiers. The bottom 2 lines are > > for the direct writers.
As you may notice, the two direct writers are > > somehow stalled 1-2 times, which increases the gaps between the > > lines. Otherwise, the algorithm is working as expected to distribute > > the bandwidth to each task. > > > > The current code's target is to satisfy the more realistic user demand > > of distributing bandwidth equally to each cgroup, and inside each > > cgroup, distributing bandwidth equally to buffered/direct writes. On top > > of which, weights can be specified to change the default distribution. > > > > The implementation involves adding "weight for direct IO" to the cfq > > groups and "weight for buffered writes" to the root cgroup. Note that > > the current cfq proportional IO controller does not offer explicit control > > over the direct:buffered ratio. > > > > When there are both direct/buffered writers in the cgroup, > > balance_dirty_pages() will kick in and adjust the weights for cfq to > > execute. Note that cfq will continue to send all flusher IOs to the > > root cgroup. balance_dirty_pages() will compute the overall async > > weight for it so that in the above test case, the computed weights > > will be > I think having separate weights for sync IO groups and async IO is not > very appealing. There should be one notion of group weight and bandwidth > distributed among groups according to their weight. There has to be some scheme, either explicit or implicit. Maybe you are bearing in mind some "equal split among queues" policy? For example, if the cgroup has 9 active sync queues and 1 async queue, split the weight equally among the 10 queues? So the sync IOs get 90% share, and the async writes get 10% share. For dirty throttling w/o cgroup awareness, balance_dirty_pages() splits the writeout bandwidth equally among all dirtier tasks. Since cfq works with queues, it seems most natural for it to do equal split among all queues (inside the cgroup).
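The "equal split among queues" policy described above is plain arithmetic; a toy illustration (hypothetical helper, not actual CFQ code):

```python
def split_weight(cgroup_weight, nr_sync_queues, nr_async_queues):
    """Divide a cgroup's weight equally over its active queues, then
    report the aggregate sync vs async shares."""
    nr = nr_sync_queues + nr_async_queues
    per_queue = cgroup_weight / nr
    return per_queue * nr_sync_queues, per_queue * nr_async_queues

# The 9-sync-queues + 1-async-queue example from the mail:
sync_share, async_share = split_weight(1000, nr_sync_queues=9, nr_async_queues=1)
print(sync_share, async_share)  # 900.0 100.0 -> sync IOs get 90%, async 10%
```

The point of the exchange is that CFQ never computes such a static split internally; any ratio like this would have to be imposed from outside, which is what Jan argues against.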
I'm not sure whether, when there are N dd tasks doing direct IO, cfq will continuously run N sync queues for them (without many dynamic queue deletions and recreations). If that is the case, it should be trivial to support the queue-based fair split in the global async queue scheme. Otherwise I'll have some trouble detecting the N value when trying to do the N:1 sync:async weight split. > Now one can argue that within a group, there might be one knob in CFQ > which allows changing the share of sync/async IO. Yeah. I suspect typical users don't care about the split policy or fairness inside the cgroup, otherwise there may be complaints about any existing policy: "I want split this way" "I want that way"... ;-) Anyway I'm not sure about the possible use cases.. > Also Tejun and Jan have expressed the desire that once we have figured > out a way to communicate the submitter's context for async IO, we would > like to account that IO in the associated cgroup instead of the root cgroup (as > we do today). Understood. Accounting should always be attributed to the corresponding cgroup. I'll also need this to send feedback information to the async IO submitters' cgroups. > > - 1000 async weight for the root cgroup (2 buffered dds) > > - 500 dio weight for cgroup A (1 direct dd) > > - 500 dio weight for cgroup B (1 direct dd) > > > > The second graph shows the result for another test case: > > - cgroup A, weight 300: 1 buffered cp > > - cgroup B, weight 600: 1 buffered dd + 1 direct dd > > - cgroup C, weight 300: 1 direct dd > > which is also working as expected. > > > > Once the cfq properly grants total async IO share to the flusher, > > balance_dirty_pages() will then do its original job of distributing > > the buffered write bandwidth among the buffered dd tasks. > > > > It will have to assume that the devices under the same bdi are > > "symmetric". It also needs further stats feedback on IOPS or disk time > > in order to do IOPS/time based IO distribution.
Anyway it would be > > interesting to see how far this scheme can go. I'll clean up the code > > and post it soon. > > Your proposal relies on a few things. > > - Bandwidth needs to be divided equally among sync and async IO. Yeah, balance_dirty_pages() always works on the basis of bandwidth. The plan is that once we get the feedback information on each stream's bandwidth:disk_time (or IOPS) ratio, the bandwidth target can be adjusted to achieve disk-time or IOPS based fair share among the buffered dirtiers. For the sync:async split, it's operating on the cfqg->weight. So it's automatically disk time based. Look at this graph, the 4 dd tasks are granted the same weight (2 of them are buffered writes). I guess the 2 buffered dd tasks managed to progress much faster than the 2 direct dd tasks just because the async IOs are much more efficient than the bs=64k direct IOs. https://github.com/fengguang/io-controller-tests/raw/master/log/bay/xfs/mixed-write-2.2012-04-19-10-42/balance_dirty_pages-task-bw.png > - Flusher thread async IO will always go to the root cgroup. Right. This is actually my main target: to avoid splitting up the async streams throughout the IO path, for the good of performance. > - I am not sure how this scheme is going to work when we introduce > hierarchical blkio cgroups. I think it's still viable. balance_dirty_pages() works by estimating the N (number of dd tasks) value and splitting the writeout bandwidth equally among the tasks: task_ratelimit = write_bandwidth / N It becomes a proportional weight IO controller if we change the formula to task_ratelimit = weight * write_bandwidth / N_w Here lies the beauty of the bdi_update_dirty_ratelimit() algorithm: it can automatically adapt N to the proper "weighted" N_w to keep things in balance, given whatever weights are applied to each task. If we further use blkcg_ratelimit = weight * write_bandwidth / N_w task_ratelimit = weight * blkcg_ratelimit / M_w it's turned into a cgroup IO controller.
This change further makes it a hierarchical IO controller: blkcg_ratelimit = weight * parent_blkcg_ratelimit / M_w We'll also need to hierarchically decompose the async weights from inner cgroup levels to outer levels, and finally add them to the root cgroup that holds the async queue. This looks feasible, too. > - cgroup weights for sync IO seem to be controlled by the user, and > somehow the root cgroup weight seems to be controlled by this async IO > logic silently. In the current state I do assume no IO tasks in the root cgroup except for the flusher. However in general the root cgroup can be treated the same as other cgroups: its weight can also be split up into dio_weight and async weight. The general idea is - cfqg->weight is given by the user - cfqg->dio_weight is used for sync slices in the vdisktime calculation. - total_async_weight collects all async IO weights from each cgroup, including the root cgroup. They are the "credits" for the flusher for doing the async IOs on behalf of all the cgroups. > Overall this sounds like a very odd design to me. I am not sure what we are achieving > by this. In the current scheme one should be able to just adjust the weight > of the root cgroup using the cgroup interface and achieve the same results you > are showing. So where is the need to change it dynamically inside the > kernel? The "dynamically changing weights" are for the in-cgroup equal split between sync/async IOs. It does feel like an arbitrarily added policy.. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
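The ratelimit formulas above compose mechanically; here is a toy calculation putting them together (hypothetical names — in the kernel, bdi_update_dirty_ratelimit() estimates the weighted task count N_w adaptively rather than taking weights as input):

```python
def task_ratelimit_flat(write_bandwidth, weights):
    """Flat proportional split: each task gets
    weight * write_bandwidth / N_w, where N_w is the sum of weights."""
    n_w = sum(weights.values())
    return {task: w * write_bandwidth / n_w for task, w in weights.items()}

def task_ratelimit_hier(write_bandwidth, cgroup_weights, task_weights):
    """Two-level version: split bandwidth among cgroups by weight
    (blkcg_ratelimit), then split each cgroup's ratelimit among its
    tasks by weight (task_ratelimit)."""
    n_w = sum(cgroup_weights.values())
    limits = {}
    for cg, cg_weight in cgroup_weights.items():
        blkcg_ratelimit = cg_weight * write_bandwidth / n_w
        m_w = sum(task_weights[cg].values())
        for task, w in task_weights[cg].items():
            limits[task] = w * blkcg_ratelimit / m_w
    return limits

# Two cgroups of equal weight, one dd each, 100 MB/s assumed total
# writeout bandwidth: each dd ends up limited to 50 MB/s.
print(task_ratelimit_hier(
    100.0,
    {"A": 500, "B": 500},
    {"A": {"ddA": 1}, "B": {"ddB": 1}},
))
```

Deeper hierarchies follow by applying the same split again at each level, which is exactly the blkcg_ratelimit = weight * parent_blkcg_ratelimit / M_w step described above.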
* Re: [RFC] writeback and cgroup 2012-04-14 14:36 ` Fengguang Wu (?) (?) @ 2012-04-16 14:57 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-16 14:57 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote: [..] > Yeah the backpressure idea would work nicely with all possible > intermediate stacking between the bdi and leaf devices. In my attempt > to do combined IO bandwidth control for > > - buffered writes, in balance_dirty_pages() > - direct IO, in the cfq IO scheduler > > I have had to look into the cfq code over the past days to get an idea of how > the two throttling layers can cooperate (and suffer from the pains > arising from the violations of layers). It's also rather tricky to get > two previously independent throttling mechanisms to work seamlessly > with each other to provide the desired _unified_ user interface. It > took a lot of reasoning and experiments to work the basic scheme out... > > But here is the first result. The attached graph shows progress of 4 > tasks: > - cgroup A: 1 direct dd + 1 buffered dd > - cgroup B: 1 direct dd + 1 buffered dd > > The 4 tasks are mostly progressing at the same pace. The top 2 > smoother lines are for the buffered dirtiers. The bottom 2 lines are > for the direct writers. As you may notice, the two direct writers are > somehow stalled 1-2 times, which increases the gaps between the > lines. Otherwise, the algorithm is working as expected to distribute > the bandwidth to each task. 
> > The current code's target is to satisfy the more realistic user demand > of distributing bandwidth equally to each cgroup, and inside each > cgroup, distribute bandwidth equally to buffered/direct writes. On top > of which, weights can be specified to change the default distribution. > > The implementation involves adding "weight for direct IO" to the cfq > groups and "weight for buffered writes" to the root cgroup. Note that > the current cfq proportional IO controller does not offer explicit control > over the direct:buffered ratio. > > When there are both direct/buffered writers in the cgroup, > balance_dirty_pages() will kick in and adjust the weights for cfq to > execute. Note that cfq will continue to send all flusher IOs to the > root cgroup. balance_dirty_pages() will compute the overall async > weight for it so that in the above test case, the computed weights > will be I think having separate weights for sync IO groups and async IO is not very appealing. There should be one notion of group weight and bandwidth distributed among groups according to their weight. Now one can argue that within a group, there might be one knob in CFQ which allows changing the sync/async IO share. Also Tejun and Jan have expressed the desire that once we have figured out a way to communicate the submitter's context for async IO, we would like to account that IO in the associated cgroup instead of the root cgroup (as we do today). > > - 1000 async weight for the root cgroup (2 buffered dds) > - 500 dio weight for cgroup A (1 direct dd) > - 500 dio weight for cgroup B (1 direct dd) > > The second graph shows the result for another test case: > - cgroup A, weight 300: 1 buffered cp > - cgroup B, weight 600: 1 buffered dd + 1 direct dd > - cgroup C, weight 300: 1 direct dd > which is also working as expected. 
> > Once the cfq properly grants total async IO share to the flusher, > balance_dirty_pages() will then do its original job of distributing > the buffered write bandwidth among the buffered dd tasks. > > It will have to assume that the devices under the same bdi are > "symmetric". It also needs further stats feedback on IOPS or disk time > in order to do IOPS/time based IO distribution. Anyway it would be > interesting to see how far this scheme can go. I'll clean up the code > and post it soon. Your proposal relies on a few things. - Bandwidth needs to be divided equally among sync and async IO. - Flusher thread async IO will always go to the root cgroup. - I am not sure how this scheme is going to work when we introduce hierarchical blkio cgroups. - cgroup weights for sync IO seem to be controlled by the user and somehow the root cgroup weight seems to be controlled by this async IO logic silently. Overall this sounds like a very odd design to me. I am not sure what we are achieving by this. In the current scheme one should be able to just adjust the weight of the root cgroup using the cgroup interface and achieve the same results you are showing. So where is the need to change it dynamically inside the kernel? Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-12 20:37 ` Vivek Goyal (?) @ 2012-04-15 11:37 ` Peter Zijlstra -1 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-15 11:37 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, cgroups, linux-fsdevel On Thu, 2012-04-12 at 16:37 -0400, Vivek Goyal wrote: > If yes, how does one map a filesystem's bdi we want to put rules on? > /proc/self/mountinfo has the required bits ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120412203719.GL2207-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-12 20:51 ` Tejun Heo 2012-04-15 11:37 ` [Lsf] " Peter Zijlstra 1 sibling, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-12 20:51 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu Hello, Vivek. On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote: > I mean how are we supposed to put cgroup throttling rules using the cgroup > interface for network filesystems and for the btrfs global bdi. Using the "dev_t" > associated with the bdi? I see that all the bdi's are showing up in > /sys/class/bdi, but how do I know which one I am interested in or which > one belongs to the filesystem I am interested in putting a throttling rule on. > > For block devices, we simply use the "major:min limit" format to write to > a cgroup file and this configuration will sit in one of the per queue > per cgroup data structures. > > I am assuming that when you say throttling should happen at the bdi, you > are thinking of maintaining per cgroup per bdi data structures and the user > is somehow supposed to pass "bdi_maj:bdi_min limit" through cgroup files? > If yes, how does one map a filesystem's bdi we want to put rules on? I think you're worrying way too much. One of the biggest reasons we have layers and abstractions is to avoid worrying about everything from everywhere. Let the block device implement per-device limits. Let writeback work from the backpressure it gets from the relevant IO channel, the bdi-cgroup combination in this case. 
For stacked or combined devices, let the combining layer deal with piping the congestion information. If it's a per-file split, the combined bdi can simply forward information from the matching underlying device. If the file is striped / duplicated somehow, the *only* layer which knows what to do is and should be the layer performing the striping and duplication. There's no need to worry about it from blkcg and if you get the layering correct it isn't difficult to slice such logic in between. In fact, most of it (backpressure propagation) would just happen as part of the usual buffering between layers. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
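The layering point above can be illustrated with a toy model (hypothetical classes, not the kernel's bdi congestion API): the combining layer is the only place that knows how member congestion maps to bdi congestion, so the mapping logic lives there and blkcg never has to see it:

```python
class Member:
    """A backing device with a bounded request queue."""
    def __init__(self, queue_limit):
        self.queue_limit = queue_limit
        self.in_flight = 0

    def congested(self):
        return self.in_flight >= self.queue_limit


class StripedBdi:
    """Combining layer for striped members: every write touches all
    members, so the bdi backs off as soon as any member is congested."""
    def __init__(self, members):
        self.members = members

    def congested(self):
        return any(m.congested() for m in self.members)


class PerFileBdi:
    """Combining layer for a per-file split: forward only the state
    of the member that actually backs the given file."""
    def __init__(self, members, file_to_member):
        self.members = members
        self.file_to_member = file_to_member

    def congested(self, file_id):
        return self.members[self.file_to_member[file_id]].congested()
```

Upper layers only ever ask the combined bdi "are you congested?"; how that answer is derived stays inside the combining layer.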
* Re: [RFC] writeback and cgroup [not found] ` <20120411192231.GF16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> 2012-04-12 20:37 ` Vivek Goyal @ 2012-04-17 22:01 ` Tejun Heo 1 sibling, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-17 22:01 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu, Vivek Goyal Hello, On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote: > > So all the metadata IO will happen through the journaling thread and that > > will be in the root group which should remain unthrottled. So any journal > > IO going to disk should remain unthrottled. > > Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't > have to have the journal thread (as is the case of reiserfs where a random > writer may end up doing the commit) but let's not complicate things > unnecessarily. Why can't journal entries keep track of the originator so that bios can be attributed to the originator while committing? That shouldn't be too difficult to implement, no? > > Now, IIRC, the fsync problem with throttling was that we had opened a > > transaction but could not write it back to disk because we had to > > wait for all the cached data to go to disk (which is throttled). So > > my question is, can't we first wait for all the data to be flushed > > to disk and then open a transaction for metadata. Metadata will be > > unthrottled so the filesystem will not have to do any tricks like bdi is > > congested or not. > > Actually that's what's happening. 
We first do filemap_write_and_wait() > which syncs all the data and then we go and force a transaction commit to > make sure all metadata got to stable storage. The problem is that writeout > of data may need to allocate new blocks and that starts a transaction and > while the transaction is started we may need to do some reads (e.g. of > bitmaps etc.) which may be throttled and at that moment the whole > filesystem is blocked. I don't remember the stack traces you showed me so > I'm not sure if this is what you observed but it's certainly one possible > scenario. The reason why fsync triggers problems is simply that it's the > only place where a process normally does a significant amount of writing. In > most cases the flusher thread / journal thread do it so this effect is not > visible. And to anticipate your question, it would be rather hard to avoid IO > while the transaction is started due to locking. Probably we should mark all IOs issued inside a transaction as META (or whatever tells blkcg to avoid throttling them). We're gonna need overcharging for metadata writes anyway, so I don't think this will make too much of a difference. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
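The META suggestion above boils down to a bypass flag plus overcharging. A minimal sketch (hypothetical flag value and throttler class; the real blk-throttle code is far more involved):

```python
REQ_META = 1 << 0  # hypothetical stand-in for the kernel's META marking


class Throttler:
    """Per-cgroup byte budget for one accounting window."""
    def __init__(self, bps_limit):
        self.bps_limit = bps_limit
        self.dispatched = 0  # bytes already dispatched this window

    def should_throttle(self, nr_bytes, flags=0):
        # IOs issued inside a transaction carry META and bypass the
        # limit, so a throttled cgroup cannot stall the journal...
        if flags & REQ_META:
            self.dispatched += nr_bytes  # ...but overcharge: the bytes
            return False                 # still count against the budget.
        if self.dispatched + nr_bytes > self.bps_limit:
            return True  # over budget: defer this IO
        self.dispatched += nr_bytes
        return False
```

Because META bytes are still charged, a cgroup that does a lot of metadata IO pays for it in later windows instead of blocking the filesystem mid-transaction.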
* Re: [RFC] writeback and cgroup 2012-04-17 22:01 ` Tejun Heo @ 2012-04-18 6:30 ` Jan Kara -1 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-18 6:30 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hello, On Tue 17-04-12 15:01:06, Tejun Heo wrote: > On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote: > > > So all the metadata IO will happen through the journaling thread and that > > > will be in the root group which should remain unthrottled. So any journal > > > IO going to disk should remain unthrottled. > > > > Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't > > have to have the journal thread (as is the case of reiserfs where a random > > writer may end up doing the commit) but let's not complicate things > > unnecessarily. > > Why can't journal entries keep track of the originator so that bios > can be attributed to the originator while committing? That shouldn't > be too difficult to implement, no? I think I was just describing the current state but yes, in the future we can track which cgroup first attached a buffer to a transaction. > > > Now, IIRC, the fsync problem with throttling was that we had opened a > > > transaction but could not write it back to disk because we had to > > > wait for all the cached data to go to disk (which is throttled). So > > > my question is, can't we first wait for all the data to be flushed > > > to disk and then open a transaction for metadata. Metadata will be > > > unthrottled so the filesystem will not have to do any tricks like bdi is > > > congested or not. > > > > Actually that's what's happening. We first do filemap_write_and_wait() > > which syncs all the data and then we go and force a transaction commit to > > make sure all metadata got to stable storage. 
The problem is that writeout > > of data may need to allocate new blocks and that starts a transaction and > > while the transaction is started we may need to do some reads (e.g. of > > bitmaps etc.) which may be throttled and at that moment the whole > > filesystem is blocked. I don't remember the stack traces you showed me so > > I'm not sure if this is what you observed but it's certainly one possible > > scenario. The reason why fsync triggers problems is simply that it's the > > only place where a process normally does a significant amount of writing. In > > most cases the flusher thread / journal thread do it so this effect is not > > visible. And to anticipate your question, it would be rather hard to avoid IO > > while the transaction is started due to locking. > > Probably we should mark all IOs issued inside a transaction as META (or > whatever tells blkcg to avoid throttling them). We're gonna need > overcharging for metadata writes anyway, so I don't think this will > make too much of a difference. Agreed. Honza ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-11 15:40 ` Vivek Goyal (?) @ 2012-04-14 12:25 ` Peter Zijlstra -1 siblings, 0 replies; 261+ messages in thread From: Peter Zijlstra @ 2012-04-14 12:25 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, cgroups, linux-fsdevel On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote: > > Ok, that's good to know. How would we configure this special bdi? I am > assuming there is no backing device visible in /sys/block/<device>/queue/? > Same is true for network file systems. root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done ls: cannot access /sys/class/bdi/0:20/: No such file or directory total 0 drwxr-xr-x 3 root root 0 2012-03-27 23:18 . drwxr-xr-x 35 root root 0 2012-03-27 23:02 .. -rw-r--r-- 1 root root 4096 2012-04-14 14:22 max_ratio -rw-r--r-- 1 root root 4096 2012-04-14 14:22 min_ratio drwxr-xr-x 2 root root 0 2012-04-14 14:22 power -rw-r--r-- 1 root root 4096 2012-04-14 14:22 read_ahead_kb lrwxrwxrwx 1 root root 0 2012-03-27 23:18 subsystem -> ../../../../class/bdi -rw-r--r-- 1 root root 4096 2012-03-27 23:18 uevent ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-14 12:25 ` Peter Zijlstra (?) @ 2012-04-16 12:54 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-16 12:54 UTC (permalink / raw) To: Peter Zijlstra Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote: > On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote: > > > > Ok, that's good to know. How would we configure this special bdi? I am > > assuming there is no backing device visible in /sys/block/<device>/queue/? > > Same is true for network file systems. > > root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done > ls: cannot access /sys/class/bdi/0:20/: No such file or directory > total 0 > drwxr-xr-x 3 root root 0 2012-03-27 23:18 . > drwxr-xr-x 35 root root 0 2012-03-27 23:02 .. > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 max_ratio > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 min_ratio > drwxr-xr-x 2 root root 0 2012-04-14 14:22 power > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 read_ahead_kb > lrwxrwxrwx 1 root root 0 2012-03-27 23:18 subsystem -> ../../../../class/bdi > -rw-r--r-- 1 root root 4096 2012-03-27 23:18 uevent Ok, got it. So /proc/self/mountinfo has the information about st_dev and one can use that to reach the associated bdi. Thanks Peter. Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-16 12:54 ` Vivek Goyal @ 2012-04-16 13:07 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-16 13:07 UTC (permalink / raw) To: Vivek Goyal Cc: Peter Zijlstra, ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups On Mon, Apr 16, 2012 at 08:54:32AM -0400, Vivek Goyal wrote: > On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote: > > On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote: > > > > > > Ok, that's good to know. How would we configure this special bdi? I am > > > assuming there is no backing device visible in /sys/block/<device>/queue/? > > > Same is true for network file systems. > > > > root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done > > ls: cannot access /sys/class/bdi/0:20/: No such file or directory > > total 0 > > drwxr-xr-x 3 root root 0 2012-03-27 23:18 . > > drwxr-xr-x 35 root root 0 2012-03-27 23:02 .. > > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 max_ratio > > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 min_ratio > > drwxr-xr-x 2 root root 0 2012-04-14 14:22 power > > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 read_ahead_kb > > lrwxrwxrwx 1 root root 0 2012-03-27 23:18 subsystem -> ../../../../class/bdi > > -rw-r--r-- 1 root root 4096 2012-03-27 23:18 uevent > > Ok, got it. So /proc/self/mountinfo has the information about st_dev and > one can use that to reach to associated bdi. Thanks Peter. Vivek, I noticed these lines in cfq code sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor); Why not use bdi->dev->devt? The problem is that dev_name() will return "btrfs-X" for btrfs rather than "major:minor". Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-16 13:07 ` Fengguang Wu (?) @ 2012-04-16 14:19 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-16 14:19 UTC (permalink / raw) To: Vivek Goyal Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote: > On Mon, Apr 16, 2012 at 08:54:32AM -0400, Vivek Goyal wrote: > > On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote: > > > On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote: > > > > > > > > Ok, that's good to know. How would we configure this special bdi? I am > > > > assuming there is no backing device visible in /sys/block/<device>/queue/? > > > > Same is true for network file systems. > > > > > > root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done > > > ls: cannot access /sys/class/bdi/0:20/: No such file or directory > > > total 0 > > > drwxr-xr-x 3 root root 0 2012-03-27 23:18 . > > > drwxr-xr-x 35 root root 0 2012-03-27 23:02 .. > > > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 max_ratio > > > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 min_ratio > > > drwxr-xr-x 2 root root 0 2012-04-14 14:22 power > > > -rw-r--r-- 1 root root 4096 2012-04-14 14:22 read_ahead_kb > > > lrwxrwxrwx 1 root root 0 2012-03-27 23:18 subsystem -> ../../../../class/bdi > > > -rw-r--r-- 1 root root 4096 2012-03-27 23:18 uevent > > > > Ok, got it. So /proc/self/mountinfo has the information about st_dev and > > one can use that to reach to associated bdi. Thanks Peter. 
> > Vivek, I noticed these lines in cfq code > > sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor); > > Why not use bdi->dev->devt? The problem is that dev_name() will > return "btrfs-X" for btrfs rather than "major:minor". Sorry, it's not that simple. btrfs reports its fake btrfs_fs_info.bdi to the upper layer, which is different from the bdis for btrfs_fs_info.fs_devices.devices seen by cfq. It's the fake btrfs bdi that is named "btrfs-X" by this function: setup_bdi(): bdi_setup_and_register(bdi, "btrfs", BDI_CAP_MAP_COPY); This does make it hard to interpret btrfs mountinfo, where you cannot directly get the block device major/minor numbers: 35 16 0:26 / /fs/sda3 rw,relatime - btrfs /dev/sda3 rw,noacl Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-16 13:07 ` Fengguang Wu ` (2 preceding siblings ...) (?) @ 2012-04-16 15:52 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-16 15:52 UTC (permalink / raw) To: Fengguang Wu Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote: [..] > Vivek, I noticed these lines in cfq code > > sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor); > > Why not use bdi->dev->devt? The problem is that dev_name() will > return "btrfs-X" for btrfs rather than "major:minor". Isn't bdi->dev->devt 0? I see the following code. add_disk() bdi_register_dev() bdi_register() device_create_vargs(MKDEV(0,0)) dev->devt = devt = MKDEV(0,0); So for normal block devices, I think bdi->dev->devt will be zero, and that's probably why we don't use it. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [Lsf] [RFC] writeback and cgroup 2012-04-16 15:52 ` Vivek Goyal (?) @ 2012-04-17 2:14 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-17 2:14 UTC (permalink / raw) To: Vivek Goyal Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA On Mon, Apr 16, 2012 at 11:52:07AM -0400, Vivek Goyal wrote: > On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote: > > [..] > > Vivek, I noticed these lines in cfq code > > > > sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor); > > > > Why not use bdi->dev->devt? The problem is that dev_name() will > > return "btrfs-X" for btrfs rather than "major:minor". > > Isn't bdi->dev->devt 0? I see following code. > > add_disk() > bdi_register_dev() > bdi_register() > device_create_vargs(MKDEV(0,0)) > dev->devt = devt = MKDEV(0,0); > > So for normal block devices, I think bdi->dev->devt will be zero, that's > why probably we don't use it. Yes indeed. I can confirm this with tracing. There are two main cases: - some filesystems do not have a real device for the bdi; - add_disk() calls bdi_register_dev() with the devt; however, this information is not passed down for some reason. device_create_vargs() will try to create a sysfs dev file if the devt is not MKDEV(0,0). Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
[parent not found: <20120403183655.GA23106-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>]
* Re: [RFC] writeback and cgroup [not found] ` <20120403183655.GA23106-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org> @ 2012-04-04 14:51 ` Vivek Goyal 2012-04-04 17:51 ` Fengguang Wu 1 sibling, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-04 14:51 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: Hi Tejun, Thanks for the RFC and for looking into this issue. A few thoughts inline. [..] > IIUC, without cgroup, the current writeback code works more or less > like this. Throwing in cgroup doesn't really change the fundamental > design. Instead of a single pipe going down, we just have multiple > pipes to the same device, each of which should be treated separately. > Of course, a spinning disk can't be divided that easily and their > performance characteristics will be inter-dependent, but the place to > solve that problem is where the problem is, the block layer. How do you take care of throttling IO in the NFS case in this model? The current throttling logic is tied to a block device, and in the case of NFS there is no block device. [..] > In the discussion, for such implementation, the following obstacles > were identified. > > * There are a lot of cases where IOs are issued by a task which isn't > the originiator. ie. Writeback issues IOs for pages which are > dirtied by some other tasks. So, by the time an IO reaches the > block layer, we don't know which cgroup the IO belongs to. 
> > Recently, block layer has grown support to attach a task to a bio > which causes the bio to be handled as if it were issued by the > associated task regardless of the actual issuing task. It currently > only allows attaching %current to a bio - bio_associate_current() - > but changing it to support other tasks is trivial. > > We'll need to update the async issuers to tag the IOs they issue but > the mechanism is already there. Most likely this tagging will take place in "struct page", and I am not sure we will be allowed to grow the size of "struct page" for this reason. > > * There's a single request pool shared by all issuers per a request > queue. This can lead to priority inversion among cgroups. Note > that problem also exists without cgroups. Lower ioprio issuer may > be holding a request holding back highprio issuer. > > We'll need to make request allocation cgroup (and hopefully ioprio) > aware. Probably in the form of separate request pools. This will > take some work but I don't think this will be too challenging. I'll > work on it. This should be doable. I had implemented it long back with a single request pool but internal limits for each group: that is, block a task in the group if the group has enough pending requests allocated from the pool. But separate request pools should work equally well. Just that it conflicts a bit with the current definition of q->nr_requests, which specifies the total number of outstanding requests on the queue. Once you make the pools per-group, I guess this limit will have to be transformed into a per-group upper limit. > > * cfq cgroup policy throws all async IOs, which all buffered writes > are, into the shared cgroup regardless of the actual cgroup. This > behavior is, I believe, mostly historical and changing it isn't > difficult. Prolly only few tens of lines of changes. This may > cause significant changes to actual IO behavior with cgroups tho. 
I > personally think the previous behavior was too wrong to keep (the > weight was completely ignored for buffered writes) but we may want > to introduce a switch to toggle between the two behaviors. I had kept all buffered writes in the same cgroup (the root cgroup) for a few reasons.

- Because of the single request descriptor pool for writes, one writer gets backlogged behind another anyway, so creating separate async queues per group is not going to help.

- The writeback logic was not cgroup aware, so it might not send enough IO from each writer to maintain parallelism. Creating separate async queues did not make sense till that was fixed.

- As you said, it is also historical. We prioritize READS at the expense of writes. By putting buffered/async writes in a separate group, we might end up prioritizing one group's async writes over another group's synchronous reads. How many people really want that behavior?

To me, keeping service differentiation among the sync IO matters most. Even if all async IO is treated the same, I guess not many people would care. > > Note that blk-throttle doesn't have this problem. I am not sure what you are trying to say here, but primarily blk-throttle will throttle reads and direct IO. Buffered writes will go to the root cgroup, which is typically unthrottled. > > * Unlike dirty data pages, metadata tends to have strict ordering > requirements and thus is susceptible to priority inversion. Two > solutions were suggested - 1. allow overdrawl for metadata writes so > that low prio metadata writes don't block the whole FS, 2. provide > an interface to query and wait for bdi-cgroup congestion which can > be called from FS metadata paths to throttle metadata operations > before they enter the stream of ordered operations. So that will probably also mean changing the order of operations. IIUC, in the case of fsync (ordered mode), we open a metadata transaction first, then try to flush all the cached data and then flush the metadata. 
So if fsync is throttled, all the metadata operations behind it will get serialized for ext3/ext4. You seem to be suggesting that we change the design so that a metadata operation is not thrown into the ordered stream till we have finished writing all the data back to disk? I am not a filesystem developer, so I don't know how feasible this change is. This is just one of the points. In the past, while talking to Dave Chinner, he mentioned that in XFS, if two cgroups fall into the same allocation group, there were cases where the IO of one cgroup can get serialized behind the other's. In general, the core of the issue is that filesystems are not cgroup aware, and if you do throttling below filesystems, then invariably one or another serialization issue will come up, and I am concerned that we will be constantly fixing those serialization issues. Or the design point could be so central to filesystem design that it can't be changed. In general, if you do throttling deeper in the stack and build back pressure, then all the layers sitting above should be cgroup aware to avoid problems. Two layers identified so far are writeback and filesystems. Is it really worth the complexity? How about doing throttling in the higher layers, where IO enters the kernel, while keeping the proportional IO logic at the lowest level so the current mechanism of building pressure continues to work? Why split them this way? Proportional IO logic is work conserving, so even if some serialization happens, that situation should clear up pretty soon: IO from the other cgroups will dry up, IO from the group causing the serialization will make progress, and at most we will lose fairness for a certain duration. With throttling, the limits come from the user, and one can set really low artificial limits. So even if the underlying resources are free, the IO from a throttled cgroup might not make any progress, in turn choking every other cgroup serialized behind it. So, in general, throttling at the block layer and building back pressure is fine. 
I am concerned about two cases. - How to handle NFS. - Do filesystem developers agree with this approach and are they willing to address any serialization issues arising due to this design. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-03 18:36 ` Tejun Heo (?) @ 2012-04-04 17:51 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-04 17:51 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA Hi Tejun, On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > Hello, guys. > > So, during LSF, I, Fengguang and Jan had a chance to sit down and talk > about how to cgroup support to writeback. Here's what I got from it. > > Fengguang's opinion is that the throttling algorithm implemented in > writeback is good enough and blkcg parameters can be exposed to > writeback such that those limits can be applied from writeback. As > for reads and direct IOs, Fengguang opined that the algorithm can > easily be extended to cover those cases and IIUC all IOs, whether > buffered writes, reads or direct IOs can eventually all go through > writeback layer which will be the one layer controlling all IOs. Yeah, it should be trivial to apply the balance_dirty_pages() throttling algorithm to the read/direct IOs. However, up to now I don't see much added value in *duplicating* the current block IO controller's functionality, assuming the current users and developers are happy with it. I did the buffered write IO controller mainly to fill the gap. If I happen to stand in your way, sorry, that's not my intention. It's a pity and a surprise that Google, as a big user, does not buy into this simple solution. You may prefer more comprehensive controls, which may not be easily achievable with the simple scheme. 
However, the complexities and overheads involved in throttling the flusher IOs really upset me. The sweet split point would be for balance_dirty_pages() to do cgroup aware buffered write throttling and leave the other IOs to the current blkcg. For this to work well as a total solution for end users, I hope we can cooperate and figure out ways for the two throttling entities to work well with each other. What I'm interested in is: what are Google's and other users' usage schemes in practice? What are their desired interfaces? And whether and how can the combined bdp+blkcg throttling fulfill those goals? > Unfortunately, I don't agree with that at all. I think it's a gross > layering violation and lacks any longterm design. We have a well > working model of applying and propagating resource pressure - we apply > the pressure where the resource exists and propagates the back > pressure through buffers to upper layers upto the originator. Think > about network, the pressure exists or is applied at the in/egress > points which gets propagated through socket buffers and eventually > throttles the originator. > > Writeback, without cgroup, isn't different. It consists a part of the > pressure propagation chain anchored at the IO device. IO devices > these days generate very high pressure, which gets propgated through > the IO sched and buffered requests, which in turn creates pressure at > writeback. Here, the buffering happens in page cache and pressure at > writeback increases the amount of dirty page cache. Propagating this > IO pressure to the dirtying task is one of the biggest > responsibililties of the writeback code, and this is the underlying > design of the whole thing. > > IIUC, without cgroup, the current writeback code works more or less > like this. Throwing in cgroup doesn't really change the fundamental > design. Instead of a single pipe going down, we just have multiple > pipes to the same device, each of which should be treated separately. 
> Of course, a spinning disk can't be divided that easily and their > performance characteristics will be inter-dependent, but the place to > solve that problem is where the problem is, the block layer. > > We may have to look for optimizations and expose some details to > improve the overall behavior and such optimizations may require some > deviation from the fundamental design, but such optimizations should > be justified and such deviations kept at minimum, so, no, I don't > think we're gonna be expose blkcg / block / elevator parameters > directly to writeback. Unless someone can *really* convince me > otherwise, I'll be vetoing any change toward that direction. > > Let's please keep the layering clear. IO limitations will be applied > at the block layer and pressure will be formed there and then > propagated upwards eventually to the originator. Sure, exposing the > whole information might result in better behavior for certain > workloads, but down the road, say, in three or five years, devices > which can be shared without worrying too much about seeks might be > commonplace and we could be swearing at a disgusting structural mess, > and sadly various cgroup support seems to be a prominent source of > such design failures. Super fast storage is coming, which will make us regret making the IO path overly complex. Yet spinning disks are not going away anytime soon. I doubt Google is willing to afford the disk seek costs on its millions of disks and has the patience to wait until all of the spinning disks are switched to SSDs years later (if that ever happens). Sorry, I won't buy into the layering argument and the analogy to networking. Yeah, the network is a good way to show your "push back" idea; however, writeback has its own metadata, seeking, etc. problems. I'd prefer we base our discussions on real things like complexity, overhead and performance, as well as user demands. It's obvious that your proposal below involves a lot of complexity and overhead, and will hurt performance. 
It basically involves

- running concurrent flusher threads for cgroups, which adds back the disk seeks and lock contention, and still has problems with sync and shared inodes;

- splitting the device queue for cgroups, possibly scaling up the pool of writeback pages (and locked pages in the case of stable pages), which could stall random processes in the system;

- the mess of metadata handling;

- unnecessary coupling with memcg, in order to take advantage of the per-memcg dirty limits for balance_dirty_pages() to actually convert the "pushed back" dirty page pressure into a lowered dirty rate.

Why the hell do users *have to* set up memcg (suffering all its inconvenience and overheads) in order to do IO throttling? Please, this is really ugly! And the "back pressure" may constantly push the memcg dirty pages to their limits. I'm not going to support *misuse* of the per-memcg dirty limits like this! I cannot believe you would keep overlooking all these problems without good reasons. Please do tell us the reasons that matter. Thanks, Fengguang > IMHO, treating cgroup - device/bdi pair as a separate device should > suffice as the underlying design. After all, blkio cgroup support's > ultimate goal is dividing the IO resource into separate bins. > Implementation details might change as underlying technology changes > and we learn more about how to do it better but that is the goal which > we'll always try to keep close to. Writeback should (be able to) > treat them as separate devices. We surely will need adjustments and > optimizations to make things work at least somewhat reasonably but > that is the baseline. > > In the discussion, for such implementation, the following obstacles > were identified. > > * There are a lot of cases where IOs are issued by a task which isn't > the originiator. ie. Writeback issues IOs for pages which are > dirtied by some other tasks. So, by the time an IO reaches the > block layer, we don't know which cgroup the IO belongs to. 
> > Recently, block layer has grown support to attach a task to a bio > which causes the bio to be handled as if it were issued by the > associated task regardless of the actual issuing task. It currently > only allows attaching %current to a bio - bio_associate_current() - > but changing it to support other tasks is trivial. > > We'll need to update the async issuers to tag the IOs they issue but > the mechanism is already there. > > * There's a single request pool shared by all issuers per a request > queue. This can lead to priority inversion among cgroups. Note > that problem also exists without cgroups. Lower ioprio issuer may > be holding a request holding back highprio issuer. > > We'll need to make request allocation cgroup (and hopefully ioprio) > aware. Probably in the form of separate request pools. This will > take some work but I don't think this will be too challenging. I'll > work on it. > > * cfq cgroup policy throws all async IOs, which all buffered writes > are, into the shared cgroup regardless of the actual cgroup. This > behavior is, I believe, mostly historical and changing it isn't > difficult. Prolly only few tens of lines of changes. This may > cause significant changes to actual IO behavior with cgroups tho. I > personally think the previous behavior was too wrong to keep (the > weight was completely ignored for buffered writes) but we may want > to introduce a switch to toggle between the two behaviors. > > Note that blk-throttle doesn't have this problem. > > * Unlike dirty data pages, metadata tends to have strict ordering > requirements and thus is susceptible to priority inversion. Two > solutions were suggested - 1. allow overdrawl for metadata writes so > that low prio metadata writes don't block the whole FS, 2. provide > an interface to query and wait for bdi-cgroup congestion which can > be called from FS metadata paths to throttle metadata operations > before they enter the stream of ordered operations. 
> > I think combination of the above two should be enough for solving > the problem. I *think* the second can be implemented as part of > cgroup aware request allocation update. The first one needs a bit > more thinking but there can be easier interim solutions (e.g. throw > META writes to the head of the cgroup queue or just plain ignore > cgroup limits for META writes) for now. > > * I'm sure there are a lot of design choices to be made in the > writeback implementation but IIUC Jan seems to agree that the > simplest would be simply deal different cgroup-bdi pairs as > completely separate which shouldn't add too much complexity to the > already intricate writeback code. > > So, I think we have something which sounds like a plan, which at least > I can agree with and seems doable without adding a lot of complexity. > > Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's > side and IIUC Fengguang doesn't agree with this approach too much, so > please voice your opinions & comments. > > Thank you. > > -- > tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-04 17:51 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-04 17:51 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hi Tejun, On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote: > Hello, guys. > > So, during LSF, I, Fengguang and Jan had a chance to sit down and talk > about how to cgroup support to writeback. Here's what I got from it. > > Fengguang's opinion is that the throttling algorithm implemented in > writeback is good enough and blkcg parameters can be exposed to > writeback such that those limits can be applied from writeback. As > for reads and direct IOs, Fengguang opined that the algorithm can > easily be extended to cover those cases and IIUC all IOs, whether > buffered writes, reads or direct IOs can eventually all go through > writeback layer which will be the one layer controlling all IOs. Yeah it should be trivial to apply the balance_dirty_pages() throttling algorithm to the read/direct IOs. However up to now I don't see much added value to *duplicate* the current block IO controller functionalities, assuming the current users and developers are happy with it. I did the buffered write IO controller mainly to fill the gap. If I happen to stand in your way, sorry that's not my initial intention. It's a pity and surprise that Google as a big user does not buy in this simple solution. You may prefer more comprehensive controls which may not be easily achievable with the simple scheme. However the complexities and overheads involved in throttling the flusher IOs really upsets me. The sweet split point would be for balance_dirty_pages() to do cgroup aware buffered write throttling and leave other IOs to the current blkcg. 
For this to work well as a total solution for end users, I hope we can cooperate and figure out ways for the two throttling entities to work well with each other. What I'm interested in is: what are Google's and other users' usage schemes in practice? What are their desired interfaces? Whether and how can the combined bdp+blkcg throttling fulfill those goals? > Unfortunately, I don't agree with that at all. I think it's a gross > layering violation and lacks any longterm design. We have a well > working model of applying and propagating resource pressure - we apply > the pressure where the resource exists and propagate the back > pressure through buffers to upper layers up to the originator. Think > about network, the pressure exists or is applied at the in/egress > points which gets propagated through socket buffers and eventually > throttles the originator. > > Writeback, without cgroup, isn't different. It constitutes a part of the > pressure propagation chain anchored at the IO device. IO devices > these days generate very high pressure, which gets propagated through > the IO sched and buffered requests, which in turn creates pressure at > writeback. Here, the buffering happens in page cache and pressure at > writeback increases the amount of dirty page cache. Propagating this > IO pressure to the dirtying task is one of the biggest > responsibilities of the writeback code, and this is the underlying > design of the whole thing. > > IIUC, without cgroup, the current writeback code works more or less > like this. Throwing in cgroup doesn't really change the fundamental > design. Instead of a single pipe going down, we just have multiple > pipes to the same device, each of which should be treated separately. > Of course, a spinning disk can't be divided that easily and their > performance characteristics will be inter-dependent, but the place to > solve that problem is where the problem is, the block layer.
> > We may have to look for optimizations and expose some details to > improve the overall behavior and such optimizations may require some > deviation from the fundamental design, but such optimizations should > be justified and such deviations kept at a minimum, so, no, I don't > think we're gonna expose blkcg / block / elevator parameters > directly to writeback. Unless someone can *really* convince me > otherwise, I'll be vetoing any change toward that direction. > > Let's please keep the layering clear. IO limitations will be applied > at the block layer and pressure will be formed there and then > propagated upwards eventually to the originator. Sure, exposing the > whole information might result in better behavior for certain > workloads, but down the road, say, in three or five years, devices > which can be shared without worrying too much about seeks might be > commonplace and we could be swearing at a disgusting structural mess, > and sadly various cgroup support seems to be a prominent source of > such design failures. Super fast storage is coming, which will make us regret over-complicating the IO path. Spinning disks are not going away anytime soon, either. I doubt Google is willing to afford the disk seek costs on its millions of disks and has the patience to wait until all of the spinning disks are switched to SSDs years later (if that will ever happen). Sorry, I won't buy into the layering arguments and the analogy to networking. Yeah, networking is a good way to show your "push back" idea; however writeback has its own metadata, seeking, etc. problems. I'd prefer we base our discussions on real things like complexities, overheads, performance as well as user demands. It's obvious that your below proposal involves a lot of complexities and overheads, and will hurt performance. It basically involves:

- running concurrent flusher threads for cgroups, which adds back the disk seeks and lock contention, and still has problems with sync and shared inodes
- splitting the device queue for cgroups, possibly scaling up the pool of writeback pages (and locked pages in the case of stable pages), which could stall random processes in the system

- the mess of metadata handling

- being unnecessarily coupled with memcg, in order to take advantage of the per-memcg dirty limits for balance_dirty_pages() to actually convert the "pushed back" dirty page pressure into a lowered dirty rate

Why the hell do users *have to* set up memcg (suffering all the inconvenience and overheads) in order to do IO throttling? Please, this is really ugly! And the "back pressure" may constantly push the memcg dirty pages to their limits. I'm not going to support *misuse* of per-memcg dirty limits like this! I cannot believe you would keep overlooking all the problems without good reasons. Please do tell us the reasons that matter. Thanks, Fengguang > IMHO, treating cgroup - device/bdi pair as a separate device should > suffice as the underlying design. After all, blkio cgroup support's > ultimate goal is dividing the IO resource into separate bins. > Implementation details might change as underlying technology changes > and we learn more about how to do it better but that is the goal which > we'll always try to keep close to. Writeback should (be able to) > treat them as separate devices. We surely will need adjustments and > optimizations to make things work at least somewhat reasonably but > that is the baseline. > > In the discussion, for such implementation, the following obstacles > were identified. > > * There are a lot of cases where IOs are issued by a task which isn't > the originator, ie. writeback issues IOs for pages which are > dirtied by some other tasks. So, by the time an IO reaches the > block layer, we don't know which cgroup the IO belongs to. > > Recently, the block layer has grown support to attach a task to a bio > which causes the bio to be handled as if it were issued by the > associated task regardless of the actual issuing task.
It currently > only allows attaching %current to a bio - bio_associate_current() - > but changing it to support other tasks is trivial. > > We'll need to update the async issuers to tag the IOs they issue but > the mechanism is already there. > > * There's a single request pool shared by all issuers per request > queue. This can lead to priority inversion among cgroups. Note > that the problem also exists without cgroups: a lower-ioprio issuer may > be holding a request, holding back a higher-prio issuer. > > We'll need to make request allocation cgroup (and hopefully ioprio) > aware. Probably in the form of separate request pools. This will > take some work but I don't think this will be too challenging. I'll > work on it. > > * The cfq cgroup policy throws all async IOs, which all buffered writes > are, into the shared cgroup regardless of the actual cgroup. This > behavior is, I believe, mostly historical and changing it isn't > difficult. Probably only a few tens of lines of changes. This may > cause significant changes to actual IO behavior with cgroups though. I > personally think the previous behavior was too wrong to keep (the > weight was completely ignored for buffered writes) but we may want > to introduce a switch to toggle between the two behaviors. > > Note that blk-throttle doesn't have this problem. > > * Unlike dirty data pages, metadata tends to have strict ordering > requirements and thus is susceptible to priority inversion. Two > solutions were suggested - 1. allow overdraw for metadata writes so > that low-prio metadata writes don't block the whole FS, 2. provide > an interface to query and wait for bdi-cgroup congestion which can > be called from FS metadata paths to throttle metadata operations > before they enter the stream of ordered operations. > > I think a combination of the above two should be enough for solving > the problem. I *think* the second can be implemented as part of the > cgroup aware request allocation update.
The first one needs a bit > more thinking but there can be easier interim solutions (e.g. throw > META writes to the head of the cgroup queue or just plain ignore > cgroup limits for META writes) for now. > > * I'm sure there are a lot of design choices to be made in the > writeback implementation but IIUC Jan seems to agree that the > simplest would be to simply treat different cgroup-bdi pairs as > completely separate, which shouldn't add too much complexity to the > already intricate writeback code. > > So, I think we have something which sounds like a plan, which at least > I can agree with and seems doable without adding a lot of complexity. > > Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's > side and IIUC Fengguang doesn't agree with this approach too much, so > please voice your opinions & comments. > > Thank you. > > -- > tejun
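The balance_dirty_pages() throttling discussed throughout this thread pauses a dirtying task for roughly the number of pages it dirtied divided by its allowed dirty rate. A minimal sketch of that pause computation (simplified: the real kernel derives task_ratelimit from dirty-position and bandwidth feedback, and the MAX_PAUSE cap here is an assumed value, not the kernel's):

```python
# Simplified sketch of the balance_dirty_pages() pause computation.
# task_ratelimit would come from the kernel's dirty position/rate
# feedback; here it is just a given number of pages per second.

MAX_PAUSE = 0.2   # cap on each sleep, in seconds (assumed value)

def compute_pause(pages_dirtied, task_ratelimit):
    """Return seconds to sleep after dirtying `pages_dirtied` pages."""
    if task_ratelimit <= 0:
        return MAX_PAUSE            # over the dirty limit: sleep the maximum
    pause = pages_dirtied / task_ratelimit
    return min(pause, MAX_PAUSE)
```

A task allowed 1000 pages/s that just dirtied 100 pages would sleep 0.1s; the cap keeps any single sleep short so the feedback loop stays responsive.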
* Re: [RFC] writeback and cgroup 2012-04-04 17:51 ` Fengguang Wu @ 2012-04-04 18:35 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-04 18:35 UTC (permalink / raw) To: Fengguang Wu Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: [..] > The sweet split point would be for balance_dirty_pages() to do cgroup > aware buffered write throttling and leave other IOs to the current > blkcg. For this to work well as a total solution for end users, I hope > we can cooperate and figure out ways for the two throttling entities > to work well with each other. Throttling read + direct IO higher up has a few issues too. Users will not like that a task gets blocked when it tries to submit a read from a throttled group. The current async behavior works well: we queue up the bio from the task in the throttled group and let the task do other things. The same is true for AIO, where we would not like to block in bio submission. Thanks Vivek
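The async queueing behavior Vivek describes can be pictured as a per-group token bucket: an over-budget bio is parked on the group's queue instead of blocking the submitter, and a timer dispatches it once budget accrues. A toy model of that behavior (the class and numbers are illustrative, not the actual blk-throttle code):

```python
from collections import deque

class ThrottledGroup:
    """Toy model: bios over the group's bps budget are queued, not blocked."""
    def __init__(self, bps):
        self.bps = bps          # allowed bytes per second
        self.queued = deque()   # bio sizes waiting for budget
        self.budget = 0         # bytes of budget accumulated so far

    def submit_bio(self, nbytes):
        # The submitting task never blocks here; an over-budget bio
        # is parked on the group's queue and the task moves on.
        if not self.queued and nbytes <= self.budget:
            self.budget -= nbytes
            return "dispatched"
        self.queued.append(nbytes)
        return "queued"

    def timer_tick(self, seconds):
        # Called periodically: refill budget, dispatch what now fits.
        # Returns how many queued bios were released this tick.
        self.budget += int(self.bps * seconds)
        done = 0
        while self.queued and self.queued[0] <= self.budget:
            self.budget -= self.queued.popleft()
            done += 1
        return done
```

The key property is the asymmetry Vivek points out: submission is non-blocking, and the waiting happens in the group's queue rather than in the task.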
* Re: [RFC] writeback and cgroup 2012-04-04 18:35 ` Vivek Goyal (?) @ 2012-04-04 21:42 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-04 21:42 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote: > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > > [..] > > The sweet split point would be for balance_dirty_pages() to do cgroup > > aware buffered write throttling and leave other IOs to the current > > blkcg. For this to work well as a total solution for end users, I hope > > we can cooperate and figure out ways for the two throttling entities > > to work well with each other. > > Throttling read + direct IO, higher up has few issues too. Users will Yeah I worry a bit about high-layer throttling, too. Anyway here are the ideas. > not like that a task got blocked as it tried to submit a read from a > throttled group. That's not the same issue I worried about :) Throttling is about inserting small sleeps/waits at selected points. For reads, the ideal sleep point is immediately after the readahead IO is submitted, at the end of __do_page_cache_readahead(). The same should be applicable to direct IO. > Current async behavior works well where we queue up the > bio from the task in throttled group and let task do other things. Same > is true for AIO where we would not like to block in bio submission.
For AIO, we'll need to delay the IO completion notification or status update, which may involve computing some delay time and delaying the calls to io_complete() with the help of some delayed work queue. There may be more issues to deal with as I didn't look into aio.c carefully. The thing that worries me is that in the proportional throttling case, the high level throttling works on the *estimated* task_ratelimit = disk_bandwidth / N, where N is the number of read IO tasks. When N suddenly changes from 2 to 1, it may take 1 second for the estimated task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth, during which time the disk won't get 100% utilized because of the temporary over-throttling of the remaining IO task. This is not a problem when throttling at the block/cfq layer, since it has the full information about pending requests and should not depend on such estimations. The workaround I can think of is to put the throttled task into a wait queue, and let the block layer wake up the waiters when the IO queue runs empty. This should be able to avoid most disk idle time. Thanks, Fengguang
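The adaptation lag Fengguang worries about can be made concrete with a toy model: if the estimated per-task ratelimit converges toward disk_bandwidth/N by exponential smoothing (an assumed estimator, purely illustrative, not the actual writeback bandwidth estimation), the surviving task stays over-throttled for several update periods after N drops from 2 to 1:

```python
def simulate(disk_bw=100.0, alpha=0.5, steps=8):
    """Track achievable utilization as the ratelimit estimate catches up.

    Starts from the N=2 steady state (each task limited to disk_bw/2),
    then N drops to 1 and the estimate converges to the full disk_bw.
    alpha is an assumed smoothing factor; this only illustrates the lag.
    """
    est = disk_bw / 2              # steady-state per-task limit with N=2
    utilization = []
    for _ in range(steps):         # now N=1: the target is the full disk_bw
        utilization.append(est / disk_bw)
        est += alpha * (disk_bw - est)   # estimate slowly adapts upward
    return utilization
```

Right after the transition the lone task is limited to half the disk bandwidth, and utilization climbs back toward 100% only as fast as the estimator allows, which is exactly the idle-time window the wait-queue workaround tries to close.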
* Re: [RFC] writeback and cgroup 2012-04-04 21:42 ` Fengguang Wu @ 2012-04-05 15:10 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-05 15:10 UTC (permalink / raw) To: Fengguang Wu Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote: > On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote: > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > > > > [..] > > > The sweet split point would be for balance_dirty_pages() to do cgroup > > > aware buffered write throttling and leave other IOs to the current > > > blkcg. For this to work well as a total solution for end users, I hope > > > we can cooperate and figure out ways for the two throttling entities > > > to work well with each other. > > > > Throttling read + direct IO, higher up has few issues too. Users will > > Yeah I worry a bit about high-layer throttling, too. > Anyway here are the ideas. > > > not like that a task got blocked as it tried to submit a read from a > > throttled group. > > That's not the same issue I worried about :) Throttling is about > inserting small sleeps/waits at selected points. For reads, the ideal > sleep point is immediately after the readahead IO is submitted, at the end > of __do_page_cache_readahead(). The same should be applicable to > direct IO. But after a read the process might want to process the read data and do something else altogether, so throttling the process after completing the read is not the best thing. > > > Current async behavior works well where we queue up the > > bio from the task in throttled group and let task do other things. Same > > is true for AIO where we would not like to block in bio submission.
> > For AIO, we'll need to delay the IO completion notification or status > update, which may involve computing some delay time and delay the > calls to io_complete() with the help of some delayed work queue. There > may be more issues to deal with as I didn't look into aio.c carefully. I don't know, but delaying completion notifications sounds odd to me. So you don't throttle while submitting requests? That does not help with pressure on the request queue, as a process can dump a whole bunch of IO without waiting for completion. What I like better is that AIO is allowed to submit a bunch of IO till it hits the nr_requests limit on the request queue and is then blocked because the request queue is too busy and not enough request descriptors are free. > > The thing worried me is that in the proportional throttling case, the > high level throttling works on the *estimated* task_ratelimit = > disk_bandwidth / N, where N is the number of read IO tasks. When N > suddenly changes from 2 to 1, it may take 1 second for the estimated > task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth, > during which time the disk won't get 100% utilized because of the > temporary over-throttling of the remaining IO task. I thought we were only considering the case of absolute throttling in higher layers. Proportional IO will continue to be in CFQ. I don't think we need to push proportional IO into higher layers. > > This is not a problem when throttling at the block/cfq layer, since it > has the full information of pending requests and should not depend on > such estimations. CFQ does not even look at pending requests. It just maintains a bunch of IO queues and selects one queue to dispatch IO from based on its weight. So proportional IO comes very naturally to CFQ. > > The workaround I can think of, is to put the throttled task into a wait > queue, and let block layer wake up the waiters when the IO queue runs > empty. This should be able to avoid most disk idle time.
Again, I am not convinced that proportional IO should go in higher layers. For fast devices we are already suffering from queue locking overhead, and Jens seems to have patches for multi-queue. By trying to implement something at a higher layer, that locking overhead will show up there too and we will end up doing something similar to multi-queue there, which is not desirable. Thanks Vivek
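Vivek's point that proportional IO "comes very naturally" to CFQ is that fairness falls out of weighted queue selection: serve whichever per-group queue is furthest behind in weighted virtual time. A minimal weighted-fair dispatch sketch (this mimics the effect of CFQ's weight-based selection, not its actual vdisktime implementation):

```python
def dispatch(queues, rounds):
    """Weighted-fair dispatch among always-backlogged queues.

    queues: dict mapping queue name -> weight.
    Returns per-queue dispatch counts after `rounds` dispatches.
    """
    vtime = {q: 0.0 for q in queues}    # virtual time consumed per queue
    served = {q: 0 for q in queues}
    for _ in range(rounds):
        q = min(vtime, key=vtime.get)   # pick the queue furthest behind
        served[q] += 1
        vtime[q] += 1.0 / queues[q]     # heavier weight -> slower vtime growth
    return served
```

With weights 2:1 and both queues backlogged, the service counts converge to a 2:1 split, with no knowledge of pending request counts needed, which is the property Vivek describes.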
* Re: [RFC] writeback and cgroup [not found] ` <20120405151026.GB23999-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-06 0:32 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-06 0:32 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel Vivek, I totally agree that direct IOs can be best handled in the block/cfq layers. On Thu, Apr 05, 2012 at 11:10:26AM -0400, Vivek Goyal wrote: > On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote: > > On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote: > > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > > > > > > [..] > > > > The sweet split point would be for balance_dirty_pages() to do cgroup > > > > aware buffered write throttling and leave other IOs to the current > > > > blkcg. For this to work well as a total solution for end users, I hope > > > > we can cooperate and figure out ways for the two throttling entities > > > > to work well with each other. > > > > > > Throttling read + direct IO, higher up has few issues too. Users will > > > > Yeah I worry a bit about high-layer throttling, too. > > Anyway here are the ideas. > > > > > not like that a task got blocked as it tried to submit a read from a > > > throttled group. > > > > That's not the same issue I worried about :) Throttling is about > > inserting small sleeps/waits at selected points. For reads, the ideal > > sleep point is immediately after the readahead IO is submitted, at the end > > of __do_page_cache_readahead(). The same should be applicable to > > direct IO.
> > But after a read the process might want to process the read data and > do something else altogether. So throttling the process after completing > the read is not the best thing. __do_page_cache_readahead() returns immediately after queuing the read IOs. It may block occasionally on metadata IO but not on data IO. > > > Current async behavior works well where we queue up the > > > bio from the task in throttled group and let task do other things. Same > > > is true for AIO where we would not like to block in bio submission. > > > > For AIO, we'll need to delay the IO completion notification or status > > update, which may involve computing some delay time and delay the > > calls to io_complete() with the help of some delayed work queue. There > > may be more issues to deal with as I didn't look into aio.c carefully. > > I don't know but delaying completion notifications sounds odd to me. So > you don't throttle while submitting requests. That does not help with > pressure on request queue as process can dump whole bunch of IO without > waiting for completion? > > What I like better is that AIO is allowed to submit a bunch of IO till it > hits the nr_requests limit on request queue and then it is blocked as > request queue is too busy and not enough request descriptors are free. You are right. Throttling direct IO and AIO in a high layer has the problems of added delays and less queue fullness. I suspect it may also lead to extra cfq anticipatory idling and disk idle time. And it won't be able to deal with ioprio. All in all there are lots of problems, actually. > > The thing worried me is that in the proportional throttling case, the > > high level throttling works on the *estimated* task_ratelimit = > > disk_bandwidth / N, where N is the number of read IO tasks.
When N > > suddenly changes from 2 to 1, it may take 1 second for the estimated > > task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth, > > during which time the disk won't get 100% utilized because of the > > temporally over-throttling of the remaining IO task. > > I thought we were only considering the case of absolute throttling in > higher layers. Proportional IO will continue to be in CFQ. I don't think > we need to push proportional IO in higher layers. Agreed for direct IO. As for buffered writes, I'm seriously considering the possibility of doing proportional IO control in balance_dirty_pages(). I'd take this as the central problem of this thread. If the CFQ proportional IO controller can do its work well for direct IOs and leave the buffered writes to the balance_dirty_pages() proportional IO controller, it would result in a simple and efficient "feedback" system (comparing to the "push back" idea). I don't really know about any real use cases. However it seems to me (and perhaps Jan Kara) the most user friendly and manageable IO controller interfaces would allow the user to divide disk time (no matter it's used for reads or writes, direct or buffered IOs) among the cgroups. Then allow each cgroup to further split up disk time (or bps/iops) to different types of IO. For simplicity, let's assume only direct/buffered writes are happening and the user configures 3 blkio cgroups A, B, C with equal split of disk time and equal direct:buffered splits inside each cgroup. In the case of A: 1 direct write dd + 1 buffered write dd B: 1 direct write dd C: 1 buffered write dd The dd tasks should ideally be throttled to A.direct: 1/6 disk time A.buffered: 1/6 disk time B.direct: 1/3 disk time C.buffered: 1/3 disk time So is it possible for the proportional block IO controller to throttle direct IOs to A.direct: 1/6 disk time B.direct: 1/3 disk time and leave the remaining 1/2 disk time to buffered writes from the flusher thread? 
Then I promise that balance_dirty_pages() will be able to throttle the buffered writes to: A.buffered: 1/6 disk time C.buffered: 1/3 disk time thanks to the fact that the balance_dirty_pages() throttling algorithm is pretty adaptive. It will be able to work well with the blkio throttling to achieve the throttling goals. In the above case, equal split of disk time == equal split of write bandwidth since all cgroups run the same type of workload. balance_dirty_pages() will be able to work in that cooperative way after adding some direct IO rate accounting. In order to deal with mixed random/sequential workloads, balance_dirty_pages() will also need some disk time stats feedback. It will then throttle the dirtiers so that the disk time goals are matched in long run. > > This is not a problem when throttling at the block/cfq layer, since it > > has the full information of pending requests and should not depend on > > such estimations. > > CFQ does not even look at pending requests. It just maintains bunch > of IO queues and selects one queue to dispatch IO from based on its > weight. So proportional IO comes very naturally to CFQ. Sure. Nice work! > > > > The workaround I can think of, is to put the throttled task into a wait > > queue, and let block layer wake up the waiters when the IO queue runs > > empty. This should be able to avoid most disk idle time. > > Again, I am not convinced that proportional IO should go in higher layers. > > For fast devices we are already suffering from queue locking overhead and > Jens seems to have patches for multi queue. Now by trying to implement > something at higher layer, that locking overhead will show up there too > and we will end up doing something similar to multi queue there and it > is not desirable. Sure, yeah it's a hack. I was not really happy with it. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
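The disk-time arithmetic in the A/B/C example above can be made concrete. This is a purely illustrative sketch (not kernel code, and the function name is made up for this note): each cgroup gets an equal share of disk time, and each cgroup's share is split equally among the IO types present in it, reproducing the 1/6 and 1/3 figures from the message.

```python
# Illustrative sketch of the proposed disk-time split: equal shares per
# blkio cgroup, then an equal direct:buffered split inside each cgroup.

def split_disk_time(cgroups):
    """cgroups maps name -> list of IO types present ('direct'/'buffered').
    Returns {(cgroup, io_type): fraction of total disk time}."""
    per_cgroup = 1.0 / len(cgroups)
    goals = {}
    for name, io_types in cgroups.items():
        for io_type in io_types:
            goals[(name, io_type)] = per_cgroup / len(io_types)
    return goals

goals = split_disk_time({
    "A": ["direct", "buffered"],   # 1 direct dd + 1 buffered dd
    "B": ["direct"],               # 1 direct dd
    "C": ["buffered"],             # 1 buffered dd
})

# The block-layer controller would see only the direct portions
# (A: 1/6, B: 1/3); balance_dirty_pages() would be left the remaining
# share of disk time to divide among the buffered writers (A: 1/6, C: 1/3).
flusher_share = sum(f for (name, typ), f in goals.items() if typ == "buffered")
```

Under this toy accounting, the flusher's aggregate share comes out to exactly the 1/2 of disk time that the message asks the block IO controller to leave over for buffered writeback.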
* Re: [RFC] writeback and cgroup
  2012-04-04 21:42 ` Fengguang Wu
@ 2012-04-05 15:10 ` Vivek Goyal
  -1 siblings, 0 replies; 261+ messages in thread
From: Vivek Goyal @ 2012-04-05 15:10 UTC (permalink / raw)
To: Fengguang Wu
Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote:
> On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> >
> > [..]
> > > The sweet split point would be for balance_dirty_pages() to do cgroup
> > > aware buffered write throttling and leave other IOs to the current
> > > blkcg. For this to work well as a total solution for end users, I hope
> > > we can cooperate and figure out ways for the two throttling entities
> > > to work well with each other.
> >
> > Throttling read + direct IO higher up has a few issues too. Users will
>
> Yeah I have a bit of worry about high-layer throttling, too.
> Anyway, here are the ideas.
>
> > not like that a task got blocked as it tried to submit a read from a
> > throttled group.
>
> That's not the same issue I worried about :) Throttling is about
> inserting small sleeps/waits at selected points. For reads, the ideal
> sleep point is immediately after the readahead IO is submitted, at the
> end of __do_page_cache_readahead(). The same should be applicable to
> direct IO.

But after a read the process might want to process the read data and
do something else altogether. So throttling the process after completing
the read is not the best thing.

> > Current async behavior works well where we queue up the
> > bio from the task in the throttled group and let the task do other
> > things. The same is true for AIO, where we would not like to block in
> > bio submission.
>
> For AIO, we'll need to delay the IO completion notification or status
> update, which may involve computing some delay time and delaying the
> calls to io_complete() with the help of some delayed work queue. There
> may be more issues to deal with, as I didn't look into aio.c carefully.

I don't know, but delaying completion notifications sounds odd to me. So
you don't throttle while submitting requests? That does not help with
pressure on the request queue, as a process can dump a whole bunch of IO
without waiting for completion.

What I like better is that AIO is allowed to submit a bunch of IO till
it hits the nr_requests limit on the request queue and is then blocked,
as the request queue is too busy and not enough request descriptors are
free.

> The thing that worried me is that in the proportional throttling case,
> the high-level throttling works on the *estimated* task_ratelimit =
> disk_bandwidth / N, where N is the number of read IO tasks. When N
> suddenly changes from 2 to 1, it may take 1 second for the estimated
> task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
> during which time the disk won't get 100% utilized because of the
> temporary over-throttling of the remaining IO task.

I thought we were only considering the case of absolute throttling in
higher layers. Proportional IO will continue to be in CFQ. I don't
think we need to push proportional IO into higher layers.

> This is not a problem when throttling at the block/cfq layer, since it
> has the full information of pending requests and should not depend on
> such estimations.

CFQ does not even look at pending requests. It just maintains a bunch
of IO queues and selects one queue to dispatch IO from based on its
weight. So proportional IO comes very naturally to CFQ.

> The workaround I can think of is to put the throttled task into a wait
> queue, and let the block layer wake up the waiters when the IO queue
> runs empty. This should be able to avoid most disk idle time.

Again, I am not convinced that proportional IO should go into higher
layers.

For fast devices we are already suffering from queue locking overhead,
and Jens seems to have patches for multi queue. Now, by trying to
implement something at a higher layer, that locking overhead will show
up there too and we will end up doing something similar to multi queue
there, which is not desirable.

Thanks
Vivek
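The adaptation lag Fengguang describes above (the estimated task_ratelimit = disk_bandwidth / N lagging behind when N drops) can be sketched with a toy model. The smoothing constant and step count here are arbitrary assumptions chosen only to show the shape of the problem, not how balance_dirty_pages() actually computes its estimates:

```python
# Toy model of the *estimated* per-task ratelimit disk_bandwidth / N.
# Two tasks each get disk_bandwidth/2; one exits; the estimate adapts
# only gradually, so the surviving task stays over-throttled for a
# while and the disk is underutilized during the transition.

def simulate_ratelimit(disk_bw=100.0, steps=10, smoothing=0.5):
    est = disk_bw / 2               # estimate while N == 2
    utilization = []
    for _ in range(steps):          # N has just dropped to 1
        est += smoothing * (disk_bw - est)   # move toward disk_bw / 1
        utilization.append(est / disk_bw)
    return utilization

util = simulate_ratelimit()
# Utilization starts well below 100% and climbs as the estimate adapts.
```

The point of the toy model is only the transient: a block-layer scheduler with full knowledge of pending requests has no such estimation lag, which is Fengguang's stated reason the problem does not exist when throttling in block/cfq.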
* Re: [RFC] writeback and cgroup
  2012-04-04 17:51 ` Fengguang Wu
@ 2012-04-04 18:35 ` Vivek Goyal
  -1 siblings, 0 replies; 261+ messages in thread
From: Vivek Goyal @ 2012-04-04 18:35 UTC (permalink / raw)
To: Fengguang Wu
Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:

[..]
> The sweet split point would be for balance_dirty_pages() to do cgroup
> aware buffered write throttling and leave other IOs to the current
> blkcg. For this to work well as a total solution for end users, I hope
> we can cooperate and figure out ways for the two throttling entities
> to work well with each other.

Throttling read + direct IO higher up has a few issues too. Users will
not like that a task got blocked as it tried to submit a read from a
throttled group.

Current async behavior works well, where we queue up the bio from the
task in the throttled group and let the task do other things. The same
is true for AIO, where we would not like to block in bio submission.

Thanks
Vivek
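Vivek's distinction between blocking the submitter and queueing the bio inside the throttled group can be contrasted in a toy model. Everything here is illustrative only (the class and method names are invented for this sketch, not block-layer APIs):

```python
from collections import deque

class ThrottledGroup:
    """Toy throttled group: async submission never blocks the task;
    bios accumulate in the group's queue and are released at the
    group's budgeted rate by a separate dispatcher."""

    def __init__(self, bios_per_round):
        self.bios_per_round = bios_per_round
        self.queue = deque()

    def submit_async(self, bio):
        # The submitting task returns immediately and can do other work;
        # a submitter-side throttle would instead put it to sleep here.
        self.queue.append(bio)

    def dispatch_round(self):
        # Called by the dispatcher: release only the budgeted number.
        n = min(self.bios_per_round, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]

g = ThrottledGroup(bios_per_round=2)
for i in range(5):
    g.submit_async(f"bio{i}")      # none of these block the task
first_round = g.dispatch_round()   # only the budgeted bios go down
```

The throttling still happens (bios leave the group no faster than the budget allows), but the cost lands on the bio, not on the submitting task, which is the property Vivek wants preserved for reads and AIO.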
* Re: [RFC] writeback and cgroup 2012-04-04 17:51 ` Fengguang Wu (?) @ 2012-04-04 19:33 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-04 19:33 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA Hey, Fengguang. On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > Yeah it should be trivial to apply the balance_dirty_pages() > throttling algorithm to the read/direct IOs. However up to now I don't > see much added value to *duplicate* the current block IO controller > functionalities, assuming the current users and developers are happy > with it. Heh, trust me. It's half broken and people ain't happy. I get that your algorithm can be updatd to consider all IOs and I believe that but what I don't get is how would such information get to writeback and in turn how writeback would enforce the result on reads and direct IOs. Through what path? Will all reads and direct IOs travel through balance_dirty_pages() even direct IOs on raw block devices? Or would the writeback algorithm take the configuration from cfq, apply the algorithm and give back the limits to enforce to cfq? If the latter, isn't that at least somewhat messed up? > I did the buffered write IO controller mainly to fill the gap. If I > happen to stand in your way, sorry that's not my initial intention. No, no, it's not about standing in my way. 
As Vivek said in the other reply, it's that the "gap" that you filled was created *because* writeback wasn't cgroup aware and now you're in turn filling that gap by making writeback work around that "gap". I mean, my mind boggles. Doesn't yours? I strongly believe everyone's should. > It's a pity and surprise that Google as a big user does not buy in > this simple solution. You may prefer more comprehensive controls which > may not be easily achievable with the simple scheme. However the > complexities and overheads involved in throttling the flusher IOs > really upsets me. Heh, believe it or not, I'm not really wearing google hat on this subject and google's writeback people may have completely different opinions on the subject than mine. In fact, I'm not even sure how much "work" time I'll be able to assign to this. :( > The sweet split point would be for balance_dirty_pages() to do cgroup > aware buffered write throttling and leave other IOs to the current > blkcg. For this to work well as a total solution for end users, I hope > we can cooperate and figure out ways for the two throttling entities > to work well with each other. There's where I'm confused. How is the said split supposed to work? They aren't independent. I mean, who gets to decide what and where are those decisions enforced? > What I'm interested is, what's Google and other users' use schemes in > practice. What's their desired interfaces. Whether and how the > combined bdp+blkcg throttling can fulfill the goals. I'm not too privy of mm and writeback in google and even if so I probably shouldn't talk too much about it. Confidentiality and all. That said, I have the general feeling that goog already figured out how to at least work around the existing implementation and would be able to continue no matter how upstream development fans out. That said, wearing the cgroup maintainer and general kernel contributor hat, I'd really like to avoid another design mess up. 
> > Let's please keep the layering clear.  IO limitations will be applied
> > at the block layer and pressure will be formed there and then
> > propagated upwards eventually to the originator.  Sure, exposing the
> > whole information might result in better behavior for certain
> > workloads, but down the road, say, in three or five years, devices
> > which can be shared without worrying too much about seeks might be
> > commonplace and we could be swearing at a disgusting structural mess,
> > and sadly various cgroup support seems to be a prominent source of
> > such design failures.
>
> Super fast storages are coming which will make us regret making the
> IO path overly complex.  Spinning disks are not going away anytime
> soon.  I doubt Google is willing to afford the disk seek costs on its
> millions of disks and has the patience to wait until switching all of
> the spinning disks to SSD years later (if that ever happens).

This is new.  Let's keep the damn employer out of the discussion.
While the area I work on is affected by my employment (writeback isn't
even my area, BTW), I'm not gonna do something adverse to upstream even
if it's beneficial to google, and I'm much more likely to do something
which may hurt google a bit if it's gonna benefit upstream.

As for the faster / newer storage argument, that is *exactly* why we
want to keep the layering proper.  Writeback works from the pressure
from the IO stack.  If IO technology changes, we update the IO stack
and writeback still works from the pressure.  It may need to be
adjusted but the principles don't change.

> It's obvious that your below proposal involves a lot of complexities,
> overheads, and will hurt performance.  It basically involves

Hmmm... that's not the impression I got from the discussion.
According to Jan, applying the current writeback logic to cgroup'fied
bdis shouldn't be too complex, no?

> - running concurrent flusher threads for cgroups, which adds back the
>   disk seeks and lock contentions.  And it still has problems with
>   sync and shared inodes.

I agree this is an actual concern, but if the user wants to split one
spindle into multiple resource domains, there's gonna be a considerable
amount of overhead no matter what.  If you want to improve how the
block layer handles the split, you're welcome to dive into the block
layer, where the split is made, and improve it.

> - splitting the device queue for cgroups, possibly scaling up the pool
>   of writeback pages (and locked pages in the case of stable pages),
>   which could stall random processes in the system

Sure, it'll take up more buffering and memory, but that's the overhead
of the cgroup business.  I want it to be less intrusive at the cost of
somewhat more resource consumption; i.e. I don't want the writeback
logic itself deeply involved in block IO cgroup enforcement even if
that means somewhat less efficient resource usage.

> - the mess of metadata handling

Does throttling from writeback actually solve this problem?  What
about fsync()?  Does that already go through balance_dirty_pages()?

> - unnecessarily coupled with memcg, in order to take advantage of the
>   per-memcg dirty limits for balance_dirty_pages() to actually convert
>   the "pushed back" dirty page pressure into a lowered dirty rate.
>   Why the hell do the users *have to* set up memcg (suffering from all
>   the inconvenience and overheads) in order to do IO throttling?
>   Please, this is really ugly!  And the "back pressure" may constantly
>   push the memcg dirty pages to the limits.  I'm not going to support
>   *misuse* of per-memcg dirty limits like this!

Writeback sits between blkcg and memcg, and it indeed can be hairy to
consider both sides, especially given the current sorry complex state
of cgroup, and I can see why it would seem tempting to add a separate
controller or at least knobs to support that.  That said, I *think*
that given that memcg controls all other memory parameters, it probably
would make the most sense giving that parameter to memcg too.  I don't
think this is really relevant to this discussion, though.  Who owns the
dirty limits is a separate issue.

> I cannot believe you would keep overlooking all the problems without
> good reasons.  Please do tell us the reasons that matter.

Well, I tried and I hope some of it got through.  I also wrote a lot
of questions, mainly regarding how what you have in mind is supposed
to work and through what path.  Maybe I'm just not seeing what you're
seeing, but I just can't see where all the IOs would go through and
come together.  Can you please elaborate more on that?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 261+ messages in thread
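The "treat each cgroup-bdi pair as a separate pipe" position argued in this message can be illustrated with a toy calculation: split the global dirty threshold between cgroups in proportion to their blkcg weights, and throttle a dirtier only against its own group's share. This is a hedged sketch of the proportional-split idea only, not kernel code; the names `group_dirty_thresh()` and `should_throttle()` are hypothetical.

```c
#include <assert.h>

/*
 * Toy model of per-cgroup back pressure: the global dirty threshold is
 * divided between cgroups in proportion to their blkcg weight, so the
 * pressure on one group's "pipe" throttles only that group's dirtiers.
 * Hypothetical names; not actual kernel code.
 */
static unsigned long group_dirty_thresh(unsigned long global_thresh,
					unsigned int group_weight,
					unsigned int total_weight)
{
	if (!total_weight)
		return 0;
	return (unsigned long)((unsigned long long)global_thresh *
			       group_weight / total_weight);
}

/* A dirtier is throttled once its group exceeds its own share. */
static int should_throttle(unsigned long group_dirty,
			   unsigned long group_thresh)
{
	return group_dirty > group_thresh;
}
```

For example, with a 1000-page global threshold and weights 500/250/250, the groups get 500/250/250 pages each, and a task dirtying pages in the first group is paused only when that group crosses 500 dirty pages, regardless of what the other groups are doing.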
* Re: [RFC] writeback and cgroup
  2012-04-06  9:59 ` Fengguang Wu
  1 sibling, 0 replies; 261+ messages in thread
From: Fengguang Wu @ 2012-04-06  9:59 UTC (permalink / raw)
To: Tejun Heo
Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
    jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
    containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:
> Hey, Fengguang.
>
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > Yeah it should be trivial to apply the balance_dirty_pages()
> > throttling algorithm to the read/direct IOs. However up to now I
> > don't see much added value to *duplicate* the current block IO
> > controller functionalities, assuming the current users and
> > developers are happy with it.
>
> Heh, trust me.  It's half broken and people ain't happy.  I get that

Yeah, although the balance_dirty_pages() IO controller for buffered
writes looks perfect in itself, it's not enough to meet user demands.

The user expectation should be: hey, please throttle *all* IOs from
this cgroup to this amount, either in absolute bps/iops limits or in
some proportional weight value (or both, with whichever is lower taking
effect).  And if necessary, he may request further limits/weights for
each type of IO inside the cgroup.

Now the blkio cgroup supports direct IO and the balance_dirty_pages()
IO controller supports buffered writes.  They provide limits/weights
for either direct IO or buffered writes, which is fine if the workload
is pure direct IO or pure buffered writes.  For the common mixed IO
workloads, it's obviously not enough.

Fortunately, the above gap can be easily filled judging from the
block/cfq IO controller code.  By adding some direct IO accounting and
changing several lines of my patches to make use of the collected
stats, the semantics of the blkio.throttle.write_bps interface can be
changed from "limit for direct IO" to "limit for direct+buffered IOs".
Ditto for blkio.weight and blkio.write_iops, as long as some
iops/device time stats are made available to balance_dirty_pages().

It would be a fairly *easy* change. :-) It's merely adding some
accounting code; there is no need to change the block IO controlling
algorithm at all.  I'll do the work of accounting (which is basically
independent of the IO controlling) and use the new stats in
balance_dirty_pages().

The only problem I can see now is that balance_dirty_pages() works
per-bdi while blkcg works per-device, so the two ends may not match
nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc, where
sdb is shared by lv0 and lv1.  However, that should be a rare
situation, and much more acceptable than the problems arising from the
"push back" approach, which impacts everyone.

> your algorithm can be updated to consider all IOs and I believe that
> but what I don't get is how would such information get to writeback
> and in turn how writeback would enforce the result on reads and
> direct IOs.  Through what path?  Will all reads and direct IOs travel
> through balance_dirty_pages() even direct IOs on raw block devices?
> Or would the writeback algorithm take the configuration from cfq,
> apply the algorithm and give back the limits to enforce to cfq?  If
> the latter, isn't that at least somewhat messed up?

cfq is working well and doesn't need any modifications.  Let's just
make balance_dirty_pages() cgroup aware and fill the gap left by the
current block IO controller.

If the balance_dirty_pages() throttling algorithms are ever applied to
read and direct IOs, it would be for NFS, CIFS etc.  Even for them,
there may be better throttling choices.  For example, Trond mentioned
the RPC layer to me during the summit.
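The direct IO accounting described above could look roughly like the following: accumulate per-cgroup direct IO bytes at the submission (or completion) point and periodically derive a bps figure for balance_dirty_pages() to consume. This is a hedged sketch under invented names (`iocg_direct_stats`, `iocg_account_direct()`, `iocg_direct_bps()`), not the actual patch.

```c
#include <assert.h>

/*
 * Hypothetical per-cgroup direct IO accounting: a byte counter bumped
 * on the IO submission path, sampled periodically to yield a rate that
 * balance_dirty_pages() could subtract from the total bps limit.
 */
struct iocg_direct_stats {
	unsigned long long bytes;	/* direct IO bytes accounted so far */
	unsigned long long last_bytes;	/* snapshot taken at last sample */
};

/* called from the direct IO submission (or completion) path */
static void iocg_account_direct(struct iocg_direct_stats *st,
				unsigned long bytes)
{
	st->bytes += bytes;
}

/* called periodically; returns bytes/sec over the elapsed window */
static unsigned long long iocg_direct_bps(struct iocg_direct_stats *st,
					  unsigned int elapsed_ms)
{
	unsigned long long delta = st->bytes - st->last_bytes;

	st->last_bytes = st->bytes;
	return elapsed_ms ? delta * 1000 / elapsed_ms : 0;
}
```

The same counter structure would extend naturally to iops (count requests instead of bytes) for the blkio.write_iops case mentioned above.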
> > I did the buffered write IO controller mainly to fill the gap. If I
> > happen to stand in your way, sorry that's not my initial intention.
>
> No, no, it's not about standing in my way.  As Vivek said in the
> other reply, it's that the "gap" that you filled was created *because*
> writeback wasn't cgroup aware and now you're in turn filling that gap
> by making writeback work around that "gap".  I mean, my mind boggles.
> Doesn't yours?  I strongly believe everyone's should.

Heh.  It's a hard problem indeed.  I felt great pains in the IO-less
dirty throttling work.  I did a lot of reasoning about it, and have in
fact kept the cgroup IO controller in mind since its early days.  Now
I'd say it comes down naturally to adapting to the gap between the
total IO limit and what's carried out by the block IO controller.

> > It's a pity and surprise that Google as a big user does not buy into
> > this simple solution. You may prefer more comprehensive controls
> > which may not be easily achievable with the simple scheme. However
> > the complexities and overheads involved in throttling the flusher
> > IOs really upset me.
>
> Heh, believe it or not, I'm not really wearing the google hat on this
> subject and google's writeback people may have completely different
> opinions on the subject than mine.  In fact, I'm not even sure how
> much "work" time I'll be able to assign to this. :(

OK, understood.

> > The sweet split point would be for balance_dirty_pages() to do
> > cgroup aware buffered write throttling and leave other IOs to the
> > current blkcg. For this to work well as a total solution for end
> > users, I hope we can cooperate and figure out ways for the two
> > throttling entities to work well with each other.
>
> That's where I'm confused.  How is the said split supposed to work?
> They aren't independent.  I mean, who gets to decide what, and where
> are those decisions enforced?

Yeah, it's not independent.  It's about:

- keeping the block IO cgroup untouched (in its current algorithm, for
  throttling direct IO)

- letting balance_dirty_pages() adapt to the throttling target

      buffered_write_limit = total_limit - direct_IOs

> > What I'm interested in is, what are Google's and other users' usage
> > schemes in practice, what are their desired interfaces, and whether
> > and how the combined bdp+blkcg throttling can fulfill the goals.
>
> I'm not too privy to mm and writeback in google and even if so I
> probably shouldn't talk too much about it.  Confidentiality and all.
> That said, I have the general feeling that goog already figured out
> how to at least work around the existing implementation and would be
> able to continue no matter how upstream development fans out.
>
> That said, wearing the cgroup maintainer and general kernel
> contributor hat, I'd really like to avoid another design mess up.

To me it looks like a pretty clean split, and I find it to be an easy
solution (after sorting it out the hard way).  I'll show the code and
test results after some time.

> > > Let's please keep the layering clear.  IO limitations will be
> > > applied at the block layer and pressure will be formed there and
> > > then propagated upwards eventually to the originator.  Sure,
> > > exposing the whole information might result in better behavior for
> > > certain workloads, but down the road, say, in three or five years,
> > > devices which can be shared without worrying too much about seeks
> > > might be commonplace and we could be swearing at a disgusting
> > > structural mess, and sadly various cgroup support seems to be a
> > > prominent source of such design failures.
> >
> > Super fast storages are coming which will make us regret making the
> > IO path overly complex.  Spinning disks are not going away anytime
> > soon.
> > I doubt Google is willing to afford the disk seek costs on its
> > millions of disks and has the patience to wait until switching all
> > of the spinning disks to SSD years later (if that ever happens).
>
> This is new.  Let's keep the damn employer out of the discussion.
> While the area I work on is affected by my employment (writeback
> isn't even my area, BTW), I'm not gonna do something adverse to
> upstream even if it's beneficial to google, and I'm much more likely
> to do something which may hurt google a bit if it's gonna benefit
> upstream.
>
> As for the faster / newer storage argument, that is *exactly* why we
> want to keep the layering proper.  Writeback works from the pressure
> from the IO stack.  If IO technology changes, we update the IO stack
> and writeback still works from the pressure.  It may need to be
> adjusted but the principles don't change.

To me, balance_dirty_pages() is *the* proper layer for buffered
writes.  It's always there doing 1:1 proportional throttling.  Then
you kick in to add *double* throttling in the block/cfq layer.  Now
the low layer may enforce 10:1 throttling and push
balance_dirty_pages() away from its balanced state, leading to large
fluctuations and program stalls.

This can be avoided by telling balance_dirty_pages(): "your balance
goal is no longer 1:1, but 10:1".  With this information
balance_dirty_pages() will behave right.  Then there is the question:
if balance_dirty_pages() will work just as well provided that
information, why bother doing the throttling at the low layer and
"pushing back" the pressure all the way up?

> > It's obvious that your below proposal involves a lot of
> > complexities, overheads, and will hurt performance. It basically
> > involves
>
> Hmmm... that's not the impression I got from the discussion.
> According to Jan, applying the current writeback logic to cgroup'fied
> bdis shouldn't be too complex, no?

In the sense of "avoidable" complexity :-)

> > - running concurrent flusher threads for cgroups, which adds back
> >   the disk seeks and lock contentions. And still has problems with
> >   sync and shared inodes.
>
> I agree this is an actual concern but if the user wants to split one
> spindle into multiple resource domains, there's gonna be a
> considerable amount of overhead no matter what.  If you want to
> improve how the block layer handles the split, you're welcome to dive
> into the block layer, where the split is made, and improve it.
>
> > - splitting the device queue for cgroups, possibly scaling up the
> >   pool of writeback pages (and locked pages in the case of stable
> >   pages), which could stall random processes in the system
>
> Sure, it'll take up more buffering and memory but that's the overhead
> of the cgroup business.  I want it to be less intrusive at the cost
> of somewhat more resource consumption.  i.e. I don't want the
> writeback logic itself deeply involved in block IO cgroup enforcement
> even if that means somewhat less efficient resource usage.

balance_dirty_pages() is already deeply involved in dirty throttling.
As you can see from this patchset, the same algorithms can be extended
trivially to work with cgroup IO limits:

  buffered write IO controller in balance_dirty_pages()
  https://lkml.org/lkml/2012/3/28/275

It does not require forking off the flusher threads or splitting up
the IO queue at all.

> > - the mess of metadata handling
>
> Does throttling from writeback actually solve this problem?  What
> about fsync()?  Does that already go through balance_dirty_pages()?

balance_dirty_pages() does throttling at safe points, outside of fs
transactions/locks.  fsync() only submits IO for already dirtied pages
and won't be throttled by balance_dirty_pages(); throttling happens
earlier, when the task is dirtying the pages.
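The "throttling at safe points" mechanism boils down to computing a sleep long enough that a task's effective dirty rate matches its assigned ratelimit. A minimal sketch of that idea follows; the real kernel computation adds ramp-up, per-task weighting, and pause clamping, and `compute_pause_jiffies()` is a simplified, hypothetical name.

```c
#include <assert.h>

/* Assume 1000 scheduler ticks per second for this sketch. */
#define HZ 1000

/*
 * Simplified IO-less throttling pause: a task that just dirtied
 * pages_dirtied pages sleeps long enough that its average dirty rate
 * comes out to task_ratelimit pages/sec.  No overflow handling or
 * clamping; illustration only.
 */
static long compute_pause_jiffies(unsigned long pages_dirtied,
				  unsigned long task_ratelimit /* pages/s */)
{
	if (!task_ratelimit)
		return HZ;	/* fully throttled: cap at one second */
	return (long)(HZ * pages_dirtied / task_ratelimit);
}
```

For instance, a task allowed 64 pages/sec that dirtied 32 pages would sleep half a second, keeping its long-run rate at the limit without issuing any IO itself.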
> > - unnecessarily coupled with memcg, in order to take advantage of
> >   the per-memcg dirty limits for balance_dirty_pages() to actually
> >   convert the "pushed back" dirty page pressure into a lowered dirty
> >   rate. Why the hell do the users *have to* set up memcg (suffering
> >   from all the inconvenience and overheads) in order to do IO
> >   throttling? Please, this is really ugly! And the "back pressure"
> >   may constantly push the memcg dirty pages to the limits. I'm not
> >   going to support *misuse* of per-memcg dirty limits like this!
>
> Writeback sits between blkcg and memcg and it indeed can be hairy to
> consider both sides especially given the current sorry complex state
> of cgroup and I can see why it would seem tempting to add a separate
> controller or at least knobs to support that.  That said, I *think*
> given that memcg controls all other memory parameters it probably
> would make most sense giving that parameter to memcg too.  I don't
> think this is really relevant to this discussion tho.  Who owns
> dirty_limits is a separate issue.

In the "back pressure" scheme, memcg is a must, because only it has
all the infrastructure to track dirty pages, upon which you can apply
some dirty limits.  Don't tell me you want to account dirty pages in
blkcg...

> > I cannot believe you would keep overlooking all the problems without
> > good reasons. Please do tell us the reasons that matter.
>
> Well, I tried and I hope some of it got through.  I also wrote a lot
> of questions, mainly regarding how what you have in mind is supposed
> to work through what path.  Maybe I'm just not seeing what you're
> seeing but I just can't see where all the IOs would go through and
> come together.  Can you please elaborate more on that?

What I can see is, it looks pretty simple and natural to let
balance_dirty_pages() fill the gap towards a total solution :-)

- add direct IO accounting at some convenient point in the IO path
  (the submission or completion point; either is fine)

- change several lines of the buffered write IO controller to
  integrate the direct IO rate into the formula to fit the "total IO"
  limit

- in the future, add more accounting as well as feedback control to
  make balance_dirty_pages() work with IOPS and disk time

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 261+ messages in thread
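The second step in the list above, folding the measured direct IO rate into the buffered-write target per the buffered_write_limit = total_limit - direct_IOs formula quoted earlier in the thread, can be sketched in a few lines. This is a hedged illustration; `buffered_write_limit()` is a hypothetical helper, not a function in the actual patchset.

```c
#include <assert.h>

/*
 * Hedged sketch of the "total IO" split: the cgroup's buffered-write
 * budget is whatever remains of the total bps limit after the measured
 * direct IO rate is subtracted, clamped at zero when direct IO alone
 * saturates the limit.
 */
static unsigned long long buffered_write_limit(unsigned long long total_bps,
					       unsigned long long direct_bps)
{
	return direct_bps >= total_bps ? 0 : total_bps - direct_bps;
}
```

So a cgroup limited to 100 MB/s total that is observed doing 30 MB/s of direct IO would have its dirtiers throttled to 70 MB/s by balance_dirty_pages(); if direct IO alone exceeds the limit, the buffered budget drops to zero and the dirtiers are fully paused.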
* Re: [RFC] writeback and cgroup @ 2012-04-06 9:59 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-06 9:59 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hi Tejun, On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote: > Hey, Fengguang. > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > > Yeah it should be trivial to apply the balance_dirty_pages() > > throttling algorithm to the read/direct IOs. However up to now I don't > > see much added value to *duplicate* the current block IO controller > > functionalities, assuming the current users and developers are happy > > with it. > > Heh, trust me. It's half broken and people ain't happy. I get that Yeah, although the balance_dirty_pages() IO controller for buffered writes looks perfect in itself, it's not enough to meet user demands. The user expectation should be: hey, please throttle *all* IOs from this cgroup to this amount, either in absolute bps/iops limits or in some proportional weight value (or both, whatever the lower takes effect). And if necessary, he may request further limits/weights for each type of IO inside the cgroup. Now the blkio cgroup supports direct IO and the balance_dirty_pages() IO controller supports buffered writes. They are providing limits/weights for either direct IO or buffered writes, which is fine if it's pure direct IO or pure buffered write. For the common mixed IO workloads, it's obviously not enough. Fortunately, the above gap can be easily filled judging from the block/cfq IO controller code. By adding some direct IO accounting and changing several lines of my patches to make use of the collected stats, the semantics of the blkio.throttle.write_bps interfaces can be changed from "limit for direct IO" to "limit for direct+buffered IOs". 
Ditto for blkio.weight and blkio.write_iops, as long as some iops/device time stats are made available to balance_dirty_pages(). It would be a fairly *easy* change. :-) It's merely adding some accounting code and there is no need to change the block IO controlling algorithm at all. I'll do the work of accounting (which is basically independent of the IO controlling) and use the new stats in balance_dirty_pages(). The only problem I can see now is that balance_dirty_pages() works per-bdi and blkcg works per-device. So the two ends may not match nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where sdb is shared by lv0 and lv1. However such situations should be rare, and much more acceptable than the problems arising from the "push back" approach which impacts everyone. > your algorithm can be updated to consider all IOs and I believe that > but what I don't get is how would such information get to writeback > and in turn how writeback would enforce the result on reads and direct > IOs. Through what path? Will all reads and direct IOs travel through > balance_dirty_pages() even direct IOs on raw block devices? Or would > the writeback algorithm take the configuration from cfq, apply the > algorithm and give back the limits to enforce to cfq? If the latter, > isn't that at least somewhat messed up? cfq is working well and doesn't need any modifications. Let's just make balance_dirty_pages() cgroup aware and fill the gap of the current block IO controller. If the balance_dirty_pages() throttling algorithms are ever applied to read and direct IOs, it would be for NFS, CIFS etc. Even for them, there may be better throttling choices. For example, Trond mentioned the RPC layer to me during the summit. > > I did the buffered write IO controller mainly to fill the gap. If I > > happen to stand in your way, sorry that's not my initial intention. > > No, no, it's not about standing in my way.
As Vivek said in the other > reply, it's that the "gap" that you filled was created *because* > writeback wasn't cgroup aware and now you're in turn filling that gap > by making writeback work around that "gap". I mean, my mind boggles. > Doesn't yours? I strongly believe everyone's should. Heh. It's a hard problem indeed. I felt great pains in the IO-less dirty throttling work. I did a lot of reasoning about it, and have in fact kept the cgroup IO controller in mind since its early days. Now I'd say it's hands down the right place to adapt to the gap between the total IO limit and what's carried out by the block IO controller. > > It's a pity and surprise that Google as a big user does not buy into > > this simple solution. You may prefer more comprehensive controls which > > may not be easily achievable with the simple scheme. However the > > complexities and overheads involved in throttling the flusher IOs > > really upset me. > > Heh, believe it or not, I'm not really wearing google hat on this > subject and google's writeback people may have completely different > opinions on the subject than mine. In fact, I'm not even sure how > much "work" time I'll be able to assign to this. :( OK, understood. > > The sweet split point would be for balance_dirty_pages() to do cgroup > > aware buffered write throttling and leave other IOs to the current > > blkcg. For this to work well as a total solution for end users, I hope > > we can cooperate and figure out ways for the two throttling entities > > to work well with each other. > > There's where I'm confused. How is the said split supposed to work? > They aren't independent.
It's about - keep block IO cgroup untouched (in its current algorithm, for throttling direct IO) - let balance_dirty_pages() adapt to the throttling target buffered_write_limit = total_limit - direct_IOs > > What I'm interested in is, what's Google and other users' use schemes in > > practice. What's their desired interfaces. Whether and how the > > combined bdp+blkcg throttling can fulfill the goals. > > I'm not too privy to mm and writeback in google and even if so I > probably shouldn't talk too much about it. Confidentiality and all. > That said, I have the general feeling that goog already figured out > how to at least work around the existing implementation and would be > able to continue no matter how upstream development fans out. > > That said, wearing the cgroup maintainer and general kernel > contributor hat, I'd really like to avoid another design mess up. To me it looks like a pretty clean split and I find it to be an easy solution (after sorting it out the hard way). I'll show the code and test results after some time. > > > Let's please keep the layering clear. IO limitations will be applied > > > at the block layer and pressure will be formed there and then > > > propagated upwards eventually to the originator. Sure, exposing the > > > whole information might result in better behavior for certain > > > workloads, but down the road, say, in three or five years, devices > > > which can be shared without worrying too much about seeks might be > > > commonplace and we could be swearing at a disgusting structural mess, > > > and sadly various cgroup support seems to be a prominent source of > > > such design failures. > > > > Super fast storage is coming, which will make us regret making the > > IO path overly complex. Spinning disks are not going away anytime soon.
> > I doubt Google is willing to afford the disk seek costs on its > > millions of disks and has the patience to wait until switching all of > > the spin disks to SSD years later (if it will ever happen). > > This is new. Let's keep the damn employer out of the discussion. > While the area I work on is affected by my employment (writeback isn't > even my area BTW), I'm not gonna do something adverse to upstream even > if it's beneficial to google and I'm much more likely to do something > which may hurt google a bit if it's gonna benefit upstream. > > As for the faster / newer storage argument, that is *exactly* why we > want to keep the layering proper. Writeback works from the pressure > from the IO stack. If IO technology changes, we update the IO stack > and writeback still works from the pressure. It may need to be > adjusted but the principles don't change. To me, balance_dirty_pages() is *the* proper layer for buffered writes. It's always there doing 1:1 proportional throttling. Then you try to kick in and add *double* throttling at the block/cfq layer. Now the low layer may enforce 10:1 throttling and push balance_dirty_pages() away from its balanced state, leading to large fluctuations and program stalls. This can be avoided by telling balance_dirty_pages(): "your balance goal is no longer 1:1, but 10:1". With this information balance_dirty_pages() will behave right. Then there is the question: if balance_dirty_pages() will work just as well provided the information, why bother doing the throttling at the low layer and "push back" the pressure all the way up? > > It's obvious that your below proposal involves a lot of complexities, > > overheads, and will hurt performance. It basically involves > > Hmmm... that's not the impression I got from the discussion. > According to Jan, applying the current writeback logic to cgroup'fied > bdi shouldn't be too complex, no?
In the sense of "avoidable" complexity :-) > > - running concurrent flusher threads for cgroups, which adds back the > > disk seeks and lock contentions. And still has problems with sync > > and shared inodes. > > I agree this is an actual concern but if the user wants to split one > spindle to multiple resource domains, there's gonna be a considerable > amount of overhead no matter what. If you want to improve how block > layer handles the split, you're welcome to dive into the block layer, > where the split is made, and improve it. > > > - splitting device queue for cgroups, possibly scaling up the pool of > > writeback pages (and locked pages in the case of stable pages) which > > could stall random processes in the system > > Sure, it'll take up more buffering and memory but that's the overhead > of the cgroup business. I want it to be less intrusive at the cost of > somewhat more resource consumption. ie. I don't want writeback logic > itself deeply involved in block IO cgroup enforcement even if that > means somewhat less efficient resource usage. balance_dirty_pages() is already deeply involved in dirty throttling. As you can see from this patchset, the same algorithms can be extended trivially to work with cgroup IO limits. buffered write IO controller in balance_dirty_pages() https://lkml.org/lkml/2012/3/28/275 It does not require forking off the flusher threads and splitting up the IO queue at all. > > - the mess of metadata handling > > Does throttling from writeback actually solve this problem? What > about fsync()? Does that already go through balance_dirty_pages()? balance_dirty_pages() does throttling at safe points outside of fs transactions/locks. fsync() only submits IO for already dirtied pages and won't be throttled by balance_dirty_pages(). Throttling happens earlier, when the task is dirtying the pages.
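The point about throttling at page-dirtying time rather than at fsync() can be illustrated with a toy model of the IO-less scheme: the dirtier sleeps at a safe point, for a pause proportional to how many pages it dirtied versus its allowed rate. This is an illustrative sketch only, not the kernel's actual (heavily clamped and feedback-controlled) computation:

```python
def dirty_pause(pages_dirtied, task_ratelimit_pps):
    """Toy model of IO-less dirty throttling: after dirtying some pages,
    the task sleeps long enough that its average dirtying rate matches
    the target rate (in pages per second). The real balance_dirty_pages()
    clamps, smooths and feedback-controls this heavily."""
    if task_ratelimit_pps <= 0:
        raise ValueError("target rate must be positive")
    return pages_dirtied / task_ratelimit_pps  # seconds to sleep
```

A task that dirtied 32 pages against a 1000 pages/s target would pause 32 ms before continuing; fsync() never reaches this path because its pages were already throttled when they were dirtied.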
> > - unnecessarily coupled with memcg, in order to take advantage of the > > per-memcg dirty limits for balance_dirty_pages() to actually convert > > the "pushed back" dirty pages pressure into lowered dirty rate. Why > > the hell the users *have to* set up memcg (suffering from all the > > inconvenience and overheads) in order to do IO throttling? Please, > > this is really ugly! And the "back pressure" may constantly push the > > memcg dirty pages to the limits. I'm not going to support *misuse* > > of per-memcg dirty limits like this! > > Writeback sits between blkcg and memcg and it indeed can be hairy to > consider both sides especially given the current sorry complex state > of cgroup and I can see why it would seem tempting to add a separate > controller or at least knobs to support that. That said, I *think* > given that memcg controls all other memory parameters it probably > would make most sense giving that parameter to memcg too. I don't > think this is really relevant to this discussion tho. Who owns > dirty_limits is a separate issue. In the "back pressure" scheme, memcg is a must because only it has all the infrastructure to track dirty pages upon which you can apply some dirty_limits. Don't tell me you want to account dirty pages in blkcg... > > I cannot believe you would keep overlooking all the problems without > > good reasons. Please do tell us the reasons that matter. > > Well, I tried and I hope some of it got through. I also wrote a lot > of questions, mainly regarding how what you have in mind is supposed > to work through what path. Maybe I'm just not seeing what you're > seeing but I just can't see where all the IOs would go through and > come together. Can you please elaborate more on that? What I can see is, it looks pretty simple and natural to let balance_dirty_pages() fill the gap towards a total solution :-) - add direct IO accounting at some convenient point of the IO path (IO submission or completion, either is fine).
- change several lines of the buffered write IO controller to integrate the direct IO rate into the formula to fit the "total IO" limit - in future, add more accounting as well as feedback control to make balance_dirty_pages() work with IOPS and disk time Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
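The "integrate the direct IO rate into the formula" step above reduces to a small piece of arithmetic. A hedged sketch, where the clamp at zero is my own assumption rather than anything stated in the thread:

```python
def buffered_write_limit(total_limit_bps, direct_io_bps):
    """Fengguang's split: buffered_write_limit = total_limit - direct_IOs,
    clamped at zero (my assumption) so that a burst of direct IO cannot
    drive the buffered-write target negative."""
    return max(total_limit_bps - direct_io_bps, 0)
```

So with a 100 MB/s blkio.throttle.write_bps limit and 30 MB/s of observed direct IO, balance_dirty_pages() would aim the cgroup's buffered writes at 70 MB/s.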
* Re: [RFC] writeback and cgroup 2012-04-06 9:59 ` Fengguang Wu (?) (?) @ 2012-04-17 22:38 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-17 22:38 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA Hello, Fengguang. On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote: > Fortunately, the above gap can be easily filled judging from the > block/cfq IO controller code. By adding some direct IO accounting > and changing several lines of my patches to make use of the collected > stats, the semantics of the blkio.throttle.write_bps interfaces can be > changed from "limit for direct IO" to "limit for direct+buffered IOs". > Ditto for blkio.weight and blkio.write_iops, as long as some > iops/device time stats are made available to balance_dirty_pages(). > > It would be a fairly *easy* change. :-) It's merely adding some > accounting code and there is no need to change the block IO > controlling algorithm at all. I'll do the work of accounting (which > is basically independent of the IO controlling) and use the new stats > in balance_dirty_pages(). I don't really understand how this can work. For hard limits, maybe, but for proportional IO, you have to know which cgroups have IOs before assigning the proportions, so blkcg assigning IO bandwidth without knowing async writes simply can't work. For example, let's say cgroups A and B have 2:8 split. If A has IOs on queue and B doesn't, blkcg will assign all IO bandwidth to A. 
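Tejun's 2:8 example is the standard work-conserving behavior of proportional IO schedulers: weight is divided only among groups that actually have IOs queued. A toy allocator showing why blkcg cannot assign bandwidth without seeing the async writes (a sketch, not cfq's actual time-slice machinery):

```python
def proportional_share(weights, backlogged, total_bw):
    """Divide total_bw among the groups that actually have IOs queued,
    in proportion to their weights; idle groups get nothing. This is
    the work-conserving property of proportional IO scheduling."""
    active = {g: w for g, w in weights.items() if backlogged[g]}
    if not active:
        return {g: 0 for g in weights}
    wsum = sum(active.values())
    return {g: (total_bw * weights[g] / wsum) if g in active else 0
            for g in weights}

# With a 2:8 split but only A backlogged, A receives all the bandwidth:
# proportional_share({"A": 2, "B": 8}, {"A": True, "B": False}, 100)
#   -> {"A": 100.0, "B": 0}
```

If B's pending async writes are invisible to the allocator (because they sit in writeback, not the queue), the 2:8 ratio simply cannot be enforced, which is the crux of Tejun's objection.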
I can't wrap my head around how writeback is gonna make use of the resulting stats but let's say it decides it needs to put some IOs out for both cgroups. What happens then? Do all the async writes go through the root cgroup, controlled by and affecting the ratio between rootcg and cgroups A and B? Or do they have to be accounted as part of cgroups A and B? If so, what if the added bandwidth goes over the limit? Let's say if we implement overcharge; then, I suppose we'll have to communicate that upwards too, right? This is still easy. What about hierarchical propio? What happens then? You can't do hierarchical proportional allocation without knowing how many IOs are pending for which group. How is that information gonna be communicated between blkcg and writeback? Are we gonna have two separate hierarchical proportional IO allocators? How is that gonna work at all? If we're gonna have a single allocator in the block layer, writeback would have to feed the amount of IOs it may generate into the allocator, get the resulting allocation and then issue IO and then block layer again will have to account these to the originating cgroups. It's just crazy. > The only problem I can see now is that balance_dirty_pages() works > per-bdi and blkcg works per-device. So the two ends may not match > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where > sdb is shared by lv0 and lv1. However such situations should be rare, and > much more acceptable than the problems arising from the "push back" > approach which impacts everyone. I don't know. What problems? AFAICS, the biggest issue is writeback of different inodes getting mixed, resulting in poor performance, but if you think about it, that's about the frequency of switching cgroups and a problem which can and should be dealt with from the block layer (e.g. use a larger time slice if all the pending IOs are async).
Writeback's duty is generating a stream of async writes which can be served efficiently for the *cgroup* and keeping the buffer filled as necessary and chaining the backpressure from there to the actual dirtier. That's what writeback does without cgroup. Nothing fundamental changes with cgroup. It's just finer grained. > > No, no, it's not about standing in my way. As Vivek said in the other > > reply, it's that the "gap" that you filled was created *because* > > writeback wasn't cgroup aware and now you're in turn filling that gap > > by making writeback work around that "gap". I mean, my mind boggles. > > Doesn't yours? I strongly believe everyone's should. > > Heh. It's a hard problem indeed. I felt great pains in the IO-less > dirty throttling work. I did a lot of reasoning about it, and have in > fact kept the cgroup IO controller in mind since its early days. Now I'd > say it's hands down the right place to adapt to the gap between the total IO > limit and what's carried out by the block IO controller. You're not providing any valid counterarguments to the issues being raised about the messed-up design. How is anything "hands down" here? > > There's where I'm confused. How is the said split supposed to work? > > They aren't independent. I mean, who gets to decide what and where > > are those decisions enforced? > > Yeah it's not independent. It's about > > - keep block IO cgroup untouched (in its current algorithm, for > throttling direct IO) > > - let balance_dirty_pages() adapt to the throttling target > > buffered_write_limit = total_limit - direct_IOs Think about proportional allocation. You don't have a number until you know who has pending IOs and how much. > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > It's always there doing 1:1 proportional throttling. Then you try to > kick in and add *double* throttling at the block/cfq layer.
Now the low > layer may enforce 10:1 throttling and push balance_dirty_pages() away > from its balanced state, leading to large fluctuations and program > stalls. Just do the same 1:1 inside each cgroup. > This can be avoided by telling balance_dirty_pages(): "your > balance goal is no longer 1:1, but 10:1". With this information > balance_dirty_pages() will behave right. Then there is the question: > if balance_dirty_pages() will work just as well provided the information, > why bother doing the throttling at the low layer and "push back" the > pressure all the way up? Because splitting a resource into two pieces arbitrarily with different amounts of consumption on each side and then applying the same proportion on both doesn't mean anything? > balance_dirty_pages() is already deeply involved in dirty throttling. > As you can see from this patchset, the same algorithms can be extended > trivially to work with cgroup IO limits. > > buffered write IO controller in balance_dirty_pages() > https://lkml.org/lkml/2012/3/28/275 It is a half-broken thing with fundamental design flaws which can't be corrected without complete reimplementation. I don't know what to say. > In the "back pressure" scheme, memcg is a must because only it has all > the infrastructure to track dirty pages upon which you can apply some > dirty_limits. Don't tell me you want to account dirty pages in blkcg... For now, per-inode tracking seems good enough. > What I can see is, it looks pretty simple and natural to let > balance_dirty_pages() fill the gap towards a total solution :-) > > - add direct IO accounting at some convenient point of the IO path > (IO submission or completion, either is fine).
> > - change several lines of the buffered write IO controller to > integrate the direct IO rate into the formula to fit the "total > IO" limit > > - in future, add more accounting as well as feedback control to make > balance_dirty_pages() work with IOPS and disk time To me, you seem to be not addressing the issues I've been raising at all and just repeating the same points again and again. If I'm misunderstanding something, please point out. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
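Tejun's objection about proportional allocation ("you don't have a number until you know who has pending IOs and how much") can be seen in a toy model of work-conserving proportional bandwidth sharing. This is a plain-Python sketch with invented names, not kernel code:

```python
# Toy model of work-conserving proportional IO bandwidth allocation.
# A static 2:8 weight split says nothing about actual bandwidth until
# the set of cgroups with pending IOs is known: an idle group's share
# is redistributed to the busy ones. Names are invented for illustration.

def effective_shares(weights, pending):
    """Distribute bandwidth among groups with pending IOs,
    proportionally to their configured weights."""
    active = {g: w for g, w in weights.items() if pending.get(g, 0) > 0}
    total = sum(active.values())
    return {g: w / total for g, w in active.items()}

weights = {"A": 2, "B": 8}

# Both groups busy: the configured 2:8 split applies.
print(effective_shares(weights, {"A": 10, "B": 10}))  # {'A': 0.2, 'B': 0.8}

# Only A has IOs queued: A receives all the bandwidth despite weight 2.
print(effective_shares(weights, {"A": 10, "B": 0}))   # {'A': 1.0}
```

This is why blkcg cannot hand writeback a precomputed number for async writes: the proportion only exists once the pending-IO state of every group is known.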
* Re: [RFC] writeback and cgroup
  [not found] ` <20120417223854.GG19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-19 14:23 ` Fengguang Wu
  0 siblings, 0 replies; 261+ messages in thread
From: Fengguang Wu @ 2012-04-19 14:23 UTC (permalink / raw)
To: Tejun Heo
Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer,
    linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers,
    cgroups, ctalbott, rni, lsf

Hi Tejun,

On Tue, Apr 17, 2012 at 03:38:54PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
>
> On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> > Fortunately, the above gap can be easily filled judging from the
> > block/cfq IO controller code. By adding some direct IO accounting
> > and changing several lines of my patches to make use of the collected
> > stats, the semantics of the blkio.throttle.write_bps interfaces can be
> > changed from "limit for direct IO" to "limit for direct+buffered IOs".
> > Ditto for blkio.weight and blkio.write_iops, as long as some
> > iops/device time stats are made available to balance_dirty_pages().
> >
> > It would be a fairly *easy* change. :-) It's merely adding some
> > accounting code and there is no need to change the block IO
> > controlling algorithm at all. I'll do the work of accounting (which
> > is basically independent of the IO controlling) and use the new stats
> > in balance_dirty_pages().
>
> I don't really understand how this can work.  For hard limits, maybe,

Yeah, hard limits are the easiest.

> but for proportional IO, you have to know which cgroups have IOs
> before assigning the proportions, so blkcg assigning IO bandwidth
> without knowing async writes simply can't work.
>
> For example, let's say cgroups A and B have 2:8 split.  If A has IOs
> on queue and B doesn't, blkcg will assign all IO bandwidth to A.
> I can't wrap my head around how writeback is gonna make use of the
> resulting stats but let's say it decides it needs to put out some IOs
> for both cgroups.  What happens then?  Do all the async writes go
> through the root cgroup, controlled by and affecting the ratio between
> rootcg and cgroups A and B?  Or do they have to be accounted as part
> of cgroups A and B?  If so, what if the added bandwidth goes over the
> limit?  Let's say we implement overcharge; then, I suppose we'll have
> to communicate that upwards too, right?

The trick is to do the throttling for buffered writes at page dirty
time, when balance_dirty_pages() knows exactly which cgroup the
dirtier task belongs to, the dirty rate, and whether or not it's an
aggressive dirtier.  The cgroup's direct IO rate can also be measured.
The only missing information is whether it's a non-aggressive direct
writer (only cfq may know about that).  For now I'm simply assuming
direct writers are all aggressive.

So if A and B have a 2:8 split, and A only submits async IO while B
only submits direct IO, no cfqg will exist for A at all.  cfq will
serve B and the root cgroup in an interleaved fashion.  In the patch I
just posted, blkcg_update_dirty_ratelimit() will transfer A's weight
of 2 to the root cgroup for use by the flusher.  In the end the
flusher gets weight 2 and B gets weight 8.  Here we need to
distinguish the weight assigned by the user from the weight after the
async/sync adjustment.

The other missing piece is the real cost when the dirtied pages
eventually hit the disk, perhaps dozens of seconds later.  For that
part I'm assuming simple dd workloads at this time, and
balance_dirty_pages() is now splitting the flusher's overall writeout
progress into the dirtier tasks' dirty ratelimits based on bandwidth
fairness.

> This is still easy.  What about hierarchical propio?  What happens
> then?  You can't do hierarchical proportional allocation without
> knowing how much IOs are pending for which group.
> How is that
> information gonna be communicated between blkcg and writeback?  Are
> we gonna have two separate hierarchical proportional IO allocators?
> How is that gonna work at all?  If we're gonna have a single
> allocator in the block layer, writeback would have to feed the amount
> of IOs it may generate into the allocator, get the resulting
> allocation, then issue IO, and then the block layer again will have
> to account these to the originating cgroups.  It's just crazy.

No, I have not got an idea on how to do a hierarchical proportional IO
controller without physically splitting up the async IO streams.  It's
pretty hard and I'd better break out before it drives me crazy.  So in
the following discussion, let's assume cfq will move async requests
from the current root cgroup to the individual IO issuers' cfqgs and
schedule service for the async streams there, and hence the need to
create "backpressure" for balance_dirty_pages() to eventually throttle
the individual dirtier tasks.

That said, I still don't think we've come up with any satisfactory
solution.  It's a hard problem after all.

> > The only problem I can see now, is that balance_dirty_pages() works
> > per-bdi and blkcg works per-device. So the two ends may not match
> > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> > sdb is shared by lv0 and lv1. However it should be rare situations and
> > be much more acceptable than the problems arising from the "push back"
> > approach which impacts everyone.
>
> I don't know.  What problems?  AFAICS, the biggest issue is writeback
> of different inodes getting mixed resulting in poor performance, but
> if you think about it, that's about the frequency of switching cgroups
> and a problem which can and should be dealt with from the block layer
> (e.g. use a larger time slice if all the pending IOs are async).

Yeah, increasing the time slice would help that case.
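The async/sync weight adjustment described earlier (an async-only cgroup donating its weight to the root cgroup on behalf of the flusher) can be sketched in a few lines. This is a plain-Python toy with invented names, not the actual blkcg_update_dirty_ratelimit() implementation:

```python
# Toy sketch of the async/sync weight adjustment: a cgroup that issues
# only buffered (async) writes has no cfqg of its own, so its weight
# is transferred to the root cgroup for use by the flusher thread.
# All names here are invented for illustration.

def adjust_weights(user_weights, issues_direct_io):
    """Return the weights cfq would effectively serve with: cgroups
    doing no direct IO donate their weight to the root/flusher."""
    adjusted = {"root(flusher)": 0}
    for cg, weight in user_weights.items():
        if issues_direct_io[cg]:
            adjusted[cg] = weight                 # cfqg exists: keep weight
        else:
            adjusted["root(flusher)"] += weight   # async only: donate
    return adjusted

# A: async-only dirtier, B: direct IO writer, user-assigned 2:8 split.
# The flusher ends up serving A's async stream with weight 2, B keeps 8.
print(adjust_weights({"A": 2, "B": 8}, {"A": False, "B": True}))
```

The sketch also makes the caveat visible: the user-assigned weights and the weights cfq actually schedules with are different quantities once async IO is routed through the root cgroup.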
In general it's not merely the frequency of switching cgroups, if we
take the hard disk's writeback cache into account.  Think about some
inodes with async IO, A1, A2, A3, ..., and inodes with sync IO, D1,
D2, D3, ..., all from different cgroups.  When the root cgroup holds
all the async inodes, cfq may schedule IO interleaved like this:

	A1, A1, A1, A2, A1, A2, ...
	D1, D2, D3, D4, D5, D6, ...

Now it becomes:

	A1, A2, A3, A4, A5, A6, ...
	D1, D2, D3, D4, D5, D6, ...

The difference is that it's now switching async inodes each time.  At
the cfq level the seek costs look the same; however, the disk's
writeback cache may help merge the data chunks from the same inode A1.
Well, it may only cost some latency on spinning disks.  But how about
SSDs?  They can run deeper queues and benefit from larger writes.

> Writeback's duty is generating stream of async writes which can be
> served efficiently for the *cgroup* and keeping the buffer filled as
> necessary and chaining the backpressure from there to the actual
> dirtier.  That's what writeback does without cgroup.  Nothing
> fundamental changes with cgroup.  It's just finer grained.

Believe me, physically partitioning the dirty pages and async IO
streams comes at a big cost.  It won't scale well in many ways.

For one instance, splitting the request queues will give rise to more
PG_writeback pages.  Those pages have been the biggest source of
latency issues in various parts of the system.  It's not uncommon for
me to see filesystems sleep on PG_writeback pages during heavy
writeback, within some lock or transaction, which in turn stalls many
tasks that try to do IO or merely dirty some pages in memory.  Random
writes are especially susceptible to such stalls.  The stable-pages
feature also vastly increases the chance of stalls by locking the
writeback pages.  Page reclaim may also block on PG_writeback and/or
PG_dirty pages.  In the case of direct reclaim, that means blocking
random tasks that are allocating memory in the system.
PG_writeback pages are much worse than PG_dirty pages in that they are
not movable.  This makes a big difference for high-order page
allocations.  To make room for a 2MB huge page, vmscan has the option
to migrate PG_dirty pages, but for PG_writeback it has no better
choice than to wait for IO completion.

The difficulty of THP allocation goes up *exponentially* with the
number of PG_writeback pages.  Assume PG_writeback pages are randomly
distributed in the physical memory space.  Then we have the formula

	P(reclaimable for THP) = (1 - P(hit PG_writeback))^512

That's the probability that a contiguous range of 512 pages (one 2MB
transparent huge page with 4KB base pages) is free of PG_writeback,
so that it's immediately reclaimable for use by a transparent huge
page.  This ruby script shows us the concrete numbers.

	irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }

	P(hit PG_writeback)	P(reclaimable for THP)
	0.001			0.599
	0.002			0.359
	0.003			0.215
	0.004			0.128
	0.005			0.077
	0.006			0.046
	0.007			0.027
	0.008			0.016
	0.009			0.010
	0.010			0.006

The numbers show that when PG_writeback pages go up from 0.1% to 1% of
system memory, the THP reclaim success ratio drops quickly from 60% to
0.6%.  It indicates that in order to use THP without constantly
running into stalls, the reasonable PG_writeback ratio is <= 0.1%.
Going beyond that threshold, it quickly becomes intolerable.  That
makes a limit of 256MB writeback pages for a mem=256GB system.

Looking at the real vmstat nr_writeback numbers in dd write tests:

	JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback      217009
	JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback     198335
	JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback       306026
	JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback      315099
	JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback    1216058
	JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback    895335

Oops, btrfs has 4GB of writeback pages -- which asks for some bug
fixing.
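The arithmetic in this argument can be reproduced outside irb. A Python equivalent of the one-liner, plus a page-to-megabyte conversion for the vmstat figures (assuming 4KB base pages, as the 2MB = 512-page THP arithmetic implies):

```python
# Recomputing the THP-reclaimability table: a 2MB transparent huge
# page spans 512 4KB base pages, so a candidate range is reclaimable
# only if none of its 512 pages is under writeback -- matching the
# (1-j)**512 in the irb one-liner.
for i in range(1, 11):
    p_wb = i / 1000.0                  # fraction of pages under writeback
    p_thp = (1 - p_wb) ** 512          # P(512 contiguous pages all clean)
    print(f"{p_wb:.3f}\t{p_thp:.3f}")  # first row: 0.001  0.599

# Converting the vmstat nr_writeback counts to bytes (4KB pages):
def pages_to_mb(n):
    return n * 4096 / (1 << 20)

print(round(pages_to_mb(217009)))   # ext4-1dd: ~848 MB
print(round(pages_to_mb(1216058)))  # btrfs-1dd: ~4750 MB, i.e. ~4.6 GB
```

This confirms both the probability table and the "4GB of writeback pages" reading of the btrfs number.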
Even ext4's ~850MB still looks way too high, but that's ~1s worth of data per queue (or 130ms worth of data for the high performance Intel SSDs, which are perhaps in danger of queue underruns?). So this system would require 512GB of memory to comfortably run KVM instances with THP support.

Judging from the above numbers, we can hardly afford to split up the IO queues and proliferate writeback pages.

It's worth noting that running multiple flusher threads per bdi means not only disk seeks for spinning disks and smaller IO sizes for SSDs, but also lock contention and cache bouncing for metadata-heavy workloads and fast storage.

To give some concrete examples of how much CPU overhead can be saved by reducing the number of IO submitters, here are some summaries of the IO-less dirty throttling gains. Tests show that it yields huge benefits in reducing IO seeks as well as CPU overheads. For example, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention". (by Dave Chinner)

- "CPU usage has dropped by ~55%", "it certainly appears that most of
  the CPU time saving comes from the removal of contention on the
  inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
  cacheline bouncing, because the new code is able to call much less
  frequently into balance_dirty_pages() and hence access the _global_
  page states)

- the user space "App overhead" is reduced by 20%, by avoiding the
  cacheline pollution from the complex writeback code path

- "for a ~5% throughput reduction", "the number of write IOs have
  dropped by ~25%", and the elapsed time is reduced from 41:42.17 to
  40:53.23

And for simple dd tests:

- "throughput for a _single_ large dd (100GB) increase from ~650MB/s
  to 700MB/s"

- "On a simple test of 100 dd, it reduces the CPU %system time from
  30% to 3%, and improves IO throughput from 38MB/s to 42MB/s."

> > > No, no, it's not about standing in my way.
> > > As Vivek said in the other
> > > reply, it's that the "gap" that you filled was created *because*
> > > writeback wasn't cgroup aware and now you're in turn filling that gap
> > > by making writeback work around that "gap".  I mean, my mind boggles.
> > > Doesn't yours?  I strongly believe everyone's should.
> >
> > Heh. It's a hard problem indeed. I felt great pains in the IO-less
> > dirty throttling work. I did a lot of reasoning about it, and have in
> > fact kept the cgroup IO controller in mind since its early days. Now
> > I'd say it's hands down for it to adapt to the gap between the total
> > IO limit and what's carried out by the block IO controller.
>
> You're not providing any valid counter arguments to the issues
> being raised about the messed up design.  How is anything "hands down"
> here?

Yeah, sadly, it turns out to be not so "hands down" when it comes to the proportional async/sync splits, and it's downright prohibitive when it comes to hierarchical support..

> > > There's where I'm confused.  How is the said split supposed to work?
> > > They aren't independent.  I mean, who gets to decide what and where
> > > are those decisions enforced?
> >
> > Yeah it's not independent. It's about
> >
> > - keep block IO cgroup untouched (in its current algorithm, for
> >   throttling direct IO)
> >
> > - let balance_dirty_pages() adapt to the throttling target
> >
> >       buffered_write_limit = total_limit - direct_IOs
>
> Think about proportional allocation.  You don't have a number until
> you know who have pending IOs and how much.

We have the IO rate. The above formula actually works on "rates", which is good enough for calculating the ratelimit for buffered writes. We don't have to know every transient state of the pending IOs, because the direct IOs are handled by cfq based on cfqg weight, and for async IOs there are plenty of dirty pages for buffering/tolerating small errors in the dirty rate control.
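The rate arithmetic above can be sketched in a few lines (a toy illustration, not kernel code; function names are mine): the cgroup's buffered-write budget is whatever is left of the total limit after the measured direct IO rate, and that budget is then divided among the dirtier tasks as their dirty ratelimit.

```ruby
# buffered_write_limit = total_limit - direct_IOs, working on rates.
# All names and numbers are illustrative assumptions.
def buffered_write_limit(total_limit, direct_io_rate)
  [total_limit - direct_io_rate, 0].max   # budget never goes negative
end

def task_ratelimit(total_limit, direct_io_rate, nr_dirtiers)
  # split the buffered-write budget evenly among the dirtier tasks
  buffered_write_limit(total_limit, direct_io_rate) / nr_dirtiers
end

# e.g. a 100 MB/s total limit, direct IO measured at 30 MB/s,
# 2 buffered writers: each gets throttled to (100 - 30) / 2 = 35 MB/s
puts task_ratelimit(100, 30, 2)   # => 35
```

The point is that only rates are needed: the transient state of pending IOs is absorbed by the pool of dirty pages, which tolerates small errors in the per-task ratelimit.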
> > To me, balance_dirty_pages() is *the* proper layer for buffered
> > writes. It's always there doing 1:1 proportional throttling. Then
> > you try to kick in to add *double* throttling in the block/cfq
> > layer. Now the low layer may enforce 10:1 throttling and push
> > balance_dirty_pages() away from its balanced state, leading to
> > large fluctuations and program stalls.
>
> Just do the same 1:1 inside each cgroup.

Sure. But the ratio mismatch I'm talking about is inter-cgroup. For example, suppose there are only 2 dd tasks doing buffered writes in the system. Now consider the mismatch where cfq is dispatching their IO requests at 10:1 weights, while balance_dirty_pages() is throttling the dd tasks at a 1:1 equal split because it's not aware of the cgroup weights.

What will happen in the end? The 1:1 ratio imposed by balance_dirty_pages() will take effect, and the dd tasks will progress at the same pace. The cfq weights will be defeated, because the async queue for the second dd (and cgroup) constantly runs empty.

> > This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at the low layer and "pushing back"
> > the pressure all the way up?
>
> Because splitting a resource into two pieces arbitrarily with
> different amount of consumptions on each side and then applying the
> same proportion on both doesn't mean anything?

Sorry, I don't quite follow you here.

> > The balance_dirty_pages() is already deeply involved in dirty
> > throttling. As you can see from this patchset, the same algorithms
> > can be extended trivially to work with cgroup IO limits.
> >
> >     buffered write IO controller in balance_dirty_pages()
> >     https://lkml.org/lkml/2012/3/28/275
>
> It is half broken thing with fundamental design flaws which can't be
> corrected without complete reimplementation.  I don't know what to
> say.

I'm fully aware of that, and so have been exploring new ways out :)

> > In the "back pressure" scheme, memcg is a must because only it has
> > all the infrastructure to track dirty pages, upon which you can apply
> > some dirty_limits. Don't tell me you want to account dirty pages in
> > blkcg...
>
> For now, per-inode tracking seems good enough.

There are actually two directions of information passing:

1) Passing the dirtier ownership down to the bio. For this part, it's
   mostly enough to do the lightweight per-inode tracking.

2) Passing the backpressure up, from cfq (IO dispatch) to the flusher
   (IO submission) as well as to balance_dirty_pages() (which actually
   throttles the dirtier tasks). The flusher naturally works at inode
   granularity. balance_dirty_pages(), however, is about limiting dirty
   pages. For this part, it needs to know the total number of dirty
   pages and the writeout bandwidth of each cgroup in order to do
   proper dirty throttling, and to maintain a proper number of dirty
   pages to avoid the queue underrun issue explained in the 2-dd
   example above.

> > What I can see is, it looks pretty simple and natural to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> >
> > - add direct IO accounting at some convenient point of the IO path;
> >   the IO submission or completion point, either is fine
> >
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> >
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
>
> To me, you seem to be not addressing the issues I've been raising at
> all and just repeating the same points again and again.
> If I'm misunderstanding something, please point out.

Hopefully the renewed patch can dismiss some of your questions. It's a pity that I didn't think about the hierarchical requirement at the time; otherwise the complexity of the calculations would still look manageable.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-19 14:23 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-19 14:23 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, vgoyal-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k, andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hi Tejun, On Tue, Apr 17, 2012 at 03:38:54PM -0700, Tejun Heo wrote: > Hello, Fengguang. > > On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote: > > Fortunately, the above gap can be easily filled judging from the > > block/cfq IO controller code. By adding some direct IO accounting > > and changing several lines of my patches to make use of the collected > > stats, the semantics of the blkio.throttle.write_bps interfaces can be > > changed from "limit for direct IO" to "limit for direct+buffered IOs". > > Ditto for blkio.weight and blkio.write_iops, as long as some > > iops/device time stats are made available to balance_dirty_pages(). > > > > It would be a fairly *easy* change. :-) It's merely adding some > > accounting code and there is no need to change the block IO > > controlling algorithm at all. I'll do the work of accounting (which > > is basically independent of the IO controlling) and use the new stats > > in balance_dirty_pages(). > > I don't really understand how this can work. For hard limits, maybe, Yeah, hard limits are the easiest. > but for proportional IO, you have to know which cgroups have IOs > before assigning the proportions, so blkcg assigning IO bandwidth > without knowing async writes simply can't work. 
> > For example, let's say cgroups A and B have 2:8 split. If A has IOs > on queue and B doesn't, blkcg will assign all IO bandwidth to A. I > can't wrap my head around how writeback is gonna make use of the > resulting stats but let's say it decides it needs to put out some IOs > out for both cgroups. What happens then? Do all the async writes go > through the root cgroup controlled by and affecting the ratio between > rootcg and cgroup A and B? Or do they have to be accounted as part of > cgroups A and B? If so, what if the added bandwidth goes over the > limit? Let's say if we implement overcharge; then, I suppose we'll > have to communicate that upwards too, right? The trick is to do the throttling for buffered writes at page dirty time, when balance_dirty_pages() knows exactly what cgroup the dirtier task belongs to, the dirty rate and whether or not it's an aggressive dirtier. The cgroup's direct IO rate can also be measured. The only missing information is whether it's a non-aggressive direct writer (only cfq may know about that). Now I'm simply assuming direct writers are all aggressive. So if A and B have 2:8 split and A only submits async IO and B only submits direct IO, there will be no cfqg exist for A at all. cfq will be serving B and root cgroup interleavely. In the patch I just posted, blkcg_update_dirty_ratelimit() will transfer A's weight 2 to the root cgroup for use by the flusher. In the end the flusher gets weight 2 and B gets weight 8. Here we need to distinguish the weight assigned by user and the weight after the async/sync adjustment. The other missing information is the real cost when the dirtied pages eventually hit the disk after perhaps dozens of seconds. For that part I'm assuming simple dd at this time and balance_dirty_pages() is now splitting out the flusher's overall writeout progress to the dirtier tasks' dirty ratelimit based on bandwidth fairness. > This is still easy. What about hierarchical propio? What happens > then? 
You can't do hierarchical proportional allocation without > knowing how much IOs are pending for which group. How is that > information gonna be communicated between blkcg and writeback? Are we > gonna have two separate hierarchical proportional IO allocators? How > is that gonna work at all? If we're gonna have single allocator in > block layer, writeback would have to feed the amount of IOs it may > generate into the allocator, get the resulting allocation and then > issue IO and then block layer again will have to account these to the > originating cgroups. It's just crazy. No I have not got the idea on how to do the hierarchical proportional IO controller without physically splitting up the async IO streams. It's pretty hard and I'd better break out before it drives me crazy. So in the following discussion, let's assume cfq will move async requests from the current root cgroup to individual IO issuer's cfqgs and schedule service for the async streams there. And thus the need to create "backpressure" for balance_dirty_pages() to eventually throttle the individual dirtier tasks. That said, I still don't think we've come up with any satisfactory solutions. It's hard problem after all. > > The only problem I can see now, is that balance_dirty_pages() works > > per-bdi and blkcg works per-device. So the two ends may not match > > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where > > sdb is shared by lv0 and lv1. However it should be rare situations and > > be much more acceptable than the problems arise from the "push back" > > approach which impacts everyone. > > I don't know. What problems? AFAICS, the biggest issue is writeback > of different inodes getting mixed resulting in poor performance, but > if you think about it, that's about the frequency of switching cgroups > and a problem which can and should be dealt with from block layer > (e.g. use larger time slice if all the pending IOs are async). 
Yeah increasing time slice would help that case. In general it's not merely the frequency of switching cgroup if take hard disk' writeback cache into account. Think about some inodes with async IO: A1, A2, A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different cgroups. So when the root cgroup holds all async inodes, the cfq may schedule IO interleavely like this A1, A1, A1, A2, A1, A2, ... D1, D2, D3, D4, D5, D6, ... Now it becomes A1, A2, A3, A4, A5, A6, ... D1, D2, D3, D4, D5, D6, ... The difference is that it's now switching the async inodes each time. At cfq level, the seek costs look the same, however the disk's writeback cache may help merge the data chunks from the same inode A1. Well, it may cost some latency for spin disks. But how about SSD? It can run deeper queue and benefit from large writes. > Writeback's duty is generating stream of async writes which can be > served efficiently for the *cgroup* and keeping the buffer filled as > necessary and chaining the backpressure from there to the actual > dirtier. That's what writeback does without cgroup. Nothing > fundamental changes with cgroup. It's just finer grained. Believe me, physically partitioning the dirty pages and async IO streams comes at big costs. It won't scale well in many ways. For one instance, splitting the request queues will give rise to PG_writeback pages. Those pages have been the biggest source of latency issues in the various parts of the system. It's not uncommon for me to see filesystems sleep on PG_writeback pages during heavy writeback, within some lock or transaction, which in turn stall many tasks that try to do IO or merely dirty some page in memory. Random writes are especially susceptible to such stalls. The stable page feature also vastly increase the chances of stalls by locking the writeback pages. Page reclaim may also block on PG_writeback and/or PG_dirty pages. 
In the case of direct reclaim, it means blocking random tasks that are allocating memory in the system. PG_writeback pages are much worse than PG_dirty pages in that they are not movable. This makes a big difference for high-order page allocations. To make room for a 2MB huge page, vmscan has the option to migrate PG_dirty pages, but for PG_writeback it has no better choice than to wait for IO completion. The difficulty of THP allocation goes up *exponentially* with the number of PG_writeback pages. Assume PG_writeback pages are randomly distributed in the physical memory space. Then we have the formula

P(reclaimable for THP) = (1 - P(hit PG_writeback))^512

That's the probability for a contiguous range of 512 pages (2MB / 4KB) to be free of PG_writeback, so that it's immediately reclaimable for use by a transparent huge page. This ruby script shows us the concrete numbers.

irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }

P(hit PG_writeback)	P(reclaimable for THP)
0.001			0.599
0.002			0.359
0.003			0.215
0.004			0.128
0.005			0.077
0.006			0.046
0.007			0.027
0.008			0.016
0.009			0.010
0.010			0.006

The numbers show that when the PG_writeback pages go up from 0.1% to 1% of system memory, the THP reclaim success ratio drops quickly from 60% to 0.6%. It indicates that in order to use THP without constantly running into stalls, the reasonable PG_writeback ratio is <= 0.1%. Going beyond that threshold, it quickly becomes intolerable. That makes a limit of 256MB writeback pages for a mem=256GB system.
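As a cross-check, the same table can be reproduced with a short Python sketch (this is just the formula above, with 512 = 2MB / 4KB base pages; purely illustrative, not kernel code):

```python
# Probability that a 512-page (2MB) physically contiguous range contains
# no PG_writeback page, assuming writeback pages are scattered randomly.
PAGES_PER_THP = (2 * 1024 * 1024) // 4096   # 512 base pages per huge page

def p_thp_reclaimable(p_hit_writeback):
    # every one of the 512 pages must miss PG_writeback
    return (1.0 - p_hit_writeback) ** PAGES_PER_THP

for i in range(1, 11):
    p = i / 1000.0
    print(f"{p:.3f}\t{p_thp_reclaimable(p):.3f}")
```

At p = 0.001 this prints 0.599 and at p = 0.010 it prints 0.006, matching the irb output above.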
Looking at the real vmstat:nr_writeback numbers in dd write tests:

JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback	217009
JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback	198335
JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback	306026
JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback	315099
JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback	1216058
JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback	895335

Oops, btrfs has over 4GB of writeback pages -- which asks for some bug fixing. Even ext4's 800MB still looks way too high, but that's ~1s worth of data per queue (or 130ms worth of data for the high-performance Intel SSD, which is perhaps in danger of queue underruns?). So this system would require 512GB of memory to comfortably run KVM instances with THP support. Judging from the above numbers, we can hardly afford to split up the IO queues and proliferate writeback pages. It's worth noting that running multiple flusher threads per bdi means not only disk seeks on spinning disks and smaller IO sizes on SSDs, but also lock contention and cache bouncing for metadata-heavy workloads and fast storage. To give some concrete examples of how much CPU overhead can be saved by reducing the number of IO submitters, here are some summaries of the IO-less dirty throttling gains. Tests show that it yields huge benefits in reducing IO seeks as well as CPU overheads. For example, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
(by Dave Chinner)

- "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code is able to call much less frequently into balance_dirty_pages() and hence access the _global_ page states)
- the user space "App overhead" is reduced by 20%, by avoiding the cacheline pollution by the complex writeback code path
- "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23.

And for simple dd tests

- "throughput for a _single_ large dd (100GB) increase from ~650MB/s to 700MB/s"
- "On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s."

> > > No, no, it's not about standing in my way. As Vivek said in the other > > > reply, it's that the "gap" that you filled was created *because* > > > writeback wasn't cgroup aware and now you're in turn filling that gap > > > by making writeback work around that "gap". I mean, my mind boggles. > > > Doesn't yours? I strongly believe everyone's should. > > > > Heh. It's a hard problem indeed. I felt great pains in the IO-less > > dirty throttling work. I did a lot of reasoning about it, and have in > > fact kept cgroup IO controller in mind since its early days. Now I'd > > say it's hands down for it to adapt to the gap between the total IO > > limit and what's carried out by the block IO controller. > > You're not providing any valid counter arguments about the issues > being raised about the messed up design. How is anything "hands down" > here? Yeah, sadly, it turns out to be not "hands down" when it comes to the proportional async/sync splits, and it's even prohibitive when it comes to hierarchical support. > > > There's where I'm confused. How is the said split supposed to work? > > > They aren't independent.
> > > I mean, who gets to decide what and where > > > are those decisions enforced? > > > > Yeah it's not independent. It's about > > > > - keep block IO cgroup untouched (in its current algorithm, for > > throttling direct IO) > > > > - let balance_dirty_pages() adapt to the throttling target > > > > buffered_write_limit = total_limit - direct_IOs > > Think about proportional allocation. You don't have a number until > you know who have pending IOs and how much. We have the IO rate. The above formula is actually working on "rates". That's good enough for calculating the ratelimit for buffered writes. We don't have to know every transient state of the pending IOs, because the direct IOs are handled by cfq based on cfqg weight, and for async IOs there are plenty of dirty pages to buffer and tolerate small errors in the dirty rate control. > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > It's always there doing 1:1 proportional throttling. Then you try to > > kick in to add *double* throttling in block/cfq layer. Now the low > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > from its balanced state, leading to large fluctuations and program > > stalls. > > Just do the same 1:1 inside each cgroup. Sure. But the ratio mismatch I'm talking about is inter-cgroup. For example, suppose there are only 2 dd tasks doing buffered writes in the system. Now consider the mismatch where cfq is dispatching their IO requests at 10:1 weights, while balance_dirty_pages() is throttling the dd tasks at a 1:1 equal split because it's not aware of the cgroup weights. What will happen in the end? The 1:1 ratio imposed by balance_dirty_pages() will take effect and the dd tasks will progress at the same pace. The cfq weights will be defeated because the async queue for the second dd (and cgroup) constantly runs empty. > > This can be avoided by telling balance_dirty_pages(): "your > > balance goal is no longer 1:1, but 10:1".
> > With this information > > balance_dirty_pages() will behave right. Then there is the question: > > if balance_dirty_pages() will work just well provided the information, > > why bother doing the throttling at low layer and "push back" the > > pressure all the way up? > > Because splitting a resource into two pieces arbitrarily with > different amount of consumptions on each side and then applying the > same proportion on both doesn't mean anything? Sorry, I don't quite catch your words here. > > The balance_dirty_pages() is already deeply involved in dirty throttling. > > As you can see from this patchset, the same algorithms can be extended > > trivially to work with cgroup IO limits. > > > > buffered write IO controller in balance_dirty_pages() > > https://lkml.org/lkml/2012/3/28/275 > > It is half broken thing with fundamental design flaws which can't be > corrected without complete reimplementation. I don't know what to > say. I'm fully aware of that, and so have been exploring new ways out :) > > In the "back pressure" scheme, memcg is a must because only it has all > > the infrastructure to track dirty pages upon which you can apply some > > dirty_limits. Don't tell me you want to account dirty pages in blkcg... > > For now, per-inode tracking seems good enough. There are actually two directions of information passing. 1) Pass the dirtier ownership down to the bio. For this part, it's mostly enough to do lightweight per-inode tracking. 2) Pass the backpressure up, from cfq (IO dispatch) to the flusher (IO submit) as well as to balance_dirty_pages() (to actually throttle the dirtying tasks). The flusher naturally works at inode granularity. However, balance_dirty_pages() is about limiting dirty pages. For this part, it needs to know the total number of dirty pages and the writeout bandwidth for each cgroup in order to do proper dirty throttling, and it needs to maintain a proper number of dirty pages to avoid the queue underrun issue explained in the above 2-dd example.
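To make the 2-dd mismatch concrete, here is a toy Python model (purely illustrative; the scheduler, numbers, and names are invented for this sketch and bear no relation to the actual cfq code): two dirtiers are fed at a 1:1 ratio by upstream throttling, the queues are drained work-conservingly with 10:1 weights, and completions nevertheless end up 1:1 because the preferred queue keeps running dry.

```python
# Toy model: balance_dirty_pages() feeds two dds 1:1 while the disk
# scheduler would prefer to serve them 10:1. Because service is
# work-conserving, completions track the 1:1 feed, not the weights.
def simulate(weights, feed, capacity, ticks):
    queued = [0.0] * len(weights)
    done = [0.0] * len(weights)
    for _ in range(ticks):
        for i, f in enumerate(feed):
            queued[i] += f                      # upstream throttling feeds the queues
        budget = capacity
        # serve the highest-weight queue first; spill leftover capacity
        for i in sorted(range(len(weights)), key=lambda k: -weights[k]):
            served = min(queued[i], budget)
            queued[i] -= served
            done[i] += served
            budget -= served
    return done

done = simulate(weights=[10, 1], feed=[5.0, 5.0], capacity=10.0, ticks=1000)
print(done)  # both dds complete the same amount of IO: [5000.0, 5000.0]
```

Feed the queues at the same 10:1 ratio as the weights and the completion ratio follows; the point of the sketch is only that whichever layer throttles more tightly wins.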
> > What I can see is, it looks pretty simple and natural to let > > balance_dirty_pages() fill the gap towards a total solution :-) > > > > - add direct IO accounting in some convenient point of the IO path > > IO submission or completion point, either is fine. > > > > - change several lines of the buffered write IO controller to > > integrate the direct IO rate into the formula to fit the "total > > IO" limit > > > > - in future, add more accounting as well as feedback control to make > > balance_dirty_pages() work with IOPS and disk time > > To me, you seem to be not addressing the issues I've been raising at > all and just repeating the same points again and again. If I'm > misunderstanding something, please point out. Hopefully the renewed patch can address some of your questions. It's a pity that I didn't think about the hierarchical requirement at the time; otherwise the complexity of the calculations would still look manageable. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-19 14:23 ` Fengguang Wu @ 2012-04-19 18:31 ` Vivek Goyal 0 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-19 18:31 UTC (permalink / raw) To: Fengguang Wu Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote: Hi Fengguang, [..] > > I don't know. What problems? AFAICS, the biggest issue is writeback > > of different inodes getting mixed resulting in poor performance, but > > if you think about it, that's about the frequency of switching cgroups > > and a problem which can and should be dealt with from block layer > > (e.g. use larger time slice if all the pending IOs are async). > > Yeah, increasing the time slice would help that case. In general it's not > merely the frequency of switching cgroups, if we take the hard disk's writeback > cache into account. Think about some inodes with async IO: A1, A2, > A3, ..., and inodes with sync IO: D1, D2, D3, ..., all from different > cgroups. When the root cgroup holds all async inodes, cfq may > schedule IO in an interleaved fashion like this > > A1, A1, A1, A2, A1, A2, ... > D1, D2, D3, D4, D5, D6, ... > > Now it becomes > > A1, A2, A3, A4, A5, A6, ... > D1, D2, D3, D4, D5, D6, ... > > The difference is that it's now switching async inodes each time. > At the cfq level the seek costs look the same; however, the disk's > writeback cache may help merge the data chunks from the same inode A1. > Well, it may cost some latency on spinning disks. But how about SSDs? They > can run deeper queues and benefit from large writes. Not sure what's the point here. Many things seem to be mixed up. If we start putting async queues in separate groups (in an attempt to provide fairness/service differentiation), then how much IO we dispatch from one async inode will directly depend on the slice time of that cgroup/queue.
So if you want longer dispatch from the same async inode, increasing the slice time will help. Also, the elevator merge logic anyway increases the size of async IO requests, and big requests are submitted to the device. If you are expecting that in every dispatch cycle we continue to dispatch requests from the same inode, then no, that's not possible. Too long a slice in the presence of sync IO is also not good. So if you are looking for high throughput while sacrificing fairness, you can switch to a mode where all async queues are put in the single root group. (Note: you will have to do reasonably fast switches between cgroups so that all the cgroups are able to do some writeout in a time window.) Writeback logic also submits a certain amount of writes from one inode and then switches to the next inode in an attempt to provide fairness. The same thing should be directly controllable by CFQ's notion of time slice, that is, continue to dispatch async IO from a cgroup/inode for an extended duration before switching. So what's the difference? One can achieve equivalent behavior at either layer (writeback/CFQ). > > > Writeback's duty is generating stream of async writes which can be > > served efficiently for the *cgroup* and keeping the buffer filled as > > necessary and chaining the backpressure from there to the actual > > dirtier. That's what writeback does without cgroup. Nothing > > fundamental changes with cgroup. It's just finer grained. > > Believe me, physically partitioning the dirty pages and async IO > streams comes at a big cost. It won't scale well in many ways. > > For one instance, splitting the request queues will drive up the number of > PG_writeback pages. Those pages have been the biggest source of > latency issues in various parts of the system. So PG_writeback pages are the ones which have been submitted for IO? So even now we generate PG_writeback pages across multiple inodes as we submit those pages for IO.
By keeping the number of request descriptors per group low, we can build back pressure early, and hence per inode/group we will not have too many PG_writeback pages. IOW, the number of PG_writeback pages will be controllable by the number of request descriptors. So how does the situation become worse when CFQ puts them in separate cgroups? > It's worth noting that running multiple flusher threads per bdi means > not only disk seeks on spinning disks and smaller IO sizes on SSDs, but also > lock contention and cache bouncing for metadata-heavy workloads and > fast storage. But we could still have a single flusher per bdi and just check the write congestion state of each group and back off if it is congested. So a single thread will still be doing IO submission; it's just that it will submit IO from multiple inodes/cgroups, which can cause additional seeks. And that's the tradeoff of fairness. What I am not able to understand is how you are avoiding this tradeoff by implementing things in the writeback layer. To achieve more fairness among groups, even a flusher thread will have to switch faster among cgroups/inodes. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120419183118.GM10216-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-20 12:45 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-20 12:45 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Hi Vivek, On Thu, Apr 19, 2012 at 02:31:18PM -0400, Vivek Goyal wrote: > On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote: > > Hi Fengguang, > > [..] > > > I don't know. What problems? AFAICS, the biggest issue is writeback > > > of different inodes getting mixed resulting in poor performance, but > > > if you think about it, that's about the frequency of switching cgroups > > > and a problem which can and should be dealt with from block layer > > > (e.g. use larger time slice if all the pending IOs are async). > > > > Yeah increasing time slice would help that case. In general it's not > > merely the frequency of switching cgroup if take hard disk' writeback > > cache into account. Think about some inodes with async IO: A1, A2, > > A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different > > cgroups. So when the root cgroup holds all async inodes, the cfq may > > schedule IO interleavely like this > > > > A1, A1, A1, A2, A1, A2, ... > > D1, D2, D3, D4, D5, D6, ... > > > > Now it becomes > > > > A1, A2, A3, A4, A5, A6, ... > > D1, D2, D3, D4, D5, D6, ... > > > > The difference is that it's now switching the async inodes each time. 
> > At the cfq level the seek costs look the same; however, the disk's > > writeback cache may help merge the data chunks from the same inode A1. > > Well, it may cost some latency on spinning disks. But how about SSDs? They > > can run deeper queues and benefit from large writes. > > Not sure what's the point here. Many things seem to be mixed up. > > If we start putting async queues in separate groups (in an attempt to > provide fairness/service differentiation), then how much IO we dispatch > from one async inode will directly depend on the slice time of that > cgroup/queue. So if you want longer dispatch from the same async inode, > increasing the slice time will help. Right. The problem is that the async slice time can hardly be increased when there is sync IO, as you said below. > Also, the elevator merge logic anyway increases the size of async IO requests, > and big requests are submitted to the device. > > If you are expecting that in every dispatch cycle we continue to dispatch > requests from the same inode, then no, that's not possible. Too long a slice > in the presence of sync IO is also not good. So if you are looking for > high throughput while sacrificing fairness, you can switch to a mode > where all async queues are put in the single root group. (Note: you will have > to do reasonably fast switches between cgroups so that all the cgroups are > able to do some writeout in a time window.) Agreed. > Writeback logic also submits a certain amount of writes from one inode > and then switches to the next inode in an attempt to provide fairness. The same > thing should be directly controllable by CFQ's notion of time slice, that > is, continue to dispatch async IO from a cgroup/inode for an extended duration > before switching. So what's the difference? One can achieve equivalent > behavior at either layer (writeback/CFQ). The difference is that the flusher's slice time is 500ms, while cfq's async slice time is 40ms.
In the one async queue case, cfq will switch back to serve the remaining data from the same inode; while in the split async queues case, cfq will switch to the other inodes. This makes the flusher's larger slice time somehow "useless". > > > Writeback's duty is generating stream of async writes which can be > > > served efficiently for the *cgroup* and keeping the buffer filled as > > > necessary and chaining the backpressure from there to the actual > > > dirtier. That's what writeback does without cgroup. Nothing > > > fundamental changes with cgroup. It's just finer grained. > > > > Believe me, physically partitioning the dirty pages and async IO > > streams comes at a big cost. It won't scale well in many ways. > > > > For one instance, splitting the request queues will drive up the number of > > PG_writeback pages. Those pages have been the biggest source of > > latency issues in various parts of the system. > > So PG_writeback pages are the ones which have been submitted for IO? So even Yes. > now we generate PG_writeback pages across multiple inodes as we submit > those pages for IO. By keeping the number of request descriptors per > group low, we can build back pressure early, and hence per inode/group > we will not have too many PG_writeback pages. IOW, the number of PG_writeback > pages will be controllable by the number of request descriptors. > So how does the situation become worse when CFQ puts them in > separate cgroups? Good question. Imagine there are 10 dds (each in one cgroup) dirtying pages and the flusher thread is issuing IO for them in round-robin fashion, issuing 500ms worth of data for each inode and then going on to the next. And imagine we keep a minimal global async queue size, which is just enough for holding the 500ms data from one inode. If it can be reduced to 40ms without leading to underruns or hurting things in other ways, then great.
Even if the queue size is much smaller than the flusher's write chunk size, the disk will still be serving inodes on a 500ms granularity, because the flusher won't feed cfq with other data during that time. Now consider moving to 10 async queues, each in one cfq group. Now each inode will need to have at least 40ms of data queued, so that when a new cfq async slice comes, it can get enough data to work with. Adding it up, (40ms per queue * 10 queues) = 400ms. It means that 400ms, which was more than enough in the global async queue scheme, is now only barely enough to avoid queue underrun. This creates a fundamental need to increase the total queued requests and hence PG_writeback pages. To avoid seeks we might do tricks to let cfq return to the same group serving the same async queue and repeat it 500ms/40ms times. However, the cfq vdisktime/weight system in general doesn't work that way. Once cgroup A gets served, its vdisktime will be increased and naturally some other cgroup's async queue gets selected. And it's hardly feasible to increase the async slice time to 500ms. Overall, the split async queues in cfq will be defeating the flusher's attempt to amortize IO, because the cfq groups are now walking through the inodes at a much more "fine grained" granularity: 40ms vs 500ms. > > It's worth noting that running multiple flusher threads per bdi means > > not only disk seeks on spinning disks and smaller IO sizes on SSDs, but also > > lock contention and cache bouncing for metadata-heavy workloads and > > fast storage. > > But we could still have a single flusher per bdi and just check the > write congestion state of each group and back off if it is congested. > > So a single thread will still be doing IO submission. Just that it will > submit IO from multiple inodes/cgroups which can cause additional seeks. Yes, we still have the good option of running one single flusher, except that its writeback chunk size should be reduced to match the 40ms async slice time and queue size mentioned above.
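The queue-footprint arithmetic above can be written out as a quick sanity check (toy numbers taken from this discussion, nothing measured; the floor is one async slice worth of data per queue):

```python
# Minimum queued async data needed to keep cfq fed: one shared queue
# versus one queue per cgroup. Numbers are the thread's toy figures.
async_slice_ms = 40   # cfq async time slice
ngroups = 10          # ten dd tasks, one cgroup each

# One global async queue: a single slice worth of data keeps cfq busy.
single_queue_floor_ms = async_slice_ms

# Split queues: every group must hold at least one slice worth of data,
# or a freshly scheduled async slice immediately underruns.
split_queue_floor_ms = async_slice_ms * ngroups

print(single_queue_floor_ms, split_queue_floor_ms)  # 40 400
```

So splitting the queues multiplies the floor on queued data (and hence PG_writeback pages) by the number of cgroups.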
So yes, running one single flusher will help reduce contention, however it cannot avoid the smaller IO sizes. > And that's the tradeoff of fairness. What I am not able to understand > is how you are avoiding this tradeoff by implementing things in > the writeback layer. To achieve more fairness among groups, even a flusher > thread will have to switch faster among cgroups/inodes. Fairness is only a problem for the cfq groups. cfq by nature works on sub-100ms granularities and switches between groups at that frequency. If it gives each cgroup 500ms and there are 10 cgroups, latency will become uncontrollable. If we still keep the global async queue, cfq can run small 40ms slices without defeating the flusher's 500ms granularity. After each slice it can freely switch to other cgroups with sync IOs, so it is free from latency issues. After returning, it will continue to serve the same inode. It will basically be working on behalf of one cgroup for 500ms of data, working for another cgroup for 500ms of data, and so on. That behavior does not impact fairness, because it's still using small slices and its weight is computed system-wide, thus exhibiting a smoothing/amortizing effect over long periods of time. It can naturally serve the same inode after returning. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-19 18:31 ` Vivek Goyal @ 2012-04-20 12:45 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-20 12:45 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hi Vivek, On Thu, Apr 19, 2012 at 02:31:18PM -0400, Vivek Goyal wrote: > On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote: > > Hi Fengguang, > > [..] > > > I don't know. What problems? AFAICS, the biggest issue is writeback > > > of different inodes getting mixed resulting in poor performance, but > > > if you think about it, that's about the frequency of switching cgroups > > > and a problem which can and should be dealt with from block layer > > > (e.g. use larger time slice if all the pending IOs are async). > > > > Yeah increasing time slice would help that case. In general it's not > > merely the frequency of switching cgroup if take hard disk' writeback > > cache into account. Think about some inodes with async IO: A1, A2, > > A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different > > cgroups. So when the root cgroup holds all async inodes, the cfq may > > schedule IO interleavely like this > > > > A1, A1, A1, A2, A1, A2, ... > > D1, D2, D3, D4, D5, D6, ... > > > > Now it becomes > > > > A1, A2, A3, A4, A5, A6, ... > > D1, D2, D3, D4, D5, D6, ... > > > > The difference is that it's now switching the async inodes each time. > > At cfq level, the seek costs look the same, however the disk's > > writeback cache may help merge the data chunks from the same inode A1. > > Well, it may cost some latency for spin disks. But how about SSD? It > > can run deeper queue and benefit from large writes. > > Not sure what's the point here. Many things seem to be mixed up. 
> > If we start putting async queues in separate groups (in an attempt to > provide fairness/service differentiation), then how much IO we dispatch > from one async inode will directly depend on slice time of that > cgroup/queue. So if you want longer dispatch from same async inode > increasing slice time will help. Right. The problem is async slice time can hardly be increased when there are sync IO, as you said below. > Also elevator merge logic anyway increses the size of async IO requests > and big requests are submitted to device. > > If you are looking that in every dispatch cycle we continue to dispatch > request from same inode, yes that's not possible. Too huge a slice length > in presence of sync IO is also not good. So if you are looking for > high throughput and sacrificing fairness then you can switch to mode > where all async queues are put in single root group. (Note: you will have > to do reasonably fast switch between cgroups so that all the cgroups are > able to do some writeout in a time window). Agreed. > Writeback logic also submits a certain amount of writes from one inode > and then switches to next inode in an attempt to provide fairness. Same > thing should be directly controllable by CFQ's notion of time slice. That > is continue to dispatch async IO from a cgroup/inode for extended durtaion > before switching. So what's the difference. One can achieve equivalent > behavior at any layer (writeback/CFQ). The difference is, the flusher's slice time is 500ms, while the cfq's async slice time is 40ms. In the one async queue case, cfq will switch back to serve the remaining data from the same inode; while in split async queues case, cfq will switch to the other inodes. This makes the flusher's larger slice time somehow "useless". 
> > > Writeback's duty is generating a stream of async writes which can be > > > served efficiently for the *cgroup*, keeping the buffer filled as > > > necessary, and chaining the backpressure from there to the actual > > > dirtier. That's what writeback does without cgroup. Nothing > > > fundamental changes with cgroup. It's just finer grained. > > > > Believe me, physically partitioning the dirty pages and async IO > > streams comes at big costs. It won't scale well in many ways. > > > > For one instance, splitting the request queues will give rise to more > > PG_writeback pages. Those pages have been the biggest source of > > latency issues in the various parts of the system. > > So PG_writeback pages are the ones which have been submitted for IO? Yes. > So even now we generate PG_writeback pages across multiple inodes as we submit > those pages for IO. By keeping the number of request descriptors per > group low, we can build back pressure early and hence per inode/group > we will not have too many PG_Writeback pages. IOW, the number of PG_Writeback > pages will be controllable by the number of request descriptors. > So how does the situation become worse in the case of CFQ putting them in > separate cgroups? Good question. Imagine there are 10 dds (each in one cgroup) dirtying pages, and the flusher thread is issuing IO for them in round-robin fashion, issuing 500ms worth of data for each inode and then going on to the next. And imagine we keep a minimal global async queue size, which is just enough for holding the 500ms of data from one inode. If that can be reduced to 40ms without leading to underrun or hurting things in other ways, then great. Even if the queue size is much smaller than the flusher's write chunk size, the disk will still be serving inodes on a 500ms granularity, because the flusher won't feed cfq with other data during that time. Now consider moving to 10 async queues, each in one cfq group. 
Now each inode will need to have at least 40ms of data queued, so that when a new cfq async slice comes, it can get enough data to work with. Adding it up, (40ms per queue * 10 queues) = 400ms. It means that 400ms worth of data, which was more than enough under the global async queue scheme, is now only barely enough to avoid queue underrun. This creates a fundamental need to increase the total number of queued requests and hence PG_writeback pages. To avoid seeks we might do tricks to let cfq return to the same group serving the same async queue and repeat that 500ms/40ms (about 12) times. However, the cfq vdisktime/weight system in general doesn't work that way. Once cgroup A gets served, its vdisktime will be increased and naturally some other cgroup's async queue gets selected. And it's hardly feasible to increase the async slice time to 500ms. Overall, the split async queues in cfq will be defeating the flusher's attempt to amortize IO, because the cfq groups are now walking through the inodes at a much more "fine grained" granularity: 40ms vs 500ms. > > It's worth noting that running multiple flusher threads per bdi means > > not only disk seeks for spinning disks and smaller IO sizes for SSDs, but also > > lock contention and cache bouncing for metadata-heavy workloads and > > fast storage. > > But we could still have a single flusher per bdi and just check the > write congestion state of each group and back off if it is congested. > > So a single thread will still be doing IO submission. Just that it will > submit IO from multiple inodes/cgroups which can cause additional seeks. Yes, we still have the good option to run one single flusher. Except that its writeback chunk size should be reduced to match the 40ms async slice time and queue size mentioned above. So yes, running one single flusher will help reduce contention; however, it cannot avoid the smaller IO size. > And that's the tradeoff of fairness. What I am not able to understand > is how you are avoiding this tradeoff by implementing things in the > writeback layer. 
> To achieve more fairness among groups, even a flusher > thread will have to switch faster among cgroups/inodes. Fairness is only a problem for the cfq groups. cfq by nature works on sub-100ms granularities and switches between groups at that frequency. If it gives each cgroup 500ms and there are 10 cgroups, latency will become uncontrollable. If we still keep the global async queue, cfq can run small 40ms slices without defeating the flusher's 500ms granularity. After each slice it can freely switch to other cgroups with sync IOs, so it is free from latency issues. After returning, it will continue to serve the same inode. It will basically be working on behalf of one cgroup for 500ms of data, working for another cgroup for 500ms of data, and so on. That behavior does not impact fairness, because it's still using small slices and its weight is computed system-wide, thus exhibiting a smoothing/amortizing effect over a long period of time. It can naturally serve the same inode after returning. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
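[Editor's note] The queue-depth arithmetic in Fengguang's message can be sketched as a back-of-envelope model. The 500ms write chunk and 40ms async slice are the example figures used in this thread, and the helper below is purely illustrative, not kernel code:

```python
# Toy model of the minimum amount of dirty data (expressed in ms of disk
# time) that must sit in the async queue(s) to avoid underrun, comparing
# one global async queue vs per-cgroup async queues.

def min_queued_ms(num_cgroups, flusher_chunk_ms=500, cfq_slice_ms=40,
                  split_queues=False):
    """Return the milliseconds' worth of data that must be queued so
    CFQ never runs dry, under the simplifying assumptions discussed
    in this thread."""
    if not split_queues:
        # One global async queue: it only ever needs the current
        # flusher chunk of a single inode at a time.
        return flusher_chunk_ms
    # Split queues: every cgroup's queue must hold at least one CFQ
    # slice of data so a freshly selected group has work to dispatch.
    return cfq_slice_ms * num_cgroups

# 10 dd writers, one per cgroup, as in the example:
print(min_queued_ms(10))                     # 500: one flusher chunk
print(min_queued_ms(10, split_queues=True))  # 400: 40ms * 10 queues
```

With 10 groups, the split scheme already needs 400ms worth of queued data just to stay above underrun, and the requirement grows linearly with the number of cgroups, while the global queue's requirement stays fixed at one flusher chunk. That growth is the "more queued requests and hence PG_writeback pages" cost the message describes.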
* Re: [RFC] writeback and cgroup 2012-04-20 12:45 ` Fengguang Wu @ 2012-04-20 19:29 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-20 19:29 UTC (permalink / raw) To: Fengguang Wu Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Fri, Apr 20, 2012 at 08:45:18PM +0800, Fengguang Wu wrote: [..] > If we still keep the global async queue, cfq can run small 40ms slices > without defeating the flusher's 500ms granularity. After each slice > it can freely switch to other cgroups with sync IOs, so it is free from > latency issues. After returning, it will continue to serve the same > inode. It will basically be working on behalf of one cgroup for 500ms > of data, working for another cgroup for 500ms of data, and so on. That > behavior does not impact fairness, because it's still using small > slices and its weight is computed system-wide, thus exhibiting a > smoothing/amortizing effect over a long period of time. It can naturally > serve the same inode after returning. OK, so Tejun did say that we will have a switch where we will allow retaining the old behavior of keeping all async writes in the root group and not in individual groups. So throughput-sensitive users can make use of that, and there is no need to push proportional IO logic to the writeback layer for buffered writes? I am personally not too excited about the case of putting async IO in separate groups, for the reason that async IO of one group will start impacting latencies of sync IO of another group, and in practice it might not be desirable. But there are others who have use cases for a separate async IO queue. So as long as the switch is there to change the behavior, I am not too worried. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-20 19:29 ` Vivek Goyal @ 2012-04-20 21:33 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-20 21:33 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote: > I am personally is not too excited about the case of putting async IO > in separate groups due to the reason that async IO of one group will > start impacting latencies of sync IO of another group and in practice > it might not be desirable. But there are others who have use cases for > separate async IO queue. So as long as switch is there to change the > behavior, I am not too worried. Why not just fix cfq so that it prefers groups w/ sync IOs? -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-20 21:33 ` Tejun Heo @ 2012-04-22 14:26 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-22 14:26 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Fri, Apr 20, 2012 at 02:33:01PM -0700, Tejun Heo wrote: > On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote: > > I am personally is not too excited about the case of putting async IO > > in separate groups due to the reason that async IO of one group will > > start impacting latencies of sync IO of another group and in practice > > it might not be desirable. But there are others who have use cases for > > separate async IO queue. So as long as switch is there to change the > > behavior, I am not too worried. > > Why not just fix cfq so that it prefers groups w/ sync IOs? There may be a sync+async group in front, but when we switch into it, it may decide to give its async queue a run. That's not necessarily a bad decision, but we do lose some control here. ^ permalink raw reply [flat|nested] 261+ messages in thread
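[Editor's note] Fengguang's "lose some control" point can be illustrated with a toy service-tree walk. The structures and the intra-group policy below are hypothetical; real CFQ group and queue selection is far more involved:

```python
# Toy: groups are ordered by vdisktime; the front group holds both a
# sync and an async queue. Even a policy that "prefers sync" at the
# group level can still dispatch async IO ahead of another group's
# sync IO, because the intra-group queue choice is a separate decision.

from dataclasses import dataclass, field

@dataclass
class Group:
    name: str
    vdisktime: int
    sync_queue: list = field(default_factory=list)
    async_queue: list = field(default_factory=list)

def pick_next_io(groups):
    """Select the front group by vdisktime, then let that group decide
    which of its queues to run (here: a crude starvation guard that
    favors async when the sync backlog is short)."""
    g = min(groups, key=lambda grp: grp.vdisktime)
    # Intra-group decision: give the async queue a run when sync is
    # nearly drained, so async writes are not starved forever.
    if g.async_queue and len(g.sync_queue) <= 1:
        return g.name, "async", g.async_queue.pop(0)
    if g.sync_queue:
        return g.name, "sync", g.sync_queue.pop(0)
    return g.name, "async", g.async_queue.pop(0)

groups = [
    Group("A", vdisktime=10, sync_queue=["a-sync-1"], async_queue=["a-async-1"]),
    Group("B", vdisktime=20, sync_queue=["b-sync-1"]),
]
print(pick_next_io(groups))  # group A runs its async queue first,
                             # delaying B's sync IO
```

The point of the toy is that a group-level preference for sync does not fully determine what gets dispatched: once a sync+async group is selected, its internal anti-starvation logic can still put async IO on the wire ahead of another group's sync request.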
* Re: [RFC] writeback and cgroup 2012-04-20 21:33 ` Tejun Heo @ 2012-04-23 12:30 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-23 12:30 UTC (permalink / raw) To: Tejun Heo Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Fri, Apr 20, 2012 at 02:33:01PM -0700, Tejun Heo wrote: > On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote: > > I am personally is not too excited about the case of putting async IO > > in separate groups due to the reason that async IO of one group will > > start impacting latencies of sync IO of another group and in practice > > it might not be desirable. But there are others who have use cases for > > separate async IO queue. So as long as switch is there to change the > > behavior, I am not too worried. > > Why not just fix cfq so that it prefers groups w/ sync IOs? Yes, that could possibly be done, but now that's a change of requirements. Now we are saying that I want one buffered write to go faster than another buffered write only if there is no sync IO present in any of the groups. Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-23 12:30 ` Vivek Goyal @ 2012-04-23 16:04 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-23 16:04 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hello, Vivek. On Mon, Apr 23, 2012 at 08:30:11AM -0400, Vivek Goyal wrote: > On Fri, Apr 20, 2012 at 02:33:01PM -0700, Tejun Heo wrote: > > On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote: > > > I am personally is not too excited about the case of putting async IO > > > in separate groups due to the reason that async IO of one group will > > > start impacting latencies of sync IO of another group and in practice > > > it might not be desirable. But there are others who have use cases for > > > separate async IO queue. So as long as switch is there to change the > > > behavior, I am not too worried. > > > > Why not just fix cfq so that it prefers groups w/ sync IOs? > > Yes that could possibly be done but now that's change of requirements. Now > we are saying that I want one buffered write to go faster than other > buffered write only if there is no sync IO present in any of the groups. It's a scheduling decision and the resource split may or may not be about latency (the faster part). We're currently just shoving all asyncs into the root group and preferring sync IOs in general. The other end would be keeping them completely siloed and not caring about [a]sync across different cgroups. My point is that managing async IOs per cgroup doesn't mean we can't prioritize sync IOs in general. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
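[Editor's note] One way to read Tejun's suggestion is a selection pass that keeps async IO in per-cgroup queues (so per-cgroup accounting and throttling still work) but biases group choice toward groups with pending sync IO. This is a hypothetical sketch of that policy, not CFQ's actual service-tree code:

```python
# Toy "prefer sync groups" selection: per-cgroup async queues are kept,
# but any group with pending sync IO is scheduled ahead of all
# async-only groups; ties are broken by vdisktime as usual.

def select_group(groups):
    """groups: list of (name, vdisktime, has_sync, has_async) tuples.
    Return the name of the group to serve next: lowest vdisktime among
    sync-bearing groups, falling back to async-only groups."""
    runnable = [g for g in groups if g[2] or g[3]]
    sync_groups = [g for g in runnable if g[2]]
    pool = sync_groups if sync_groups else runnable
    return min(pool, key=lambda g: g[1])[0]

groups = [
    ("async-only-early", 5,  False, True),   # earliest vdisktime, async only
    ("sync-late",        30, True,  False),  # later vdisktime, has sync IO
]
print(select_group(groups))  # "sync-late": sync IO jumps the async-only group
```

This captures the tradeoff in the exchange: sync IO keeps its latency advantage across the whole device, yet async IO from different cgroups still lands in distinct queues, so buffered-write differentiation applies whenever no sync IO is pending, which is exactly the "change of requirements" Vivek points out.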
* Re: [RFC] writeback and cgroup [not found] ` <20120420192930.GR22419-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-04-20 21:33 ` Tejun Heo 0 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-20 21:33 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote: > I am personally is not too excited about the case of putting async IO > in separate groups due to the reason that async IO of one group will > start impacting latencies of sync IO of another group and in practice > it might not be desirable. But there are others who have use cases for > separate async IO queue. So as long as switch is there to change the > behavior, I am not too worried. Why not just fix cfq so that it prefers groups w/ sync IOs? -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-20 12:45 ` Fengguang Wu (?) (?) @ 2012-04-20 19:29 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-20 19:29 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Fri, Apr 20, 2012 at 08:45:18PM +0800, Fengguang Wu wrote: [..] > If still keep the global async queue, it can run small 40ms slices > without defeating the flusher's 500ms granularity. After each slice > it can freely switch to other cgroups with sync IOs, so is free from > latency issues. After return, it will continue to serve the same > inode. It will basically be working on behalf of one cgroup for 500ms > data, working for another cgroup for 500ms data and so on. That > behavior does not impact fairness, because it's still using small > slices and its weight is computed system wide thus exhibits some kind > of smooth/amortize effects over long period of time. It can naturally > serve the same inode after return. OK, so Tejun did say that we will have a switch that allows retaining the old behavior of keeping all async writes in the root group and not in individual groups. So throughput-sensitive users can make use of that, and there is no need to push proportional IO logic into the writeback layer for buffered writes? Personally, I am not too excited about putting async IO in separate groups, because async IO of one group will start impacting the latencies of sync IO of another group, which in practice may not be desirable. But there are others who have use cases for a separate async IO queue. So as long as a switch is there to change the behavior, I am not too worried. Thanks, Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
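Fengguang's amortized-fairness argument quoted above can be made concrete with a toy simulation: short async slices (40ms in the mail), with each slice granted to the group that has received the least weighted service so far. Over a long run the service ratio converges to the weight ratio even though individual slices are small. The numbers and function names here are invented for illustration; this is not the real CFQ slice machinery.

```python
# Toy simulation of short-slice, weight-amortized async scheduling.
# Illustrative only; real CFQ slice accounting is more involved.
SLICE_MS = 40

def run_slices(weights, total_ms):
    """Grant SLICE_MS slices, always to the group with the least
    weighted service; return total ms served per group."""
    service = {g: 0 for g in weights}
    vtime = {g: 0.0 for g in weights}  # weighted service accumulated
    t = 0
    while t < total_ms:
        g = min(vtime, key=vtime.get)  # least weighted service so far
        service[g] += SLICE_MS
        vtime[g] += SLICE_MS / weights[g]
        t += SLICE_MS
    return service
```

Running two groups at 10:1 weights for a few seconds yields a service split very close to 10:1, which is the "smooth/amortize effects over long period of time" Fengguang refers to.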
* Re: [RFC] writeback and cgroup 2012-04-19 14:23 ` Fengguang Wu ` (2 preceding siblings ...) (?) @ 2012-04-19 18:31 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-19 18:31 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote: Hi Fengguang, [..] > > I don't know. What problems? AFAICS, the biggest issue is writeback > > of different inodes getting mixed resulting in poor performance, but > > if you think about it, that's about the frequency of switching cgroups > > and a problem which can and should be dealt with from block layer > > (e.g. use larger time slice if all the pending IOs are async). > > Yeah increasing time slice would help that case. In general it's not > merely the frequency of switching cgroup if take hard disk' writeback > cache into account. Think about some inodes with async IO: A1, A2, > A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different > cgroups. So when the root cgroup holds all async inodes, the cfq may > schedule IO interleavely like this > > A1, A1, A1, A2, A1, A2, ... > D1, D2, D3, D4, D5, D6, ... > > Now it becomes > > A1, A2, A3, A4, A5, A6, ... > D1, D2, D3, D4, D5, D6, ... > > The difference is that it's now switching the async inodes each time. > At cfq level, the seek costs look the same, however the disk's > writeback cache may help merge the data chunks from the same inode A1. > Well, it may cost some latency for spin disks. But how about SSD? It > can run deeper queue and benefit from large writes. 
I am not sure what the point is here; many things seem to be mixed up. If we start putting async queues in separate groups (in an attempt to provide fairness/service differentiation), then how much IO we dispatch from one async inode will directly depend on the slice time of that cgroup/queue. So if you want longer dispatches from the same async inode, increasing the slice time will help. Also, the elevator merge logic increases the size of async IO requests anyway, and big requests are submitted to the device. If you expect that in every dispatch cycle we continue to dispatch requests from the same inode, no, that's not possible. Too long a slice in the presence of sync IO is also not good. So if you want high throughput at the cost of fairness, you can switch to the mode where all async queues are put in a single root group. (Note: you will have to switch between cgroups reasonably fast so that all the cgroups are able to do some writeout in a time window.) Writeback logic also submits a certain amount of writes from one inode and then switches to the next inode in an attempt to provide fairness. The same thing should be directly controllable by CFQ's notion of time slice, that is, continue to dispatch async IO from a cgroup/inode for an extended duration before switching. So what's the difference? One can achieve equivalent behavior at either layer (writeback/CFQ). > > Writeback's duty is generating stream of async writes which can be > > served efficiently for the *cgroup* and keeping the buffer filled as > > necessary and chaining the backpressure from there to the actual > > dirtier. That's what writeback does without cgroup. Nothing > > fundamental changes with cgroup. It's just finer grained. > > Believe me, physically partitioning the dirty pages and async IO > streams comes at big costs. It won't scale well in many ways. > > For one instance, splitting the request queues will give rise to > PG_writeback pages. Those pages have been the biggest source of > latency issues in the various parts of the system. So PG_writeback pages are the ones which have been submitted for IO? Even now we generate PG_writeback pages across multiple inodes as we submit those pages for IO. By keeping the number of request descriptors per group low, we can build back pressure early, and hence per inode/group we will not have too many PG_Writeback pages. IOW, the number of PG_Writeback pages will be controllable by the number of request descriptors. So how does the situation become worse with CFQ putting them in separate cgroups? > It's worth to note that running multiple flusher threads per bdi means > not only disk seeks for spin disks, smaller IO size for SSD, but also > lock contentions and cache bouncing for metadata heavy workloads and > fast storage. But we could still have a single flusher per bdi and just check the write congestion state of each group and back off if it is congested. So a single thread will still be doing IO submission; it will just submit IO from multiple inodes/cgroups, which can cause additional seeks. And that's the tradeoff of fairness. What I am not able to understand is how you are avoiding this tradeoff by implementing things in the writeback layer. To achieve more fairness among groups, even a flusher thread will have to switch faster among cgroups/inodes. Thanks, Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
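Vivek's back-pressure point — keeping request descriptors per group low bounds the PG_writeback pages each group can accumulate — can be sketched with a small bounded per-group request pool. The class and method names are hypothetical; the real block layer uses per-queue request lists and congestion flags, not this API.

```python
# Sketch of per-group request-descriptor back pressure (hypothetical
# API, not the kernel's). A full pool makes the submitter wait, which
# caps in-flight (PG_writeback) pages for that group.
class RequestPool:
    def __init__(self, nr_requests):
        self.nr_requests = nr_requests  # descriptors for this group
        self.in_flight = 0

    def try_submit(self):
        """Submitter path: returns False (i.e. the caller would block)
        when the pool is exhausted — back pressure builds here."""
        if self.in_flight >= self.nr_requests:
            return False
        self.in_flight += 1  # page goes under writeback at this point
        return True

    def complete(self):
        """IO completion frees a descriptor for the next submitter."""
        assert self.in_flight > 0
        self.in_flight -= 1
```

With `nr_requests` small, a group can never have more than `nr_requests` pages under writeback at once, which is the controllability Vivek is pointing at.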
* Re: [RFC] writeback and cgroup 2012-04-19 14:23 ` Fengguang Wu (?) @ 2012-04-19 20:26 ` Jan Kara -1 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-19 20:26 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA On Thu 19-04-12 22:23:43, Wu Fengguang wrote: > For one instance, splitting the request queues will give rise to > PG_writeback pages. Those pages have been the biggest source of > latency issues in the various parts of the system. Well, if we allow more requests to be in flight in total then yes, number of PG_Writeback pages can be higher as well. > It's not uncommon for me to see filesystems sleep on PG_writeback > pages during heavy writeback, within some lock or transaction, which in > turn stall many tasks that try to do IO or merely dirty some page in > memory. Random writes are especially susceptible to such stalls. The > stable page feature also vastly increase the chances of stalls by > locking the writeback pages. > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > the case of direct reclaim, it means blocking random tasks that are > allocating memory in the system. > > PG_writeback pages are much worse than PG_dirty pages in that they are > not movable. This makes a big difference for high-order page allocations. > To make room for a 2MB huge page, vmscan has the option to migrate > PG_dirty pages, but for PG_writeback it has no better choices than to > wait for IO completion. 
> > The difficulty of THP allocation goes up *exponentially* with the > number of PG_writeback pages. Assume PG_writeback pages are randomly > distributed in the physical memory space. Then we have formula > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 Well, this implicitly assumes that PG_Writeback pages are scattered across memory uniformly at random. I'm not sure to which extent this is true... Also as a nitpick, this isn't really an exponential growth since the exponent is fixed (256 - actually it should be 512, right?). It's just a polynomial with a big exponent. But sure, growth in the number of PG_Writeback pages will cause a relatively steep drop in the number of available huge pages. ... > It's worth to note that running multiple flusher threads per bdi means > not only disk seeks for spin disks, smaller IO size for SSD, but also > lock contentions and cache bouncing for metadata heavy workloads and > fast storage. Well, this heavily depends on the particular implementation (and chosen data structures). But yes, we should have that in mind. ... > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > It's always there doing 1:1 proportional throttling. Then you try to > > > kick in to add *double* throttling in block/cfq layer. Now the low > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > from its balanced state, leading to large fluctuations and program > > > stalls. > > > > Just do the same 1:1 inside each cgroup. > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > For example there are only 2 dd tasks doing buffered writes in the > system. Now consider the mismatch that cfq is dispatching their IO > requests at 10:1 weights, while balance_dirty_pages() is throttling > the dd tasks at 1:1 equal split because it's not aware of the cgroup > weights. > > What will happen in the end?
The 1:1 ratio imposed by > balance_dirty_pages() will take effect and the dd tasks will progress > at the same pace. The cfq weights will be defeated because the async > queue for the second dd (and cgroup) constantly runs empty. Yup. This just shows that you have to have per-cgroup dirty limits. Once you have those, things start working again. Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
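With Jan's corrections folded in — 512 4k pages per 2MB huge page, and the complement inside the power — the formula the thread is reaching for is P(reclaimable) = (1 − p)^512, where p is the fraction of pages under writeback, under the uniform-scatter assumption both of them flag as questionable. A quick numeric check shows the steep drop being described:

```python
# P(a 2MB region is reclaimable) when each 4k page is independently
# under writeback with probability p (uniform-scatter assumption).
def p_reclaimable(p_writeback, pages_per_huge=512):
    return (1.0 - p_writeback) ** pages_per_huge

# p = 0.1% of pages under writeback -> ~60% of regions reclaimable;
# p = 1%  of pages under writeback -> ~0.6% of regions reclaimable.
```

Note that for these particular values a 10x increase in p does cost roughly a 100x drop in reclaimable regions (since (1−p)^512 ≈ e^(−512p)), though the multiplier depends on where p starts — which is why Jan's "polynomial, but steep" framing and Fengguang's "exponential" framing are both defensible.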
* Re: [RFC] writeback and cgroup 2012-04-19 20:26 ` Jan Kara @ 2012-04-20 13:34 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-20 13:34 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote: > On Thu 19-04-12 22:23:43, Wu Fengguang wrote: > > For one instance, splitting the request queues will give rise to > > PG_writeback pages. Those pages have been the biggest source of > > latency issues in the various parts of the system. > Well, if we allow more requests to be in flight in total then yes, number > of PG_Writeback pages can be higher as well. Exactly. > > It's not uncommon for me to see filesystems sleep on PG_writeback > > pages during heavy writeback, within some lock or transaction, which in > > turn stall many tasks that try to do IO or merely dirty some page in > > memory. Random writes are especially susceptible to such stalls. The > > stable page feature also vastly increase the chances of stalls by > > locking the writeback pages. > > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > > the case of direct reclaim, it means blocking random tasks that are > > allocating memory in the system. > > > > PG_writeback pages are much worse than PG_dirty pages in that they are > > not movable. This makes a big difference for high-order page allocations. > > To make room for a 2MB huge page, vmscan has the option to migrate > > PG_dirty pages, but for PG_writeback it has no better choices than to > > wait for IO completion. > > > > The difficulty of THP allocation goes up *exponentially* with the > > number of PG_writeback pages. Assume PG_writeback pages are randomly > > distributed in the physical memory space. 
Then we have formula > > > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 > Well, this implicitely assumes that PG_Writeback pages are scattered > across memory uniformly at random. I'm not sure to which extent this is > true... Yeah, when describing the problem I was also thinking about the possibilities of optimization (it would be a very good general improvement). Or maybe Mel already has some solutions :) > Also as a nitpick, this isn't really an exponential growth since > the exponent is fixed (256 - actually it should be 512, right?). It's just Right, 512 4k pages to form one x86_64 2MB huge page. > a polynomial with a big exponent. But sure, growth in number of PG_Writeback > pages will cause relatively steep drop in the number of available huge > pages. It's exponential indeed, because "1 - p(x)" here means "p(!x)". It's effectively exponential: a 10x increase in x can result in a ~100x drop in y. > ... > > It's worth to note that running multiple flusher threads per bdi means > > not only disk seeks for spin disks, smaller IO size for SSD, but also > > lock contentions and cache bouncing for metadata heavy workloads and > > fast storage. > Well, this heavily depends on particular implementation (and chosen > data structures). But yes, we should have that in mind. The lock contentions and cache bouncing actually mainly happen in fs code due to concurrent IO submissions. Also, when replying to Vivek's email I realized that the disk seeks and/or smaller IO size are more fundamentally tied to the split async queues in cfq, which make it switch inodes on every async slice time (typically 40ms). > ... > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > kick in to add *double* throttling in block/cfq layer. Now the low > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > from its balanced state, leading to large fluctuations and program > > > > stalls. > > > > > > Just do the same 1:1 inside each cgroup. > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > > For example there are only 2 dd tasks doing buffered writes in the > > system. Now consider the mismatch that cfq is dispatching their IO > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > the dd tasks at 1:1 equal split because it's not aware of the cgroup > > weights. > > > > What will happen in the end? The 1:1 ratio imposed by > > balance_dirty_pages() will take effect and the dd tasks will progress > > at the same pace. The cfq weights will be defeated because the async > > queue for the second dd (and cgroup) constantly runs empty. > Yup. This just shows that you have to have per-cgroup dirty limits. Once > you have those, things start working again. Right. I think Tejun was more or less aware of this. I was rather upset by this per-memcg dirty_limit idea indeed. I never expected it to work well when used extensively. My plan was to set the default memcg dirty_limit high enough, so that it's not hit in normal use. Then Tejun came and proposed to (mis-)use dirty_limit as the way to convert the dirty pages' backpressure into a real dirty throttling rate. No, that's just a crazy idea! Come on, let's not over-use memcg's dirty_limit. It's there as the *last resort* to keep dirty pages under control so as to maintain interactive performance inside the cgroup. However, if used extensively in the system (like dozens of memcgs all hitting their dirty limits), the limit itself may stall random dirtiers and create interactive performance issues! In recent days I've come up with the idea of memcg.dirty_setpoint for the blkcg backpressure stuff. We can use that instead. memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. Imagine bdi_setpoint. It's all the same concept. Why do we need this? Because if blkcgs A and B have 10:1 weights and are both doing buffered writes, their dirty pages had better be maintained around a 10:1 ratio to avoid underrun and hopefully achieve better IO size. memcg.dirty_limit cannot guarantee that goal. But be warned! Partitioning the dirty pages always means more fluctuations of dirty rates (and even stalls) that are perceivable by the user, which is another limiting factor for the backpressure-based IO controller to scale well. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
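Fengguang's memcg.dirty_setpoint idea can be sketched as a proportional split of the global dirty setpoint by per-cgroup writeout rate, so that a 10:1 weight split keeps dirty pages near 10:1 and the heavier group's async queue does not run empty. The names dirty_setpoint and writeout_rate follow the mail, but this function and its signature are hypothetical, not a real kernel interface.

```python
# Hypothetical sketch of Fengguang's memcg.dirty_setpoint idea: split
# the global dirty setpoint proportionally to measured per-cgroup
# writeout rates (only the ratios matter, not the units).
def dirty_setpoints(global_setpoint_pages, writeout_rates):
    total = sum(writeout_rates.values())
    if total == 0:  # no writeout measured yet: fall back to even split
        n = len(writeout_rates)
        return {cg: global_setpoint_pages // n for cg in writeout_rates}
    return {cg: int(global_setpoint_pages * r / total)
            for cg, r in writeout_rates.items()}
```

For two cgroups writing out at a 10:1 rate, the dirty-page targets come out 10:1 as well, which is exactly the invariant the mail says memcg.dirty_limit alone cannot guarantee.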
* Re: [RFC] writeback and cgroup @ 2012-04-20 13:34 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-20 13:34 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote: > On Thu 19-04-12 22:23:43, Wu Fengguang wrote: > > For one instance, splitting the request queues will give rise to > > PG_writeback pages. Those pages have been the biggest source of > > latency issues in the various parts of the system. > Well, if we allow more requests to be in flight in total then yes, number > of PG_Writeback pages can be higher as well. Exactly. > > It's not uncommon for me to see filesystems sleep on PG_writeback > > pages during heavy writeback, within some lock or transaction, which in > > turn stall many tasks that try to do IO or merely dirty some page in > > memory. Random writes are especially susceptible to such stalls. The > > stable page feature also vastly increase the chances of stalls by > > locking the writeback pages. > > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > > the case of direct reclaim, it means blocking random tasks that are > > allocating memory in the system. > > > > PG_writeback pages are much worse than PG_dirty pages in that they are > > not movable. This makes a big difference for high-order page allocations. > > To make room for a 2MB huge page, vmscan has the option to migrate > > PG_dirty pages, but for PG_writeback it has no better choices than to > > wait for IO completion. > > > > The difficulty of THP allocation goes up *exponentially* with the > > number of PG_writeback pages. Assume PG_writeback pages are randomly > > distributed in the physical memory space. 
Then we have the formula > > > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 > Well, this implicitly assumes that PG_Writeback pages are scattered > across memory uniformly at random. I'm not sure to what extent this is > true... Yeah, when describing the problem I was also thinking about the possibilities of optimization (it would be a very good general improvement). Or maybe Mel already has some solutions :) > Also as a nitpick, this isn't really an exponential growth since > the exponent is fixed (256 - actually it should be 512, right?). It's just Right, 512 4k pages to form one x86_64 2MB huge page. > a polynomial with a big exponent. But sure, growth in number of PG_Writeback > pages will cause relatively steep drop in the number of available huge > pages. It's exponential indeed, because "1 - p(x)" here means "p(!x)": a 10x increase in x can result in a 100x drop in y. > ... > > It's worth noting that running multiple flusher threads per bdi means > > not only disk seeks for spinning disks and smaller IO size for SSDs, but also > > lock contention and cache bouncing for metadata heavy workloads and > > fast storage. > Well, this heavily depends on particular implementation (and chosen > data structures). But yes, we should have that in mind. The lock contention and cache bouncing actually mainly happen in fs code due to concurrent IO submissions. Also, when replying to Vivek's email I realized that the disk seeks and/or smaller IO size are more fundamentally tied to the split async queues in cfq, which make cfq switch inodes on every async slice time (typically 40ms). > ... > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > kick in to add *double* throttling in block/cfq layer. 
Now the low > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > from its balanced state, leading to large fluctuations and program > > > > stalls. > > > > > > Just do the same 1:1 inside each cgroup. > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > > For example there are only 2 dd tasks doing buffered writes in the > > system. Now consider the mismatch that cfq is dispatching their IO > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > the dd tasks at a 1:1 equal split because it's not aware of the cgroup > > weights. > > > > What will happen in the end? The 1:1 ratio imposed by > > balance_dirty_pages() will take effect and the dd tasks will progress > > at the same pace. The cfq weights will be defeated because the async > > queue for the second dd (and cgroup) constantly runs empty. > Yup. This just shows that you have to have per-cgroup dirty limits. Once > you have those, things start working again. Right. I think Tejun was more or less aware of this. I was rather upset by this per-memcg dirty_limit idea indeed. I never expected it to work well when used extensively. My plan was to set the default memcg dirty_limit high enough, so that it's not hit in normal use. Then Tejun came and proposed to (mis-)use dirty_limit as the way to convert the dirty pages' backpressure into a real dirty throttling rate. No, that's just a crazy idea! Come on, let's not over-use memcg's dirty_limit. It's there as the *last resort* to keep dirty pages under control so as to maintain interactive performance inside the cgroup. However if used extensively in the system (like dozens of memcgs all hitting their dirty limits), the limit itself may stall random dirtiers and create interactive performance issues! In recent days I've come up with the idea of memcg.dirty_setpoint for the blkcg backpressure stuff. We can use that instead. memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. 
Imagine bdi_setpoint. It's all the same concepts. Why do we need this? Because if blkcg A and B have 10:1 weights and are both doing buffered writes, their dirty pages had better be maintained around a 10:1 ratio to avoid underrun and hopefully achieve better IO size. memcg.dirty_limit cannot guarantee that goal. But be warned! Partitioning the dirty pages always means more fluctuations of dirty rates (and even stalls) that are perceivable by the user. That is another limiting factor keeping the backpressure based IO controller from scaling well. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
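As a concrete illustration of the weight mismatch described in this message, and of the memcg.dirty_setpoint idea that closes it, here is a toy sketch (made-up names and numbers, not kernel code):

```python
# Toy model of the two ideas above: (1) a work-conserving weighted
# dispatcher, showing how a 1:1 dirtying split defeats cfq's 10:1
# weights, and (2) per-cgroup dirty setpoints scaled by writeout rate.

def weighted_dispatch(weights, demand, capacity):
    """Water-filling: give each queue its weighted share of capacity;
    a queue that drains early donates its slack to the others."""
    served = [0.0] * len(weights)
    active = [i for i in range(len(weights)) if demand[i] > 0]
    while capacity > 1e-9 and active:
        total_w = sum(weights[i] for i in active)
        grants = {i: capacity * weights[i] / total_w for i in active}
        backlogged = []
        for i in active:
            take = min(grants[i], demand[i] - served[i])
            served[i] += take
            capacity -= take
            if demand[i] - served[i] > 1e-9:
                backlogged.append(i)
        active = backlogged
    return served

def dirty_setpoints(global_setpoint, writeout_rates):
    """Split a global dirty page target across cgroups in proportion
    to each cgroup's measured writeout rate (the setpoint idea above)."""
    total = sum(writeout_rates)
    if not total:
        return [0.0] * len(writeout_rates)
    return [global_setpoint * r / total for r in writeout_rates]

# balance_dirty_pages() unaware of weights: each dd is fed 50 pages/s,
# so the second cgroup's async queue runs empty and 10:1 is defeated:
print(weighted_dispatch([10, 1], [50.0, 50.0], 100.0))   # ~[50, 50]

# With both queues kept full, the 10:1 weights do take effect:
print(weighted_dispatch([10, 1], [1000.0, 1000.0], 100.0))

# Keeping dirty pages near weight-proportional setpoints avoids underrun:
print(dirty_setpoints(110_000, [100.0, 10.0]))   # [100000.0, 10000.0]
```

The water-filling loop is what makes the model work-conserving: any share a drained queue cannot use is regranted to the still-backlogged queues on the next pass, which is exactly why a starved async queue "defeats" the configured weights.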
* Re: [RFC] writeback and cgroup 2012-04-20 13:34 ` Fengguang Wu @ 2012-04-20 19:08 ` Tejun Heo 0 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-20 19:08 UTC (permalink / raw) To: Fengguang Wu Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman Hello, Fengguang. On Fri, Apr 20, 2012 at 09:34:41PM +0800, Fengguang Wu wrote: > > Yup. This just shows that you have to have per-cgroup dirty limits. Once > > you have those, things start working again. > > Right. I think Tejun was more or less aware of this. I'm fairly sure I'm on the "less" side of it. > I was rather upset by this per-memcg dirty_limit idea indeed. I never > expect it to work well when used extensively. My plan was to set the > default memcg dirty_limit high enough, so that it's not hit in normal. > Then Tejun came and proposed to (mis-)use dirty_limit as the way to > convert the dirty pages' backpressure into real dirty throttling rate. > No, that's just crazy idea! I'll tell you what's crazy. We're not gonna cut three more kernel releases and then change jobs. Some of the stuff we put in the kernel ends up staying there for over a decade. While ignoring fundamental designs and violating layering may look like a quick solution, such shortcuts tend to come back and bite our collective asses. Ask Vivek. The iosched / blkcg API was messed up to the extent that bugs were extremely difficult to track down and it was nearly impossible to add new features, let alone a new blkcg policy or elevator, and people suffered for that for a long time. I ended up cleaning up the mess. 
It took me longer than three months and even then we have to carry on with a lot of ugly stuff for compatibility. Unfortunately, your proposed solution is far worse than blkcg was or ever could be. It's not even contained in a single subsystem and it's not even clear what it achieves. Neither weights nor hard limits can be properly enforced without another layer of control at the block layer (some use cases do expect strict enforcement), and we're baking assumptions about use cases, interfaces and underlying hardware across multiple subsystems (some ssds work fine with per-iops switching). For your suggested solution, the moment of its best fit is now, and it'll be a long painful way down until someone snaps and reimplements the whole thing. The kernel is larger than balance_dirty_pages() or writeback. Each subsystem should do what it's supposed to do. Let's solve problems where they belong and pay overheads where they're due. Let's not contort the whole stack for the short term goal of shoving writeback support into the existing, still-developing, blkcg cfq proportional IO implementation. Because that's pure insanity. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup [not found] ` <20120420190844.GH32324-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-04-22 14:46 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-22 14:46 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman Hi Tejun, On Fri, Apr 20, 2012 at 12:08:44PM -0700, Tejun Heo wrote: > Hello, Fengguang. > > On Fri, Apr 20, 2012 at 09:34:41PM +0800, Fengguang Wu wrote: > > > Yup. This just shows that you have to have per-cgroup dirty limits. Once > > > you have those, things start working again. > > > > Right. I think Tejun was more or less aware of this. > > I'm fairly sure I'm on the "less" side of it. OK. Sorry, I should have explained why the memcg dirty limit is not the right tool for back pressure based throttling. To limit memcg dirty pages, two thresholds will be introduced:

  0                 call for flush                    dirty limit
  ------------------------*--------------------------------*-----------------------> memcg dirty pages

1) when dirty pages increase to the "call for flush" point, the memcg will explicitly ask the flusher thread to focus more on this memcg's inodes
2) when the "dirty limit" is reached, the dirtier tasks will be throttled the hard way

When there are few memcgs, or when the safety margin between the two thresholds is large enough, the dirty limit won't be hit and all goes virtually as smoothly as when there are only global dirty limits. Otherwise the memcg dirty limit will occasionally be hit, but should still drop soon when the flusher thread round-robins to this memcg. 
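The two-threshold scheme in the diagram above can be sketched as follows (illustrative names only, not actual kernel interfaces):

```python
# Sketch of the two memcg dirty thresholds described above.
# Function and threshold names are made up for illustration.

def memcg_dirty_action(dirty_pages, call_for_flush, dirty_limit):
    """What to do as a memcg's dirty page count rises."""
    if dirty_pages >= dirty_limit:
        return "throttle"        # 2) dirtiers throttled the hard way
    if dirty_pages >= call_for_flush:
        return "call_for_flush"  # 1) ask the flusher to focus on this memcg
    return "ok"                  # below both thresholds: business as usual

# A wide safety margin between the thresholds keeps the hard limit rare:
print(memcg_dirty_action(450, call_for_flush=400, dirty_limit=800))  # call_for_flush
```

The argument in the surrounding text is precisely that, with many memcgs, the gap between `call_for_flush` and `dirty_limit` shrinks and the "throttle" branch starts firing routinely.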
Basically, the more memcgs have dirty limits, the harder it is for the flusher to serve them fairly and to knock down their dirty pages in time. Because the flusher works inode by inode, each one may take up to 0.5 second, and there may be many memcgs asking for the flusher's attention. Also, the more memcgs there are, the more the global dirty page pool is partitioned into smaller pieces, which means a smaller safety margin for each memcg. Adding these two effects up, there may be constantly some memcgs hitting their dirty limits when there are dozens of memcgs. Hitting the dirty limits means all dirtier tasks, including the light dirtiers who do occasional writes, become painfully slow. It's a bad state that should be avoided by any means. Now consider the back pressure case. When the user has configured two blkcgs with 10:1 weights, the flusher will have great difficulty writing out pages for the latter blkcg. The corresponding memcg's dirty pages rush straight to its dirty limit, _stay_ there and can never drop to normal. This means the latter blkcg's tasks will constantly see second-long stalls. The solution would be to create an adaptive threshold blkcg.bdi.dirty_setpoint that's proportional to its buffered writeout bandwidth and teach balance_dirty_pages() to balance dirty pages around that target. It avoids the worst case of hitting dirty_limit. However it may still present big challenges to balance_dirty_pages(). For example, when there are 10 blkcgs and 12 JBOD disks, it may create up to 10*12=120 dirty balance targets. Wow, I cannot imagine how it's going to fulfill so many different targets. > > I was rather upset by this per-memcg dirty_limit idea indeed. I never > expect it to work well when used extensively. My plan was to set the > default memcg dirty_limit high enough, so that it's not hit in normal. > Then Tejun came and proposed to (mis-)use dirty_limit as the way to > convert the dirty pages' backpressure into real dirty throttling rate. 
> > No, that's just crazy idea! > > I'll tell you what's crazy. > > We're not gonna cut three more kernel releases and then change jobs. > Some of the stuff we put in the kernel ends up staying there for over > a decade. While ignoring fundamental designs and violating layering may > look like a quick solution, such shortcuts tend to come back and bite > our collective asses. Ask Vivek. The iosched / blkcg API was messed > up to the extent that bugs were extremely difficult to track down and it was > nearly impossible to add new features, let alone a new blkcg policy or > elevator, and people suffered for that for a long time. I ended up > cleaning up the mess. It took me longer than three months and even > then we have to carry on with a lot of ugly stuff for compatibility. "block/cfq-iosched.c" 3930L Yeah, it's a big pile of tricky code. Despite that, the code structure still looks pretty neat, kudos to all of you! > Unfortunately, your proposed solution is far worse than blkcg was or > ever could be. It's not even contained in a single subsystem and it's > not even clear what it achieves. Yeah, it crosses subsystems, mainly because there are two natural throttling points: balance_dirty_pages() and cfq. It requires both sides to work properly. In my proposal, balance_dirty_pages() takes care to update the weights for async/direct IO every 200ms and stores them in the blkcg. cfq then grabs the weights to update the cfq group's vdisktime. Such cross-subsystem coordination still looks natural to me because "weight" is a fundamental and general parameter. It's really a blkcg thing (determined by the blkio.weight user interface) rather than something specifically tied to cfq. When another kernel entity (e.g. NFS or noop) decides to add support for proportional weight IO control in the future, it can make use of the weights calculated by balance_dirty_pages(), too. 
That scheme does involve non-trivial complexities in the calculations, however IMHO it sucks much less than letting cfq take control and conveying the information all the way up to balance_dirty_pages() via "backpressure". When balance_dirty_pages() takes part in the job, it merely costs some per-cpu accounting and calculations every 200ms -- both scale pretty well. Virtually nothing changes in how buffered IO is performed before/after applying IO controllers. From the users' perspective:

- No more latency
- No performance drop
- No bumpy progress and stalls
- No need to attach memcg to blkcg
- Feel free to create 1000+ IO controllers, to your heart's content, w/o worrying about costs (if any, it would be some existing scalability issues)

On the other hand, the back pressure scheme makes Linux more clumsy by vectorizing everything from bottom to top, giving rise to a number of problems:

- in cfq, by splitting up the global async queue, cfq suddenly sees a number of cfq groups full of async requests lining up and competing for disk time. This could obscure things and add difficulties in maintaining low latency for sync requests.

- in cfq, it will now be switching inodes based on the 40ms async slice time, which defeats the flusher thread's 500ms inode slice time. The numbers below show the performance cost of lowering the flusher's slices to ~40ms:

      3.4.0-rc2        3.4.0-rc2-4M+
    -----------  ------------------------
         114.02  -4.2%    109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
         102.25  -11.7%    90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
         104.17  -17.5%    85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
         104.94  -18.7%    85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
         104.76  -21.9%    81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

  We can do the optimization of increasing the cfq async time slice when there is no sync IO. However in general cases it could still hurt.

- in cfq, the many more async queues will be holding many more async requests in order to prevent queue underrun. 
This proportionally scales up the number of writeback pages, which in turn exponentially scales up the difficulty of reclaiming high-order pages:

      P(reclaimable for THP) = P(non-PG_writeback)^512

  That means we cannot comfortably use THP in a system with more than 0.1% writeback pages. Perhaps we need to work out some general optimizations to make writeback pages more concentrated in the physical memory space. Besides, when there are N seconds worth of writeback pages, it may take N/2 seconds on average for wait_on_page_writeback() to finish. So the total time cost of running into a random writeback page and waiting on it is also O(n^2):

      E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)

  That means we can hardly keep more than 1 second worth of writeback pages w/o worrying about long waits on PG_writeback in various parts of the kernel.

- in the flusher, we'll need to vectorize the dirty inode lists, that's fine. However we either need to create one flusher per blkcg, which has the problem of intensifying various fs lock contentions, or let a single flusher walk through the blkcgs, which risks more cfq queue underruns. We may decrease the flusher's time slice or increase the queue size to mitigate this, however neither looks like an exciting way.

- balance_dirty_pages() will need to keep each blkcg's dirty pages at a reasonable level, otherwise there may be starvation that defeats the low level IO controllers and hurts IO size. Thus comes the very undesirable need to attach memcg to blkcg to track dirty pages. It's also not fun to work with dozens of dirty page targets because dirty pages tend to fluctuate a lot. In comparison, it's far easier for balance_dirty_pages() to dirty-ratelimit 1000+ dd tasks in the global context.

In summary, the back pressure scheme looks obvious at first sight, however there are some fundamental problems in the way. Cgroups are expected to be *light weight* facilities. 
Unfortunately this scheme will likely present too much burden and too many side effects to the system. It might become uncomfortable for the user to run 10+ blkcgs... > Neither weights nor hard limits can be > properly enforced without another layer of control at the block > layer (some use cases do expect strict enforcement) and we're baking > assumptions about use cases, interfaces and underlying hardware across > multiple subsystems (some ssds work fine with per-iops switching). cfq still has the freedom to do per-iops switching, based on the same weight values computed by balance_dirty_pages(). cfq will need to feed back some "IO cost" stats based on either disk time or iops, upon which balance_dirty_pages() scales the throttling bandwidth for the dirtier tasks by the "IO cost". balance_dirty_pages() can also do IOPS hard limits based on the scaled throttling bandwidth. > For your suggested solution, the moment it's best fit is now and it'll > be a long painful way down until someone snaps and reimplements the > whole thing. > > The kernel is larger than balance_dirty_pages() or writeback. Each > subsystem should do what it's supposed to do. Let's solve problems > where they belong and pay overheads where they're due. Let's not > contort the whole stack for the short term goal of shoving writeback > support into the existing, still-developing, blkcg cfq proportional IO > implementation. Because that's pure insanity. To be frank, I would be very pleased to avoid going into the pains of doing all the hairy computations to graft balance_dirty_pages() onto cfq, if only the back pressure idea were not so upsetting. And if there are proper ways to address its problems, it would be a great relief for me to stop pondering the details of disk time/IOPS feedback and the hierarchical support (yeah, I think it's somehow possible now), and the foreseeable _numerous_ experiments to get the ideas into shape... 
Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
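The two scaling arguments in the message above, P(non-PG_writeback)^512 and the O(n^2) PG_writeback stall cost, can be checked numerically; this sketch assumes the same uniform-random distribution simplification used in the discussion, and the stall-cost constant is made up:

```python
# Numeric check of the scaling arguments above, assuming PG_writeback
# pages are scattered uniformly at random. 512 x 4k base pages make
# one x86_64 2MB huge page.

def thp_reclaimable_prob(writeback_fraction, pages_per_thp=512):
    """P(reclaimable for THP) = P(non-PG_writeback)^512."""
    return (1.0 - writeback_fraction) ** pages_per_thp

def expected_stall(n, hit_prob_per_second=0.001):
    """E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it).
    With n seconds worth of writeback pages, the hit probability grows
    with n and the mean remaining wait is n/2, so the cost is O(n^2).
    The per-second hit probability is an arbitrary illustrative constant."""
    return (hit_prob_per_second * n) * (n / 2.0)

for p in (0.0001, 0.001, 0.01):
    print(f"{p:.2%} writeback pages -> "
          f"{thp_reclaimable_prob(p):.1%} chance a 2MB block is reclaimable")

# Doubling the writeback backlog quadruples the expected stall cost:
print(expected_stall(2.0) / expected_stall(1.0))   # 4.0
```

Running the loop shows the knee claimed in the text: at 0.1% writeback pages roughly 60% of 2MB blocks are still reclaimable, while at 1% almost none are.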
Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup
@ 2012-04-22 14:46 ` Fengguang Wu
  0 siblings, 0 replies; 261+ messages in thread
From: Fengguang Wu @ 2012-04-22 14:46 UTC (permalink / raw)
To: Tejun Heo
Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer,
    linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers,
    cgroups, ctalbott, rni, lsf, Mel Gorman

Hi Tejun,

On Fri, Apr 20, 2012 at 12:08:44PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
>
> On Fri, Apr 20, 2012 at 09:34:41PM +0800, Fengguang Wu wrote:
> > > Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > > you have those, things start working again.
> >
> > Right. I think Tejun was more or less aware of this.
>
> I'm fairly sure I'm on the "less" side of it.

OK. Sorry, I should have explained why the memcg dirty limit is not the
right tool for back pressure based throttling.

To limit memcg dirty pages, two thresholds will be introduced:

  0               call for flush                   dirty limit
  ------------------------*--------------------------------*-----------------------> memcg dirty pages

1) when dirty pages increase to the "call for flush" point, the memcg
   will explicitly ask the flusher thread to focus more on this memcg's
   inodes

2) when the "dirty limit" is reached, the dirtier tasks will be
   throttled the hard way

When there are few memcgs, or when the safety margin between the two
thresholds is large enough, the dirty limit won't be hit and everything
goes virtually as smoothly as when there are only global dirty limits.
Otherwise the memcg dirty limit will occasionally be hit, but the dirty
pages should still drop back soon after the flusher thread round-robins
to this memcg. Basically, the more memcgs have dirty limits, the harder
it is for the flusher to serve them fairly and knock down their dirty
pages in time, because the flusher works inode by inode, each one may
take up to 0.5 second, and there may be many memcgs asking for the
flusher's attention.
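[Editorial note: the two-threshold policy described above can be sketched
as a toy model. All names and values below are hypothetical; the real
logic would live in balance_dirty_pages() and the flusher wakeup path.]

```python
# Hypothetical sketch of the two-threshold memcg dirty-page policy
# described above; the function name, return values and thresholds are
# illustrative only, not a real kernel interface.

def memcg_dirty_action(dirty_pages, call_for_flush, dirty_limit):
    """Decide how to react to a memcg's current dirty page count."""
    if dirty_pages >= dirty_limit:
        # Hard throttling: the dirtier tasks are made to sleep.
        return "throttle"
    if dirty_pages >= call_for_flush:
        # Ask the flusher to prioritize this memcg's inodes.
        return "flush"
    return "ok"

# The safety margin between the two thresholds determines how often the
# hard limit is actually hit.
assert memcg_dirty_action(100, call_for_flush=200, dirty_limit=400) == "ok"
assert memcg_dirty_action(250, call_for_flush=200, dirty_limit=400) == "flush"
assert memcg_dirty_action(400, call_for_flush=200, dirty_limit=400) == "throttle"
```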
Also, the more memcgs there are, the more the global dirty page pool is
partitioned into smaller pieces, which means a smaller safety margin for
each memcg. Adding these two effects up, some memcgs may be constantly
hitting their dirty limits once there are dozens of memcgs. Hitting the
dirty limits means that all dirtier tasks, including the light dirtiers
who do only occasional writes, become painfully slow. It's a bad state
that should be avoided by any means.

Now consider the back pressure case. When the user configures two blkcgs
with 10:1 weights, the flusher will have great difficulty writing out
pages for the latter blkcg. The corresponding memcg's dirty pages rush
straight to its dirty limit, _stay_ there, and can never drop back to
normal. This means the latter blkcg's tasks will constantly see
second-long stalls.

The solution would be to create an adaptive threshold
blkcg.bdi.dirty_setpoint that's proportional to the blkcg's buffered
writeout bandwidth and teach balance_dirty_pages() to balance dirty
pages around that target. That avoids the worst case of hitting
dirty_limit. However it may still present big challenges to
balance_dirty_pages(). For example, with 10 blkcgs and 12 JBOD disks, it
may create up to 10*12=120 dirty balance targets. Wow, I cannot imagine
how it's going to fulfill so many different targets.

> > I was rather upset by this per-memcg dirty_limit idea indeed. I never
> > expected it to work well when used extensively. My plan was to set
> > the default memcg dirty_limit high enough, so that it's not hit in
> > normal operation. Then Tejun came and proposed to (mis-)use
> > dirty_limit as the way to convert the dirty pages' backpressure into
> > a real dirty throttling rate.
While ignoring fundamental designs and violating layers may > look like rendering a quick solution. They tend to come back and bite > our collective asses. Ask Vivek. The iosched / blkcg API was messed > up to the extent that bugs were so difficult to track down and it was > nearly impossible to add new features, let alone new blkcg policy or > elevator and people did suffer for that for long time. I ended up > cleaning up the mess. It took me longer than three months and even > then we have to carry on with a lot of ugly stuff for compatibility. "block/cfq-iosched.c" 3930L Yeah it's a big pile of tricky code. In despite of that, the code structure still looks pretty neat, kudos to all of you! > Unfortunately, your proposed solution is far worse than blkcg was or > ever could be. It's not even contained in a single subsystem and it's > not even clear what it achieves. Yeah it's cross subsystems, mainly due to there are two natural throttling points: balance_dirty_pages() and cfq. It requires both sides to work properly. In my proposal, balance_dirty_pages() takes care to update the weights for async/direct IO on every 200ms and store it in blkcg. cfq then grabs the weights to update the cfq group's vdisktime. Such cross subsystem coordinations still look natural to me because "weight" is a fundamental and general parameter. It's really a blkcg thing (determined by the blkio.weight user interface) rather than specifically tied to cfq. When another kernel entity (eg. NFS or noop) decides to add support for proportional weight IO control in future, it can make use of the weights calculated by balance_dirty_pages(), too. That scheme does involve non-trivial complexities in the calculations, however IMHO sucks much less than let cfq take control and convey the information all the way up to balance_dirty_pages() via "backpressure". 
When balance_dirty_pages() takes part in the job, it merely costs some
per-cpu accounting and a round of calculations every 200ms -- both scale
pretty well. Virtually nothing changes in how buffered IO is performed
before/after applying the IO controllers. From the users' perspective:

- No more latency
- No performance drop
- No bumpy progress and stalls
- No need to attach memcg to blkcg
- Feel free to create 1000+ IO controllers, to your heart's content, w/o
  worrying about costs (if any, they would be some existing scalability
  issues)

On the other hand, the back pressure scheme makes Linux more clumsy by
vectorizing everything from the bottom up, giving rise to a number of
problems:

- in cfq, by splitting up the global async queue, cfq suddenly sees a
  number of cfq groups full of async requests lining up and competing
  for the disk time. This could obscure things and make it harder to
  maintain low latency for sync requests.

- in cfq, it will now be switching inodes based on the 40ms async slice
  time, which defeats the flusher thread's 500ms inode slice time. The
  numbers below show the performance cost of lowering the flusher's
  slices to ~40ms:

           3.4.0-rc2            3.4.0-rc2-4M+
         -----------  ------------------------
            114.02     -4.2%          109.23    snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
            102.25    -11.7%           90.24    snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
            104.17    -17.5%           85.91    snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
            104.94    -18.7%           85.28    snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
            104.76    -21.9%           81.82    snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

  We can do the optimization of increasing the cfq async time slice when
  there is no sync IO. However, in the general case it could still hurt.

- in cfq, the many more async queues will hold many more async requests
  in order to prevent queue underrun.
This proportionally scales up the number of writeback pages, which in
turn exponentially scales up the difficulty of reclaiming high-order
pages:

    P(reclaimable for THP) = P(non-PG_writeback)^512

That means we cannot comfortably use THP in a system with more than 0.1%
writeback pages. Perhaps we need to work out some general optimizations
to make writeback pages more concentrated in the physical memory space.

Besides, when there are N seconds worth of writeback pages, it may take
N/2 seconds on average for wait_on_page_writeback() to finish. So the
total time cost of running into a random writeback page and waiting on
it is also O(N^2):

    E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)

That means we can hardly keep more than 1 second's worth of writeback
pages w/o worrying about long waits on PG_writeback in various parts of
the kernel.

- in the flusher, we'll need to vectorize the dirty inode lists; that's
  fine. However, we either need to create one flusher per blkcg, which
  has the problem of intensifying various fs lock contentions, or let a
  single flusher walk through the blkcgs, which risks more cfq queue
  underruns. We may decrease the flusher's time slice or increase the
  queue size to mitigate this, but neither looks appealing.

- balance_dirty_pages() will need to keep each blkcg's dirty pages at a
  reasonable level, otherwise there may be starvation that defeats the
  low-level IO controllers and hurts IO size. Thus comes the very
  undesirable need to attach memcg to blkcg to track dirty pages. It's
  also no fun to work with dozens of dirty page targets, because dirty
  pages tend to fluctuate a lot. In comparison, it's far easier for
  balance_dirty_pages() to dirty-ratelimit 1000+ dd tasks in the global
  context.

In summary, the back pressure scheme looks obvious at first sight;
however, there are some fundamental problems in its way. Cgroups are
expected to be *light weight* facilities.
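[Editorial note: the THP reclaim formula given earlier can be checked
numerically -- on x86, a 2MB transparent huge page spans 512 base pages,
so a single PG_writeback page in the range blocks reclaim of the whole
candidate block. A minimal check, assuming independent, uniformly
distributed writeback pages:]

```python
# Numeric check of P(reclaimable for THP) = P(non-PG_writeback)^512:
# a 2MB transparent huge page covers 512 4KB base pages, and a single
# PG_writeback page in the range blocks reclaiming the whole block.
# Assumes writeback pages are independently, uniformly distributed.

PAGES_PER_THP = 512

def p_thp_reclaimable(writeback_fraction):
    return (1 - writeback_fraction) ** PAGES_PER_THP

# With 0.1% of pages under writeback, only ~60% of candidate 2MB blocks
# are reclaimable; at 1% it collapses to well under 1%.
assert 0.59 < p_thp_reclaimable(0.001) < 0.61
assert p_thp_reclaimable(0.01) < 0.01
```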
Unfortunately this scheme will likely impose too much burden and too
many side effects on the system. It might become uncomfortable for the
user to run even 10+ blkcgs...

> Neither weight nor hard limit can be
> properly enforced without another layer of control at the block
> layer (some use cases do expect strict enforcement) and we're baking
> assumptions about use cases, interfaces and underlying hardware across
> multiple subsystems (some SSDs work fine with per-iops switching).

cfq still has the freedom to do per-iops switching, based on the same
weight values computed by balance_dirty_pages(). cfq will need to feed
back some "IO cost" stats based on either disk time or iops, upon which
balance_dirty_pages() scales the throttling bandwidth for the dirtier
tasks by the "IO cost". balance_dirty_pages() can also do IOPS hard
limits based on the scaled throttling bandwidth.

> For your suggested solution, the moment it's best fit is now and it'll
> be a long painful way down until someone snaps and reimplements the
> whole thing.
>
> The kernel is larger than balance_dirty_pages() or writeback. Each
> subsystem should do what it's supposed to do. Let's solve problems
> where they belong and pay overheads where they're due. Let's not
> contort the whole stack for the short term goal of shoving writeback
> support into the existing, still-developing, blkcg cfq proportional IO
> implementation. Because that's pure insanity.

To be frank, I would be very pleased to avoid going through the pain of
doing all the hairy computations to graft balance_dirty_pages() onto
cfq, if only the back pressure idea were not so upsetting. And if there
were proper ways to address its problems, it would be a great relief for
me to stop pondering the details of disk time/IOPS feedback and the
hierarchical support (yeah, I think it's somehow possible now), and the
foreseeable _numerous_ experiments needed to get the ideas into shape...
Thanks,
Fengguang

^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup
@ 2012-04-23 16:56 ` Tejun Heo
  0 siblings, 0 replies; 261+ messages in thread
From: Tejun Heo @ 2012-04-23 16:56 UTC (permalink / raw)
To: Fengguang Wu
Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer,
    linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers,
    cgroups, ctalbott, rni, lsf, Mel Gorman

Hello, Fengguang.

On Sun, Apr 22, 2012 at 10:46:49PM +0800, Fengguang Wu wrote:
> OK. Sorry, I should have explained why the memcg dirty limit is not
> the right tool for back pressure based throttling.

I have two questions. Why do we need memcg for this? Writeback
currently works without memcg, right? Why does that change with a
blkcg-aware bdi?

> Basically the more memcgs with dirty limits, the harder it is for
> the flusher to serve them fairly and knock down their dirty pages in
> time. Because the flusher works inode by inode, each one may take up
> to 0.5 second, and there may be many memcgs asking for the flusher's
> attention. Also the more memcgs, the more the global dirty page pool
> is partitioned into smaller pieces, which means a smaller safety
> margin for each memcg. Adding these two effects up, there may
> constantly be some memcgs hitting their dirty limits when there are
> dozens of memcgs.

And how is this different from a machine with less memory? If it is,
why?

> Such cross-subsystem coordination still looks natural to me because
> "weight" is a fundamental and general parameter. It's really a blkcg
> thing (determined by the blkio.weight user interface) rather than
> specifically tied to cfq. When another kernel entity (e.g.
NFS or noop)
> decides to add support for proportional weight IO control in the
> future, it can make use of the weights calculated by
> balance_dirty_pages(), too.

It is not fundamental or natural at all, and it is already made cfq
specific in the devel branch. You seem to think "weight" is somehow a
global concept which everyone can agree on, but it is not. Weight of
what? Is it disk time, bandwidth, iops or something else? cfq deals
primarily with disk time because that makes sense for spinning drives
with a single head. For SSDs with smart enough FTLs, the unit should
probably be iops. For storage technology bottlenecked on bus speed, bw
would make sense.

IIUC, writeback is primarily dealing with abstracted bandwidth which is
applied per-inode, which is fine at that layer, as details like block
allocation aren't and shouldn't be visible there and files (or inodes)
are the level of abstraction. However, this doesn't necessarily
translate easily into the actual underlying IO resource. For devices
with a spindle, seek time dominates, and the same amount of IO may
consume vastly different amounts of disk time, so disk time becomes the
primary resource, not iops or bandwidth. Naturally, people want to
allocate and limit the primary resource, so cfq distributes disk time
across different cgroups as configured.

Your suggested solution is applying the same number - the weight - to
one portion of a mostly arbitrarily split resource using a different
unit. I don't even understand what that achieves. The requirement is to
be able to split the IO resource among cgroups in a configurable way and
enforce the limits established by the configuration, which we're
currently failing to do for async IOs.
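[Editorial note: the point above -- that the same bandwidth can consume
vastly different amounts of disk time on a spinning disk -- can be
illustrated with made-up service-time numbers. The seek cost and
transfer rate below are rough assumptions, not measurements:]

```python
# Illustration (made-up numbers) of why bandwidth and disk time are
# different resources on a spinning disk: a seeky workload consumes far
# more disk time than a sequential one for the same bandwidth.

SEEK_MS = 8.0          # assumed average seek+rotation cost per IO
XFER_MS_PER_MB = 10.0  # assumed ~100 MB/s media transfer rate

def disk_time_ms(mb_transferred, io_size_mb):
    """Total disk time = per-IO positioning cost + media transfer time."""
    num_ios = mb_transferred / io_size_mb
    return num_ios * SEEK_MS + mb_transferred * XFER_MS_PER_MB

sequential = disk_time_ms(100, io_size_mb=1.0)     # large 1MB IOs
seeky      = disk_time_ms(100, io_size_mb=0.004)   # random 4KB IOs

# Same 100MB of "bandwidth", vastly different disk time consumed:
assert seeky / sequential > 100
```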
Your proposed solution applies some arbitrary ratio, according to some
arbitrary interpretation of cfq's IO time weight, way up in the stack.
When propagated to the lower layer, that would cause a significant
amount of delay and fluctuation which behaves completely independently
of how (using what unit, at what granularity and on what time scale) the
actual IO resource is handled, split and accounted. The result would be
something which, at its luckiest moments, has some semblance of
interpreting blkcg.weight as a vague best-effort priority.

So, I don't think your suggested solution is a solution at all. In
fact, I'm not even sure what it achieves at the cost of the gross
layering violation and fundamental design braindamage.

> - No more latency
> - No performance drop
> - No bumpy progress and stalls
> - No need to attach memcg to blkcg
> - Feel free to create 1000+ IO controllers, to your heart's content,
>   w/o worrying about costs (if any, they would be some existing
>   scalability issues)

I'm not sure why memcg suddenly becomes necessary with blkcg, and I
don't think having per-blkcg writeback and reasonable async optimization
in the iosched would be considerably worse. It sure will add some
overhead (e.g. from split buffering), but there will be proper working
isolation, which is what this fuss is all about. Also, I just don't see
how creating 1000+ (relatively active, I presume) blkcgs on a single
spindle would be sane, or how the end result is gonna be significantly
better with your suggested solution, so let's please put aside the silly
non-use case. In terms of overhead, I suspect the biggest would be the
increased buffering coming from the split channels, but that seems like
the cost of business to me.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup
@ 2012-04-23 16:56 ` Tejun Heo
0 siblings, 0 replies; 261+ messages in thread
From: Tejun Heo @ 2012-04-23 16:56 UTC (permalink / raw)
To: Fengguang Wu
Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman

Hello, Fengguang.

On Sun, Apr 22, 2012 at 10:46:49PM +0800, Fengguang Wu wrote:
> OK. Sorry I should have explained why memcg dirty limit is not the
> right tool for back pressure based throttling.

I have two questions. Why do we need memcg for this? Writeback currently works without memcg, right? Why does that change with a blkcg-aware bdi?

> Basically the more memcgs with dirty limits, the more hard time for
> the flusher to serve them fairly and knock down their dirty pages in
> time. Because the flusher works inode by inode, each one may take up
> to 0.5 second, and there may be many memcgs asking for the flusher's
> attention. Also the more memcgs, the global dirty pages pool are
> partitioned into smaller pieces, which means smaller safety margin for
> each memcg. Adding these two effects up, there may be constantly some
> memcgs hitting their dirty limits when there are dozens of memcgs.

And how is this different from a machine with smaller memory? If so, why?

> Such cross subsystem coordinations still look natural to me because
> "weight" is a fundamental and general parameter. It's really a blkcg
> thing (determined by the blkio.weight user interface) rather than
> specifically tied to cfq. When another kernel entity (eg. NFS or noop)
> decides to add support for proportional weight IO control in future,
> it can make use of the weights calculated by balance_dirty_pages(), too.

It is not fundamental and natural at all, and it is already made cfq-specific in the devel branch. You seem to think "weight" is somehow a global concept which everyone can agree on, but it is not. Weight of what? Is it disk time, bandwidth, iops or something else? cfq deals primarily with disk time because that makes sense for spinning drives with a single head. For SSDs with smart enough FTLs, the unit should probably be iops. For storage technology bottlenecked on bus speed, bandwidth would make sense.

IIUC, writeback primarily deals with an abstracted bandwidth applied per-inode, which is fine at that layer: details like block allocation aren't and shouldn't be visible there, and files (or inodes) are the level of abstraction. However, this doesn't necessarily translate easily into the actual underlying IO resource. For devices with a spindle, seek time dominates and the same amount of IO may consume vastly different amounts of disk time, so disk time becomes the primary resource, not iops or bandwidth. Naturally, people want to allocate and limit the primary resource, so cfq distributes disk time across different cgroups as configured.

Your suggested solution applies the same number - the weight - to one portion of a mostly arbitrarily split resource using a different unit. I don't even understand what that achieves. The requirement is to be able to split the IO resource among cgroups in a configurable way and enforce the limits established by the configuration, which we're currently failing to do for async IOs. Your proposed solution applies some arbitrary ratio according to some arbitrary interpretation of cfq IO time weight way up in the stack which, when propagated to the lower layer, would cause a significant amount of delay and fluctuation that behaves completely independently of how (using what unit, in what granularity and on what time scale) the actual IO resource is handled, split and accounted. The result would be something which, at its luckiest moments, has some semblance of interpreting blkcg.weight as a vague best-effort priority.

So, I don't think your suggested solution is a solution at all. I'm in fact not even sure what it achieves at the cost of the gross layering violation and fundamental design braindamage.

> - No more latency
> - No performance drop
> - No bumpy progress and stalls
> - No need to attach memcg to blkcg
> - Feel free to create 1000+ IO controllers, to heart's content
>   w/o worrying about costs (if any, it would be some existing
>   scalability issues)

I'm not sure why memcg suddenly becomes necessary with blkcg, and I don't think having per-blkcg writeback and reasonable async optimization from the iosched would be considerably worse. It sure will add some overhead (e.g. from split buffering) but there will be proper working isolation, which is what this fuss is all about. Also, I just don't see how creating 1000+ (relatively active, I presume) blkcgs on a single spindle would be sane, nor how the end result is gonna be significantly better under your suggested solution, so let's please put aside the silly non-use case.

In terms of overhead, I suspect the biggest would be the increased buffering coming from split channels, but that seems like the cost of business to me.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 261+ messages in thread
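Tejun's point that "weight" has no meaning until a scheduler picks a unit can be illustrated with a toy calculation (hypothetical numbers, not kernel code): the same 2:1 weight split yields very different observed bandwidth splits depending on whether the scheduler divides disk time or raw bandwidth, because a seeky workload burns far more disk time per megabyte.

```python
# Toy illustration (hypothetical numbers, not kernel code): the same
# 2:1 weight produces different bandwidth splits depending on which
# resource the scheduler actually divides.

def bandwidth_split(weights, unit_cost_per_mb):
    """Split one second of the device's primary resource by weight,
    then convert each share back to MB/s using that cgroup's cost
    (e.g. seconds of disk time per MB for seeky vs sequential IO)."""
    total_w = sum(weights)
    return [w / total_w / cost for w, cost in zip(weights, unit_cost_per_mb)]

# cgroup A: sequential writer (cheap per MB); cgroup B: seeky writer
# (expensive per MB).  Weights are 2:1 in both cases.
weights = [2, 1]

# Dividing disk *time* 2:1: the seeky group spends its share on seeks,
# so the observed bandwidth ratio is much wider than 2:1.
by_time = bandwidth_split(weights, unit_cost_per_mb=[0.01, 0.10])  # s/MB
# Dividing *bandwidth* 2:1 directly: the ratio is exactly 2:1 no matter
# how expensive the IO actually is for the device.
by_bw = bandwidth_split(weights, unit_cost_per_mb=[1.0, 1.0])

print(round(by_time[0] / by_time[1], 1))  # 20.0: 2:1 disk time = 20:1 bandwidth
print(round(by_bw[0] / by_bw[1], 1))      # 2.0:  2:1 bandwidth
```

The two interpretations disagree by an order of magnitude for the same configured weights, which is exactly why a weight computed against one unit cannot simply be reused against another.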
* Re: [RFC] writeback and cgroup
@ 2012-04-24 7:58 ` Fengguang Wu
0 siblings, 0 replies; 261+ messages in thread
From: Fengguang Wu @ 2012-04-24 7:58 UTC (permalink / raw)
To: Tejun Heo
Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, vgoyal, Mel Gorman

Hi Tejun,

On Mon, Apr 23, 2012 at 09:56:26AM -0700, Tejun Heo wrote:
> Hello, Fengguang.
>
> On Sun, Apr 22, 2012 at 10:46:49PM +0800, Fengguang Wu wrote:
> > OK. Sorry I should have explained why memcg dirty limit is not the
> > right tool for back pressure based throttling.
>
> I have two questions. Why do we need memcg for this? Writeback
> currently works without memcg, right? Why does that change with blkcg
> aware bdi?

Yeah, currently writeback does not depend on memcg. As for blkcg, it's necessary to keep a number of dirty pages buffered for each blkcg, so that each cfq group's async IO queue does not go empty and lose its turn to do IO. memcg provides the proper infrastructure to account dirty pages.

In a previous email, we had an example of two cgroups with 10:1 weights, each running one dd. They will make two IO pipes, each holding a number of dirty pages. Since cfq grants dd-1 much more IO bandwidth, dd-1's dirty pages are consumed quickly. However balance_dirty_pages(), knowing nothing about cfq's bandwidth divisions, throttles the two dd tasks equally. So dd-1 will be producing dirty pages much more slowly than cfq can consume them. The flusher thus won't send enough dirty pages down to fill the corresponding async IO queue for dd-1.
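The underrun dynamic described above can be sketched with a toy discrete-time model (hypothetical rates and a deliberately simplified work-conserving scheduler, not kernel code): two dirtiers are throttled 1:1 by balance_dirty_pages() while the scheduler wants to drain their queues 10:1.

```python
# Toy discrete-time sketch (hypothetical rates, not kernel code):
# balance_dirty_pages() throttles two dirtiers 1:1 while cfq wants to
# serve their async queues 10:1.  dd-1's queue is constantly underrun,
# so the achieved split collapses to 1:1.

CAP = 1.0                      # disk capability (pages per tick)
produce = [0.5, 0.5]           # balance_dirty_pages(): equal 1:1 throttling
weights = [10, 1]              # cfq's configured 10:1 weights
queue = [0.0, 0.0]             # per-cgroup dirty pages awaiting writeout
served = [0.0, 0.0]

for _ in range(10_000):
    for i in range(2):
        queue[i] += produce[i]
    # cfq splits capacity 10:1, but a group can only use what it has
    # queued; leftover capacity goes to the other group (work conserving).
    share = [CAP * w / sum(weights) for w in weights]
    s0 = min(share[0], queue[0])
    s1 = min(CAP - s0, queue[1])
    queue[0] -= s0; queue[1] -= s1
    served[0] += s0; served[1] += s1

ratio = served[0] / served[1]
print(round(ratio, 2))  # 1.0: the configured 10:1 weights degrade to 1:1
```

dd-1's queue drains to empty every tick, so the scheduler can never hand it more than the 0.5 pages/tick the equal throttling lets it produce; the unused share flows to dd-2 and the 10:1 configuration is lost.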
cfq cannot really give dd-1 more bandwidth share due to the lack of data feed. The end result will be: the two cgroups get a 1:1 bandwidth split honored by balance_dirty_pages(), even though cfq assigns them 10:1 weights.

[ASCII diagram: two dirty-page pipes for dd-1 and dd-2. balance_dirty_pages() splits bandwidth 1:1 at the top while cfq splits it 10:1 at the bottom, so dd-1's pipe is constantly underrun while dd-2's stays full.]

Ideally, the pipes would be sized to match the 10:1 cfq bandwidth split:

[ASCII diagram: dd-1's pipe roughly 10x the width of dd-2's, both kept full of dirty pages.]

Or better, one single pipe :)

[ASCII diagram: a single shared pipe of dirty pages feeding the writeout for both dd-1 and dd-2.]
> > Basically the more memcgs with dirty limits, the more hard time for
> > the flusher to serve them fairly and knock down their dirty pages in
> > time. Because the flusher works inode by inode, each one may take up
> > to 0.5 second, and there may be many memcgs asking for the flusher's
> > attention. Also the more memcgs, the global dirty pages pool are
> > partitioned into smaller pieces, which means smaller safety margin for
> > each memcg. Adding these two effects up, there may be constantly some
> > memcgs hitting their dirty limits when there are dozens of memcgs.
>
> And how is this different from a machine with smaller memory? If so,
> why?

In a small-memory box, dd and the flusher produce/consume dirty pages continuously, so over time the number of dirty pages can remain roughly stable.

[ASCII diagram: dirty pages over time holding flat at the dirty setpoint, below the dirty limit - dd continuously dirties pages while the flusher continuously cleans them.]

However, if it's a large-memory machine whose dirty pages get partitioned among 100 cgroups, the flusher will be serving them in round-robin fashion. For a particular cgroup, the flusher only comes and consumes its dirty pages once every (100 * flusher_slice) seconds. The interval would be 50s for the current 0.5s flusher slice, or 5s if the flusher slice were lowered to 50ms. I'm not sure whether it's practical to decrease the flusher slice for ext4, which, for the sake of write performance and to avoid fragmentation, increases the write chunk size to 128MB internally. For a number of reasons, the flusher's behavior cannot be exactly controlled.
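The round-robin arithmetic above is simple enough to check directly (a trivial sketch; `flusher_slice` is just the per-cgroup flusher time slice from the text):

```python
# The round-robin arithmetic from the paragraph above: with the flusher
# spending one slice per cgroup in turn, each cgroup's dirty pages are
# only drained once every n_cgroups * flusher_slice seconds.

def revisit_interval(n_cgroups, flusher_slice_s):
    return n_cgroups * flusher_slice_s

print(revisit_interval(100, 0.5))   # 50.0 s with the current 0.5 s slice
print(revisit_interval(100, 0.05))  # ~5 s if the slice were cut to 50 ms
```

The cgroup must therefore buffer enough dirty pages to keep producing for the whole revisit interval, which is where the large per-cgroup dirty ranges come from.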
The intervals at which the flusher comes around to each cgroup go up and down, so fairness can only be coarsely assured. The dirty pages for each cgroup will be going up and down irregularly across very large dynamic ranges. Now you should be able to imagine the challenges: avoiding the dirty limit, balancing the dirty pages around the per-cgroup-per-bdi dirty setpoints, and avoiding underruns. When there are 10 cgroups and 12 bdi's, the number of dirty setpoints could explode up to 10*12.

[ASCII diagram: one cgroup's dirty pages sawtoothing over time - dd continuously dirties pages until the count hits the dirty limit and dd stalls, then the count drops sharply each time the flusher comes around to this cgroup, far overshooting the dirty setpoint in both directions.]

> > Such cross subsystem coordinations still look natural to me because
> > "weight" is a fundamental and general parameter. It's really a blkcg
> > thing (determined by the blkio.weight user interface) rather than
> > specifically tied to cfq. When another kernel entity (eg. NFS or noop)
> > decides to add support for proportional weight IO control in future,
> > it can make use of the weights calculated by balance_dirty_pages(), too.
>
> It is not fundamental and natural at all and is already made cfq
> specific in the devel branch. You seem to think "weight" is somehow a
> global concept which everyone can agree on but it is not. Weight of
> what? Is it disktime, bandwidth, iops or something else? cfq deals
> primarily with disktime because that makes sense for spinning drives
> with single head. For SSDs with smart enough FTLs, the unit should
> probably be iops. For storage technology bottlenecked on bus speed,
> bw would make sense.

"Weight" surely is a global concept: it reflects the "importance" the user assigns to a cgroup.
cfq (or NFS, or whatever is on the horizon) then interprets that importance number as disk time, IOPS, bandwidth - whatever semantic best fits the backing storage and workload. blkio.weight will be the "number" shared and interpreted by all IO controller entities, whether cfq, NFS or balance_dirty_pages(). And I can assure you that balance_dirty_pages() will interpret it the _same_ way the underlying cfq/NFS interprets it, via the feedback scheme described below.

> IIUC, writeback is primarily dealing with abstracted bandwidth which
> is applied per-inode, which is fine at that layer as details like
> block allocations isn't and shouldn't be visible there and files (or
> inodes) are the level of abstraction.
>
> However, this doesn't necessarily translate easily into the actual
> underlying IO resource. For devices with spindle, seek time dominates
> and the same amount of IO may consume vastly different amount of IO
> and the disk time becomes the primary resource, not the iops or
> bandwidth. Naturally, people want to allocate and limit the primary
> resource, so cfq distributes disk time across different cgroups as
> configured.

Right. balance_dirty_pages() is always doing dirty throttling wrt. bandwidth, even in your back pressure scheme, isn't it? In this regard, there is nothing fundamentally different between our proposals: both will have to employ some way to convert cfq's disk time or IOPS notion into balance_dirty_pages()'s bandwidth notion. See below for my way of doing the conversion.

> Your suggested solution is applying the same a number - the weight -
> to one portion of a mostly arbitrarily split resource using a
> different unit. I don't even understand what that achieves.

You seem to have missed my stated plan: as the next step, balance_dirty_pages() will get feedback information from cfq to adjust its bandwidth targets accordingly. That information will be

	io_cost = charge / sectors

The charge value is exactly the value computed in cfq_group_served(), which is the slice time or the number of IOs dispatched, depending on the mode cfq is operating in. By dividing the ratelimit by the normalized io_cost, balance_dirty_pages() will automatically arrive at the same weight interpretation as cfq. For example, on spinning disks, it will be able to allocate lower bandwidth to seeky cgroups due to the larger io_cost reported by cfq.

> The requirement is to be able to split IO resource according to
> cgroups in configurable way and enforce the limits established by the
> configuration, which we're currently failing to do for async IOs.
> Your proposed solution applies some arbitrary ratio according to some
> arbitrary interpretation of cfq IO time weight way up in the stack
> which, when propagated to the lower layer, would cause significant
> amount of delay and fluctuation which behaves completely independent
> from how (using what unit, in what granularity and in what time scale)
> actual IO resource is handled, split and accounted, which would result
> in something which probably has some semblance of interpreting
> blkcg.weight as vague best-effort priority at its luckiest moments.

Interestingly, our proposals are once again on the same plane regarding the delays and fluctuations. Due to the long delay between dirtying time and writeout time, the access pattern of the newly generated dirty pages and the access pattern of the under-writeback pages may have diverged. So even if cfq is throttling the stream in proportion to its IO cost, the user on the other side of the pipe (with the long delay) may still see the strange behavior of lower throughput for sequential writes and higher throughput for random writes. Let's accept the fact: it's a natural problem/property of buffered writes. What we can do is aim for _long term_ rate matching.

> So, I don't think your suggested solution is a solution at all.
> I'm in fact not even sure what it achieves at the cost of the gross
> layering violation and fundamental design braindamage.

It doesn't make anything perform better (nor worse). In the face of this challenging problem, both proposals suck. My solution just sucks less, as in the listing below.

> > - No more latency
> > - No performance drop
> > - No bumpy progress and stalls
> > - No need to attach memcg to blkcg
> > - Feel free to create 1000+ IO controllers, to heart's content
> >   w/o worrying about costs (if any, it would be some existing
> >   scalability issues)
>
> I'm not sure why memcg suddenly becomes necessary with blkcg and I
> don't think having per-blkcg writeback and reasonable async
> optimization from iosched would be considerably worse. It sure will
> add some overhead (e.g. from split buffering) but there will be proper
> working isolation which is what this fuss is all about. Also, I just
> don't see how creating 1000+ (relatively active, I presume) blkcgs on
> a single spindle would be sane and how is the end result gonna be
> significantly better for your suggested solution, so let's please put
> aside the silly non-use case.

There are big disk arrays with lots of spindles inside, and arrays of fast SSDs. People may want to create lots of cgroups on them. IO controllers should be made cheap and scalable to meet the demands of our diverse user base, now and in the future.

> In terms of overhead, I suspect the biggest would be the increased
> buffering coming from split channels but that seems like the cost of
> business to me.

I know that the back pressure idea goes back a long way (several years?) and it has kind of become a common agreement that some cost is inevitably incurred for isolation. So I can understand why you keep setting aside the overheads, costs and scalability issues: there seemed to be no other way out. However, here comes a solution that can avoid the partitioning and all the resulting problems, and still provide the isolation.

Thanks,

Fengguang

^ permalink raw reply [flat|nested] 261+ messages in thread
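Fengguang's proposed feedback scheme (io_cost = charge / sectors, with balance_dirty_pages() scaling each cgroup's bandwidth target by weight divided by normalized io_cost) can be sketched as follows. This is a toy model with hypothetical numbers; the helper and the normalization against the mean cost are illustrative assumptions, not the actual kernel implementation.

```python
# Hypothetical sketch of the io_cost feedback described above: cfq
# reports io_cost = charge / sectors per cgroup, and
# balance_dirty_pages() scales each cgroup's bandwidth target by
# weight / normalized io_cost, so a seeky cgroup gets less bandwidth
# for the same weight -- matching cfq's disk-time interpretation.

def bandwidth_targets(total_bw, groups):
    """groups: list of (weight, charge, sectors) tuples, with charge and
    sectors as reported by cfq for the last accounting period."""
    costs = [charge / sectors for _, charge, sectors in groups]
    mean_cost = sum(costs) / len(costs)          # normalization baseline
    # weight divided by the normalized io_cost
    shares = [w / (c / mean_cost) for (w, _, _), c in zip(groups, costs)]
    total = sum(shares)
    return [total_bw * s / total for s in shares]

# Two equal-weight cgroups; the second is seeky and charges 10x the
# disk time per sector written.
targets = bandwidth_targets(100.0, [(500, 10, 1000), (500, 100, 1000)])
print([round(t, 1) for t in targets])  # [90.9, 9.1]
```

With equal weights, the bandwidth targets end up inversely proportional to the per-cgroup io_cost (10:1 here), which is how balance_dirty_pages() would come to "interpret" the weights the same way cfq does.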
> > - No more latency > > - No performance drop > > - No bumpy progress and stalls > > - No need to attach memcg to blkcg > > - Feel free to create 1000+ IO controllers, to heart's content > > w/o worrying about costs (if any, it would be some existing > > scalability issues) > > I'm not sure why memcg suddenly becomes necessary with blkcg and I > don't think having per-blkcg writeback and reasonable async > optimization from iosched would be considerably worse. It sure will > add some overhead (e.g. from split buffering) but there will be proper > working isolation which is what this fuss is all about. Also, I just > don't see how creating 1000+ (relatively active, I presume) blkcgs on > a single spindle would be sane and how is the end result gonna be > significantly better for your suggested solution, so let's please put > aside the silly non-use case. There are big disk arrays with lots of spindles inside, or arrays of fast SSDs. People may want to create lots of cgroups on them. IO controllers should be made cheap and scalable to meet the demands from our variety user base, now and future. > In terms of overhead, I suspect the biggest would be the increased > buffering coming from split channels but that seems like the cost of > business to me. I know that the back pressure idea actually come a long way (several years?) and it's kind of become a common agreement that there will be inevitable costs incur to the isolation. So I can understand why you keep ignoring all the overheads, costs and scalability issues because there seems no other way out. However here comes the solution that can magically avoid the partition and all the resulted problems, and still be able to provide the isolation. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup
@ 2012-04-24  7:58 ` Fengguang Wu
  0 siblings, 0 replies; 261+ messages in thread
From: Fengguang Wu @ 2012-04-24  7:58 UTC (permalink / raw)
To: Tejun Heo
Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer,
    linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers,
    cgroups, ctalbott, rni, lsf, Mel Gorman

Hi Tejun,

On Mon, Apr 23, 2012 at 09:56:26AM -0700, Tejun Heo wrote:
> Hello, Fengguang.
>
> On Sun, Apr 22, 2012 at 10:46:49PM +0800, Fengguang Wu wrote:
> > OK. Sorry I should have explained why memcg dirty limit is not the
> > right tool for back pressure based throttling.
>
> I have two questions. Why do we need memcg for this? Writeback
> currently works without memcg, right? Why does that change with blkcg
> aware bdi?

Yeah, currently writeback does not depend on memcg. As for blkcg, it's
necessary to keep a number of dirty pages for each blkcg, so that the
cfq groups' async IO queue does not go empty and lose its turn to do
IO. memcg provides the proper infrastructure to account dirty pages.

In a previous email, we have an example of two 10:1 weight cgroups,
each running one dd. They will make two IO pipes, each holding a number
of dirty pages. Since cfq grants dd-1 much more IO bandwidth, dd-1's
dirty pages are consumed quickly. However balance_dirty_pages(),
without knowing about cfq's bandwidth divisions, is throttling the
two dd tasks equally. So dd-1 will be producing dirty pages much
more slowly than cfq is consuming them. The flusher thus won't send
enough dirty pages down to fill the corresponding async IO queue for
dd-1, and cfq cannot really give dd-1 more bandwidth share due to lack
of data feed. The end result will be: the two cgroups get a 1:1
bandwidth share as enforced by balance_dirty_pages(), even though cfq
honors their 10:1 weights.
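[Editor's note: the underrun described above can be seen in a toy model. This is illustrative Python, not kernel code; the rates and the two-queue loop are hypothetical stand-ins for balance_dirty_pages() throttling and cfq draining.]

```python
def realized_split(produce_rates, drain_rates, steps=1000):
    """Toy model of two dirty-page pipes.

    Each dd produces dirty pages at the rate balance_dirty_pages()
    allows it; the IO scheduler drains each queue at up to its weighted
    rate, but can only drain pages that are actually there.  Returns
    total pages drained per pipe.
    """
    queues = [0.0, 0.0]
    drained = [0.0, 0.0]
    for _ in range(steps):
        for i in range(len(queues)):
            queues[i] += produce_rates[i]          # throttled dirtier feed
            take = min(queues[i], drain_rates[i])  # can't drain an empty queue
            queues[i] -= take
            drained[i] += take
    return drained

# Equal throttling (1:1 feed) against a 10:1 drain capability:
d1, d2 = realized_split(produce_rates=[1.0, 1.0], drain_rates=[10.0, 1.0])
print(d1 / d2)  # ~1.0 -- the realized bandwidth split is 1:1, not 10:1
```

With a 10:1 feed matching the drain weights, the same model realizes the 10:1 split, which is the point of feeding weight information back into the throttling.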
1:1 balance_dirty_pages() bandwidth split

    [ASCII diagram: balance_dirty_pages() feeds the dd-1 and dd-2
     pipes 1:1 at the top, while cfq drains them 10:1 at the bottom;
     dd-2's side stays full of dirty pages (*) while dd-1's side is
     constantly underrun.]

10:1 cfq bandwidth split                         [*] dirty pages

Ideally is,

[                        dd-1                       | dd-2]
|                                                   |     |
|                                                   |     |
|                                                   |     |
|***************************************************|*****|
|***************************************************|*****|
|***************************************************|*****|
|***************************************************|*****|

Or better, one single pipe :)

[                        dd-1                       | dd-2]
|                                                         |
|                                                         |
|                                                         |
|*********************************************************|
|*********************************************************|
|*********************************************************|
|*********************************************************|
|*********************************************************|

> > Basically the more memcgs with dirty limits, the more hard time for
> > the flusher to serve them fairly and knock down their dirty pages in
> > time. Because the flusher works inode by inode, each one may take up
> > to 0.5 second, and there may be many memcgs asking for the flusher's
> > attention. Also the more memcgs, the global dirty pages pool are
> > partitioned into smaller pieces, which means smaller safety margin for
> > each memcg. Adding these two effects up, there may be constantly some
> > memcgs hitting their dirty limits when there are dozens of memcgs.
>
> And how is this different from a machine with smaller memory? If so,
> why?

In a small memory box, dd and the flusher produce/consume dirty pages
continuously, so that over time the number of dirty pages can remain
roughly stable.

    ^ dirty pages
    |
    + dirty limit
    |
    |           | dd continuously dirtying pages
    | dirty setpoint
    v
    +*******************************************************************
    |           |
    |           v flusher continuously cleaning pages
    |
    +--------------------------------------------------------------------> time

However if it's a large memory machine whose dirty pages get
partitioned to 100 cgroups, the flusher will be serving them in round
robin fashion. For a particular cgroup, the flusher only comes and
consumes its dirty pages once every (100*flusher_slice) seconds. The
interval would be 50s for the current 0.5s flusher slice, or 5s if
lowering the flusher slice to 50ms. I'm not sure whether it's
practical to decrease the flusher slice for ext4, which, for the sake
of write performance and to avoid fragmentation, increases the write
chunk size to 128MB internally.

For a number of reasons, the flusher's behavior cannot be exactly
controlled. The intervals at which the flusher comes to each cgroup go
up and down, so fairness can only be coarsely assured. The dirty pages
for each cgroup will be going up and down irregularly across very
large dynamic ranges.
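[Editor's note: the 50s/5s figures above are just (number of cgroups) x (flusher slice). A trivial sketch of the arithmetic, assuming a perfect round robin that spends one full slice per cgroup:]

```python
def flusher_revisit_interval(n_cgroups, flusher_slice):
    """Worst-case seconds between two flusher visits to one cgroup,
    assuming a round-robin flusher that spends one full slice on each
    cgroup's inodes before moving on."""
    return n_cgroups * flusher_slice

print(flusher_revisit_interval(100, 0.5))   # 50.0 -- current 0.5s slice
print(flusher_revisit_interval(100, 0.05))  # ~5s with a 50ms slice
```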
Now you should be able to imagine the challenges: to avoid hitting the
dirty limit, to balance the dirty pages around the per-cgroup-per-bdi
dirty setpoints, and to avoid underruns. When there are 10 cgroups and
12 bdi's, the number of dirty setpoints could explode to 10*12 = 120.

    ^ dirty pages
    |            dd continuously dirtying pages
    + dirty limit            dd stalled
    |
    [ASCII graph: the cgroup's dirty page count ramps up toward the
     dirty limit while dd dirties pages, stalls at the limit, then
     drops steeply each time the flusher comes around to this cgroup,
     repeating as a large sawtooth around the dirty setpoint.]
    |
    +--------------------------------------------------------------------> time

> > Such cross subsystem coordinations still look natural to me because
> > "weight" is a fundamental and general parameter. It's really a blkcg
> > thing (determined by the blkio.weight user interface) rather than
> > specifically tied to cfq. When another kernel entity (eg. NFS or noop)
> > decides to add support for proportional weight IO control in future,
> > it can make use of the weights calculated by balance_dirty_pages(), too.
>
> It is not fundamental and natural at all and is already made cfq
> specific in the devel branch. You seem to think "weight" is somehow a
> global concept which everyone can agree on but it is not. Weight of
> what? Is it disktime, bandwidth, iops or something else? cfq deals
> primarily with disktime because that makes sense for spinning drives
> with single head. For SSDs with smart enough FTLs, the unit should
> probably be iops. For storage technology bottlenecked on bus speed,
> bw would make sense.

"Weight" surely is a global concept that reflects the "importance" the
user assigns to that cgroup. cfq (or NFS, or whatever is on the
horizon) then interprets that importance number as disk time, IOPS,
bandwidth, or whatever semantic best fits the backing storage and
workload. blkio.weight will be the "number" shared and interpreted by
all IO controller entities, whether it be cfq, NFS or
balance_dirty_pages().
And I can assure you that balance_dirty_pages() will be interpreting
it the _same_ way the underlying cfq/NFS interprets it, via the
feedback scheme described below.

> IIUC, writeback is primarily dealing with abstracted bandwidth which
> is applied per-inode, which is fine at that layer as details like
> block allocations isn't and shouldn't be visible there and files (or
> inodes) are the level of abstraction.
>
> However, this doesn't necessarily translate easily into the actual
> underlying IO resource. For devices with spindle, seek time dominates
> and the same amount of IO may consume vastly different amount of IO
> and the disk time becomes the primary resource, not the iops or
> bandwidth. Naturally, people want to allocate and limit the primary
> resource, so cfq distributes disk time across different cgroups as
> configured.

Right. balance_dirty_pages() is always doing dirty throttling wrt.
bandwidth, even in your back pressure scheme, isn't it? In this
regard, there is nothing fundamentally different between our
proposals. They will both employ some way to convert cfq's disk time
or IOPS notion into balance_dirty_pages()'s bandwidth notion. See
below for my way of conversion.

> Your suggested solution is applying the same a number - the weight -
> to one portion of a mostly arbitrarily split resource using a
> different unit. I don't even understand what that achieves.

You seem to miss my stated plan: as the next step,
balance_dirty_pages() will get some feedback information from cfq to
adjust its bandwidth targets accordingly. That information will be

        io_cost = charge/sectors

The charge value is exactly the value computed in cfq_group_served(),
which is the slice time or the IOs dispatched, depending on the mode
cfq is operating in. By dividing the ratelimit by the normalized
io_cost, balance_dirty_pages() will automatically get the same weight
interpretation as cfq.
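[Editor's note: a rough sketch of what that feedback could look like. This is a toy model; the function name and the normalization step are illustration, not the actual balance_dirty_pages()/cfq code.]

```python
def weight_aware_ratelimit(base_ratelimit, charge, sectors, avg_cost):
    """Scale a cgroup's dirty ratelimit by the inverse of its
    normalized per-sector IO cost.

    charge   -- what cfq accounted to the group (slice time, or IOs
                dispatched, depending on cfq's mode), as in
                cfq_group_served()
    sectors  -- sectors the group transferred while being charged
    avg_cost -- device-wide average charge per sector, so an average
                stream normalizes to io_cost 1.0
    """
    io_cost = charge / sectors            # cost per sector for this group
    normalized = io_cost / avg_cost       # 1.0 == device-average cost
    return base_ratelimit / normalized

# A seeky group whose IO costs 4x the device average gets 1/4 the rate:
print(weight_aware_ratelimit(100.0, charge=8.0, sectors=2.0, avg_cost=1.0))  # 25.0
```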
For example, on spinning disks it will be able to allocate lower
bandwidth to seeky cgroups, due to the larger io_cost reported by cfq.

> The requirement is to be able to split IO resource according to
> cgroups in configurable way and enforce the limits established by the
> configuration, which we're currently failing to do for async IOs.
> Your proposed solution applies some arbitrary ratio according to some
> arbitrary interpretation of cfq IO time weight way up in the stack
> which, when propagated to the lower layer, would cause significant
> amount of delay and fluctuation which behaves completely independent
> from how (using what unit, in what granularity and in what time scale)
> actual IO resource is handled, split and accounted, which would result
> in something which probably has some semblance of interpreting
> blkcg.weight as vague best-effort priority at its luckiest moments.

Interestingly, our proposals are once again on the same plane
regarding the delays and fluctuations. Due to the long delays between
dirty and writeout time, the access pattern of the newly generated
dirty pages and the access pattern of the under-writeback pages may
have diverged. So even if cfq is throttling the stream proportionally
to its IO cost, the user on the other side of the pipe (with its long
delay) may still see the strange behavior of lower throughput for
sequential writes and higher throughput for random writes. Let's
accept the fact: it's a natural problem/property of buffered writes.
What we can do is to aim for _long term_ rate matching.

> So, I don't think your suggested solution is a solution at all. I'm
> in fact not even sure what it achieves at the cost of the gross
> layering violation and fundamental design braindamage.

It doesn't make anything perform better (or worse). In the face of
this challenging problem, both proposals suck. My solution just sucks
less, as in the listing below.
> > - No more latency
> > - No performance drop
> > - No bumpy progress and stalls
> > - No need to attach memcg to blkcg
> > - Feel free to create 1000+ IO controllers, to heart's content
> >   w/o worrying about costs (if any, it would be some existing
> >   scalability issues)
>
> I'm not sure why memcg suddenly becomes necessary with blkcg and I
> don't think having per-blkcg writeback and reasonable async
> optimization from iosched would be considerably worse. It sure will
> add some overhead (e.g. from split buffering) but there will be proper
> working isolation which is what this fuss is all about. Also, I just
> don't see how creating 1000+ (relatively active, I presume) blkcgs on
> a single spindle would be sane and how is the end result gonna be
> significantly better for your suggested solution, so let's please put
> aside the silly non-use case.

There are big disk arrays with lots of spindles inside, and arrays of
fast SSDs. People may want to create lots of cgroups on them. IO
controllers should be made cheap and scalable to meet the demands of
our varied user base, now and in the future.

> In terms of overhead, I suspect the biggest would be the increased
> buffering coming from split channels but that seems like the cost of
> business to me.

I know that the back pressure idea actually comes a long way (several
years?), and it has kind of become a common agreement that inevitable
costs will be incurred for the isolation. So I can understand why you
keep ignoring all the overheads, costs and scalability issues: there
seems to be no other way out. However, here comes a solution that can
avoid the partitioning and all the resulting problems, and still
provide the isolation.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup
  2012-04-24  7:58 ` Fengguang Wu
  (?)
@ 2012-04-25 15:47 ` Tejun Heo
  -1 siblings, 0 replies; 261+ messages in thread
From: Tejun Heo @ 2012-04-25 15:47 UTC (permalink / raw)
To: Fengguang Wu
Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
    linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel,
    cgroups, vgoyal, Mel Gorman

Hey, Fengguang.

On Tue, Apr 24, 2012 at 03:58:53PM +0800, Fengguang Wu wrote:
> > I have two questions. Why do we need memcg for this? Writeback
> > currently works without memcg, right? Why does that change with blkcg
> > aware bdi?
>
> Yeah currently writeback does not depend on memcg. As for blkcg, it's
> necessary to keep a number of dirty pages for each blkcg, so that the
> cfq groups' async IO queue does not go empty and lose its turn to do
> IO. memcg provides the proper infrastructure to account dirty pages.
>
> In a previous email, we have an example of two 10:1 weight cgroups,
> each running one dd. They will make two IO pipes, each holding a number
> of dirty pages. Since cfq honors dd-1 much more IO bandwidth, dd-1's
> dirty pages are consumed quickly. However balance_dirty_pages(),
> without knowing about cfq's bandwidth divisions, is throttling the
> two dd tasks equally. So dd-1 will be producing dirty pages much
> slower than cfq is consuming them. The flusher thus won't send enough
> dirty pages down to fill the corresponding async IO queue for dd-1.
> cfq cannot really give dd-1 more bandwidth share due to lack of data
> feed. The end result will be: the two cgroups get 1:1 bandwidth share
> honored by balance_dirty_pages() even though cfq honors 10:1 weights
> to them.
My question is why can't a cgroup-bdi pair be handled the same or
similar way each bdi is handled now? I haven't looked through the code
yet but something is determining, even inadvertently, the dirty memory
usage among different bdi's, right? What I'm curious about is why
cgroupfying bdi makes any difference to that. If it's indeterministic
w/o memcg, let it be that way with blkcg too. Just treat cgroup-bdis
as separate bdis. So, what changes?

> However if it's a large memory machine whose dirty pages get
> partitioned to 100 cgroups, the flusher will be serving them
> in round robin fashion.

Just treat a cgroup-bdi as a separate bdi. Run an independent flusher
on it. They're separate channels.

> blkio.weight will be the "number" shared and interpreted by all IO
> controller entities, whether it be cfq, NFS or balance_dirty_pages().

It already isn't. blk-throttle is an IO controller entity but doesn't
make use of weight.

> > However, this doesn't necessarily translate easily into the actual
> > underlying IO resource. For devices with spindle, seek time dominates
> > and the same amount of IO may consume vastly different amount of IO
> > and the disk time becomes the primary resource, not the iops or
> > bandwidth. Naturally, people want to allocate and limit the primary
> > resource, so cfq distributes disk time across different cgroups as
> > configured.
>
> Right. balance_dirty_pages() is always doing dirty throttling wrt.
> bandwidth, even in your back pressure scheme, isn't it? In this regard,
> there are nothing fundamentally different between our proposals. They

If balance_dirty_pages() fails to keep the IO buffer full, it's
balance_dirty_pages()'s failure (and doing so from time to time could
be fine given enough benefits), but no matter what writeback does,
blkcg *should* enforce the configured limits, so they're quite
different in terms of encapsulation and functionality.
> > Your suggested solution is applying the same a number - the weight -
> > to one portion of a mostly arbitrarily split resource using a
> > different unit. I don't even understand what that achieves.
>
> You seem to miss my stated plan: next step, balance_dirty_pages() will
> get some feedback information from cfq to adjust its bandwidth targets
> accordingly. That information will be
>
>         io_cost = charge/sectors
>
> The charge value is exactly the value computed in cfq_group_served(),
> which is the slice time or IOs dispatched depending the mode cfq is
> operating in. By dividing ratelimit by the normalized io_cost,
> balance_dirty_pages() will automatically get the same weight
> interpretation as cfq. For example, on spin disks, it will be able to
> allocate lower bandwidth to seeky cgroups due to the larger io_cost
> reported by cfq.

So, cfq is basing its cost calculation on disk time spent by sync IOs,
which fluctuates with uncategorized async IOs, and you're gonna apply
that number to async IOs in some magical way? What the hell does that
achieve?

Please take a step back and look at the whole stack and think about
what each part is supposed to do and how they are supposed to
interact. If you still can't see the mess you're trying to make,
ummm... I don't know.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-24 7:58 ` Fengguang Wu @ 2012-04-25 15:47 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-25 15:47 UTC (permalink / raw) To: Fengguang Wu Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman Hey, Fengguang. On Tue, Apr 24, 2012 at 03:58:53PM +0800, Fengguang Wu wrote: > > I have two questions. Why do we need memcg for this? Writeback > > currently works without memcg, right? Why does that change with blkcg > > aware bdi? > > Yeah currently writeback does not depend on memcg. As for blkcg, it's > necessary to keep a number of dirty pages for each blkcg, so that the > cfq groups' async IO queue does not go empty and lose its turn to do > IO. memcg provides the proper infrastructure to account dirty pages. > > In a previous email, we have an example of two 10:1 weight cgroups, > each running one dd. They will make two IO pipes, each holding a number > of dirty pages. Since cfq honors dd-1 much more IO bandwidth, dd-1's > dirty pages are consumed quickly. However balance_dirty_pages(), > without knowing about cfq's bandwidth divisions, is throttling the > two dd tasks equally. So dd-1 will be producing dirty pages much > slower than cfq is consuming them. The flusher thus won't send enough > dirty pages down to fill the corresponding async IO queue for dd-1. > cfq cannot really give dd-1 more bandwidth share due to lack of data > feed. The end result will be: the two cgroups get 1:1 bandwidth share > honored by balance_dirty_pages() even though cfq honors 10:1 weights > to them. My question is why can't cgroup-bdi pair be handled the same or similar way each bdi is handled now? I haven't looked through the code yet but something is determining, even inadvertently, the dirty memory usage among different bdi's, right? 
What I'm curious about is why cgroupfying bdi makes any different to that. If it's indeterministic w/o memcg, let it be that way with blkcg too. Just treat cgroup-bdi as separate bdis. So, what changes? > However if it's a large memory machine whose dirty pages get > partitioned to 100 cgroups, the flusher will be serving them > in round robin fashion. Just treat cgroup-bdi as a separate bdi. Run an independent flusher on it. They're separate channels. > blkio.weight will be the "number" shared and interpreted by all IO > controller entities, whether it be cfq, NFS or balance_dirty_pages(). It already isn't. blk-throttle is an IO controller entity but doesn't make use of weight. > > However, this doesn't necessarily translate easily into the actual > > underlying IO resource. For devices with spindle, seek time dominates > > and the same amount of IO may consume vastly different amount of IO > > and the disk time becomes the primary resource, not the iops or > > bandwidth. Naturally, people want to allocate and limit the primary > > resource, so cfq distributes disk time across different cgroups as > > configured. > > Right. balance_dirty_pages() is always doing dirty throttling wrt. > bandwidth, even in your back pressure scheme, isn't it? In this regard, > there are nothing fundamentally different between our proposals. They If balance_dirty_pages() fails to keep the IO buffer full, it's balance_dirty_pages()'s failure (and doing so from time to time could be fine given enough benefits), but no matter what writeback does, blkcg *should* enforce the configured limits, so they're quite different in terms of encapsulation and functionality. > > Your suggested solution is applying the same a number - the weight - > > to one portion of a mostly arbitrarily split resource using a > > different unit. I don't even understand what that achieves. 
> > You seem to miss my stated plan: next step, balance_dirty_pages() will > get some feedback information from cfq to adjust its bandwidth targets > accordingly. That information will be > > io_cost = charge/sectors > > The charge value is exactly the value computed in cfq_group_served(), > which is the slice time or IOs dispatched depending the mode cfq is > operating in. By dividing ratelimit by the normalized io_cost, > balance_dirty_pages() will automatically get the same weight > interpretation as cfq. For example, on spin disks, it will be able to > allocate lower bandwidth to seeky cgroups due to the larger io_cost > reported by cfq. So, cfq is basing its cost calculation on disk time spent by sync IOs which gets fluctuated by uncategorized async IOs and you're gonna apply that number to async IOs in some magical way? What the hell does that achieve? Please take a step back and look at the whole stack and think about what each part is supposed to do and how they are supposed to interact. If you still can't see the mess you're trying to make, ummm... I don't know. Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-25 15:47 ` Tejun Heo 0 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-25 15:47 UTC (permalink / raw) To: Fengguang Wu Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman Hey, Fengguang. On Tue, Apr 24, 2012 at 03:58:53PM +0800, Fengguang Wu wrote: > > I have two questions. Why do we need memcg for this? Writeback > > currently works without memcg, right? Why does that change with blkcg > > aware bdi? > > Yeah currently writeback does not depend on memcg. As for blkcg, it's > necessary to keep a number of dirty pages for each blkcg, so that the > cfq groups' async IO queue does not go empty and lose its turn to do > IO. memcg provides the proper infrastructure to account dirty pages. > > In a previous email, we have an example of two 10:1 weight cgroups, > each running one dd. They will make two IO pipes, each holding a number > of dirty pages. Since cfq honors dd-1 much more IO bandwidth, dd-1's > dirty pages are consumed quickly. However balance_dirty_pages(), > without knowing about cfq's bandwidth divisions, is throttling the > two dd tasks equally. So dd-1 will be producing dirty pages much > slower than cfq is consuming them. The flusher thus won't send enough > dirty pages down to fill the corresponding async IO queue for dd-1. > cfq cannot really give dd-1 more bandwidth share due to lack of data > feed. The end result will be: the two cgroups get 1:1 bandwidth share > honored by balance_dirty_pages() even though cfq honors 10:1 weights > to them. My question is why can't cgroup-bdi pair be handled the same or similar way each bdi is handled now? I haven't looked through the code yet but something is determining, even inadvertently, the dirty memory usage among different bdi's, right? What I'm curious about is why cgroupfying bdi makes any different to that. 
If it's indeterministic w/o memcg, let it be that way with blkcg too. Just treat cgroup-bdi as separate bdis. So, what changes? > However if it's a large memory machine whose dirty pages get > partitioned to 100 cgroups, the flusher will be serving them > in round robin fashion. Just treat cgroup-bdi as a separate bdi. Run an independent flusher on it. They're separate channels. > blkio.weight will be the "number" shared and interpreted by all IO > controller entities, whether it be cfq, NFS or balance_dirty_pages(). It already isn't. blk-throttle is an IO controller entity but doesn't make use of weight. > > However, this doesn't necessarily translate easily into the actual > > underlying IO resource. For devices with spindle, seek time dominates > > and the same amount of IO may consume vastly different amount of IO > > and the disk time becomes the primary resource, not the iops or > > bandwidth. Naturally, people want to allocate and limit the primary > > resource, so cfq distributes disk time across different cgroups as > > configured. > > Right. balance_dirty_pages() is always doing dirty throttling wrt. > bandwidth, even in your back pressure scheme, isn't it? In this regard, > there are nothing fundamentally different between our proposals. They If balance_dirty_pages() fails to keep the IO buffer full, it's balance_dirty_pages()'s failure (and doing so from time to time could be fine given enough benefits), but no matter what writeback does, blkcg *should* enforce the configured limits, so they're quite different in terms of encapsulation and functionality. > > Your suggested solution is applying the same a number - the weight - > > to one portion of a mostly arbitrarily split resource using a > > different unit. I don't even understand what that achieves. > > You seem to miss my stated plan: next step, balance_dirty_pages() will > get some feedback information from cfq to adjust its bandwidth targets > accordingly. 
That information will be > > io_cost = charge/sectors > > The charge value is exactly the value computed in cfq_group_served(), > which is the slice time or IOs dispatched depending on the mode cfq is > operating in. By dividing ratelimit by the normalized io_cost, > balance_dirty_pages() will automatically get the same weight > interpretation as cfq. For example, on spin disks, it will be able to > allocate lower bandwidth to seeky cgroups due to the larger io_cost > reported by cfq. So, cfq is basing its cost calculation on disk time spent by sync IOs, which fluctuates with uncategorized async IOs, and you're gonna apply that number to async IOs in some magical way? What the hell does that achieve? Please take a step back and look at the whole stack and think about what each part is supposed to do and how they are supposed to interact. If you still can't see the mess you're trying to make, ummm... I don't know. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
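For reference, the feedback loop Fengguang describes above can be sketched numerically (an illustrative sketch only; the function and variable names below are invented for this example, not kernel code): cfq would report a per-cgroup io_cost = charge / sectors, and balance_dirty_pages() would divide each cgroup's dirty ratelimit by the normalized cost, so a seeky cgroup that burns more disk time per sector gets a lower dirty-rate target.

```python
# Hypothetical sketch of the proposed io_cost feedback.
# All names are invented for illustration; this is not kernel code.

def io_cost(charge, sectors):
    """charge: disk time (or IOs) accounted per cgroup by cfq;
    sectors: sectors transferred. Seeky groups cost more per sector."""
    return charge / sectors

def adjusted_ratelimit(base_ratelimit, cost, avg_cost):
    """Scale a cgroup's dirty ratelimit down by its normalized IO cost,
    so balance_dirty_pages() mirrors cfq's weight interpretation."""
    return base_ratelimit / (cost / avg_cost)

# Two cgroups moving the same number of sectors, but one is seeky and
# burns 8x the disk time, so it should be throttled to 1/8 the rate.
seq_cost = io_cost(charge=100.0, sectors=1000)    # 0.1 time units/sector
seeky_cost = io_cost(charge=800.0, sectors=1000)  # 0.8 time units/sector
avg = (seq_cost + seeky_cost) / 2

print(adjusted_ratelimit(100.0, seq_cost, avg))    # sequential: higher rate
print(adjusted_ratelimit(100.0, seeky_cost, avg))  # seeky: lower rate
```

The 8:1 cost ratio translates directly into an 8:1 dirty-ratelimit split, which is the "same weight interpretation as cfq" claim being debated.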
* Re: [RFC] writeback and cgroup 2012-04-20 13:34 ` Fengguang Wu @ 2012-04-23 9:14 ` Jan Kara 0 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-23 9:14 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo, linux-fsdevel, vgoyal, Mel Gorman On Fri 20-04-12 21:34:41, Wu Fengguang wrote: > On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote: > > > It's not uncommon for me to see filesystems sleep on PG_writeback > > > pages during heavy writeback, within some lock or transaction, which in > > > turn stall many tasks that try to do IO or merely dirty some page in > > > memory. Random writes are especially susceptible to such stalls. The > > > stable page feature also vastly increase the chances of stalls by > > > locking the writeback pages. > > > > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > > > the case of direct reclaim, it means blocking random tasks that are > > > allocating memory in the system. > > > > > > PG_writeback pages are much worse than PG_dirty pages in that they are > > > not movable. This makes a big difference for high-order page allocations. > > > To make room for a 2MB huge page, vmscan has the option to migrate > > > PG_dirty pages, but for PG_writeback it has no better choices than to > > > wait for IO completion. > > > > > > The difficulty of THP allocation goes up *exponentially* with the > > > number of PG_writeback pages. Assume PG_writeback pages are randomly > > > distributed in the physical memory space. 
Then we have formula > > > > > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 > > Well, this implicitly assumes that PG_Writeback pages are scattered > > across memory uniformly at random. I'm not sure to which extent this is > > true... > > Yeah, when describing the problem I was also thinking about the > possibilities of optimization (it would be a very good general > improvement). Or maybe Mel already has some solutions :) > > > Also as a nitpick, this isn't really an exponential growth since > > the exponent is fixed (256 - actually it should be 512, right?). It's just > > Right, 512 4k pages to form one x86_64 2MB huge page. > > > a polynomial with a big exponent. But sure, growth in number of PG_Writeback > > pages will cause relatively steep drop in the number of available huge > > pages. > > It's exponential indeed, because "1 - p(x)" here means "p(!x)". > It's exponential for a 10x increase in x resulting in 100x drop of y. If 'x' is the probability a page has PG_Writeback set, then the probability a huge page has not a single PG_Writeback page is (as you almost correctly wrote): (1-x)^512. This is a polynomial by definition: it can be expressed as $\sum_{i=0}^n a_i*x^i$ for $a_i\in R$ and $n$ finite. The expression decreases fast as x approaches 1, that's for sure, but that does not make it exponential. Sorry, my mathematical part could not resist this terminology correction. > > ... > > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > > kick in to add *double* throttling in block/cfq layer. Now the low > > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > > from its balanced state, leading to large fluctuations and program > > > > > stalls. > > > > > > > > Just do the same 1:1 inside each cgroup. > > > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. 
> > > For example there are only 2 dd tasks doing buffered writes in the > > > system. Now consider the mismatch that cfq is dispatching their IO > > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > > the dd tasks at 1:1 equal split because it's not aware of the cgroup > > > weights. > > > > > > What will happen in the end? The 1:1 ratio imposed by > > > balance_dirty_pages() will take effect and the dd tasks will progress > > > at the same pace. The cfq weights will be defeated because the async > > > queue for the second dd (and cgroup) constantly runs empty. > > Yup. This just shows that you have to have per-cgroup dirty limits. Once > > you have those, things start working again. > > Right. I think Tejun was more or less aware of this. > > I was rather upset by this per-memcg dirty_limit idea indeed. I never > expected it to work well when used extensively. My plan was to set the > default memcg dirty_limit high enough, so that it's not hit in normal use. > Then Tejun came and proposed to (mis-)use dirty_limit as the way to > convert the dirty pages' backpressure into real dirty throttling rate. > No, that's just a crazy idea! > > Come on, let's not over-use memcg's dirty_limit. It's there as the > *last resort* to keep dirty pages under control so as to maintain > interactive performance inside the cgroup. However if used extensively > in the system (like dozens of memcgs all hitting their dirty limits), the > limit itself may stall random dirtiers and create interactive > performance issues! > > In recent days I've come up with the idea of memcg.dirty_setpoint > for the blkcg backpressure stuff. We can use that instead. > > memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. > Imagine bdi_setpoint. It's all the same concept. Why do we need this? 
> Because if blkcg A and B do 10:1 weights and are both doing buffered > writes, their dirty pages should better be maintained around a 10:1 > ratio to avoid underrun and hopefully achieve better IO size. > memcg.dirty_limit cannot guarantee that goal. I agree that to avoid stalls of throttled processes we shouldn't be hitting memcg.dirty_limit on a regular basis. When I wrote we need "per cgroup dirty limits" I actually imagined something like you write above - do complete throttling computations within each memcg - estimate the throughput available for it, compute appropriate dirty rates for its processes and from its dirty limit estimate an appropriate setpoint to balance around. > But be warned! Partitioning the dirty pages always means more > fluctuations of dirty rates (and even stalls) that's perceivable by > the user. Which means another limiting factor for the backpressure > based IO controller to scale well. Sure, the smaller the memcg gets, the more noticeable these fluctuations would be. I would not expect a memcg with 200 MB of memory to behave better (and also not much worse) than if I had a machine with that much memory... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
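The memcg.dirty_setpoint idea discussed above lends itself to a toy calculation (a hedged sketch: memcg.dirty_setpoint and blkcg.writeout_rate are knobs proposed in this thread, not existing interfaces, and the helper below is invented for illustration): split the global dirty setpoint across cgroups in proportion to each cgroup's measured writeout rate, so a 10:1 bandwidth split that cfq honors yields roughly 10:1 dirty-page targets and neither async queue underruns.

```python
# Toy model of per-cgroup dirty setpoints scaled by writeout rate.
# Names are illustrative only; these are proposed knobs, not real files.

def dirty_setpoints(global_setpoint_pages, writeout_rates):
    """Divide the global dirty setpoint among cgroups in proportion
    to each cgroup's observed writeout rate (pages/sec)."""
    total = sum(writeout_rates.values())
    return {cg: global_setpoint_pages * rate / total
            for cg, rate in writeout_rates.items()}

# blkcg A and B with a 10:1 weight split that cfq honors in bandwidth:
setpoints = dirty_setpoints(110000, {"A": 1000.0, "B": 100.0})
print(setpoints)  # A is targeted at ~10x the dirty pages of B
```

This is just the bdi_setpoint proportioning idea re-applied one level down: the dirty-page targets track the bandwidth split instead of being an equal partition, which is what memcg.dirty_limit alone cannot guarantee.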
* Re: [RFC] writeback and cgroup 2012-04-23 9:14 ` Jan Kara @ 2012-04-23 10:24 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-23 10:24 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote: > On Fri 20-04-12 21:34:41, Wu Fengguang wrote: > > On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote: > > > > It's not uncommon for me to see filesystems sleep on PG_writeback > > > > pages during heavy writeback, within some lock or transaction, which in > > > > turn stall many tasks that try to do IO or merely dirty some page in > > > > memory. Random writes are especially susceptible to such stalls. The > > > > stable page feature also vastly increase the chances of stalls by > > > > locking the writeback pages. > > > > > > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > > > > the case of direct reclaim, it means blocking random tasks that are > > > > allocating memory in the system. > > > > > > > > PG_writeback pages are much worse than PG_dirty pages in that they are > > > > not movable. This makes a big difference for high-order page allocations. > > > > To make room for a 2MB huge page, vmscan has the option to migrate > > > > PG_dirty pages, but for PG_writeback it has no better choices than to > > > > wait for IO completion. > > > > > > > > The difficulty of THP allocation goes up *exponentially* with the > > > > number of PG_writeback pages. Assume PG_writeback pages are randomly > > > > distributed in the physical memory space. Then we have formula > > > > > > > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 > > > Well, this implicitely assumes that PG_Writeback pages are scattered > > > across memory uniformly at random. 
I'm not sure to which extent this is > > > true... > > > > Yeah, when describing the problem I was also thinking about the > > possibilities of optimization (it would be a very good general > > improvements). Or maybe Mel already has some solutions :) > > > > > Also as a nitpick, this isn't really an exponential growth since > > > the exponent is fixed (256 - actually it should be 512, right?). It's just > > > > Right, 512 4k pages to form one x86_64 2MB huge pages. > > > > > a polynomial with a big exponent. But sure, growth in number of PG_Writeback > > > pages will cause relatively steep drop in the number of available huge > > > pages. > > > > It's exponential indeed, because "1 - p(x)" here means "p(!x)". > > It's exponential for a 10x increase in x resulting in 100x drop of y. > If 'x' is the probability page has PG_Writeback set, then the probability > a huge page has a single PG_Writeback page is (as you almost correctly wrote): > (1-x)^512. This is a polynominal by the definition: It can be > expressed as $\sum_{i=0}^n a_i*x^i$ for $a_i\in R$ and $n$ finite. > > The expression decreases fast as x approaches to 1, that's for sure, but > that does not make it exponential. Sorry, my mathematical part could not > resist this terminology correction. ok, ok :-) I actually got the equation wrong above, the one used in the script is correct. The correct one is "it takes all 512 component pages to be free of PG_writeback for the huge page to be free of PG_writeback and immediately reclaimable for THP". P(reclaimable for THP) = P(non-PG_writeback)^512 > > > ... > > > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > > > kick in to add *double* throttling in block/cfq layer. 
Now the low > > > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > > > from its balanced state, leading to large fluctuations and program > > > > > > stalls. > > > > > > > > > > Just do the same 1:1 inside each cgroup. > > > > > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > > > > For example there are only 2 dd tasks doing buffered writes in the > > > > system. Now consider the mismatch that cfq is dispatching their IO > > > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > > > the dd tasks at 1:1 equal split because it's not aware of the cgroup > > > > weights. > > > > > > > > What will happen in the end? The 1:1 ratio imposed by > > > > balance_dirty_pages() will take effect and the dd tasks will progress > > > > at the same pace. The cfq weights will be defeated because the async > > > > queue for the second dd (and cgroup) constantly runs empty. > > > Yup. This just shows that you have to have per-cgroup dirty limits. Once > > > you have those, things start working again. > > > > Right. I think Tejun was more of less aware of this. > > > > I was rather upset by this per-memcg dirty_limit idea indeed. I never > > expect it to work well when used extensively. My plan was to set the > > default memcg dirty_limit high enough, so that it's not hit in normal. > > Then Tejun came and proposed to (mis-)use dirty_limit as the way to > > convert the dirty pages' backpressure into real dirty throttling rate. > > No, that's just crazy idea! > > > > Come on, let's not over-use memcg's dirty_limit. It's there as the > > *last resort* to keep dirty pages under control so as to maintain > > interactive performance inside the cgroup. However if used extensively > > in the system (like dozens of memcgs all hit their dirty limits), the > > limit itself may stall random dirtiers and create interactive > > performance issues! 
> > > > In the recent days I've come up with the idea of memcg.dirty_setpoint > > for the blkcg backpressure stuff. We can use that instead. > > > > memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. > > Imagine bdi_setpoint. It's all the same concepts. Why we need this? > > Because if blkcg A and B does 10:1 weights and are both doing buffered > > writes, their dirty pages should better be maintained around 10:1 > > ratio to avoid underrun and hopefully achieve better IO size. > > memcg.dirty_limit cannot guarantee that goal. > I agree that to avoid stalls of throttled processes we shouldn't be > hitting memcg.dirty_limit on a regular basis. When I wrote we need "per > cgroup dirty limits" I actually imagined something like you write above - > do complete throttling computations within each memcg - estimate throughput > available for it, compute appropriate dirty rates for it's processes and > from its dirty limit estimate appropriate setpoint to balance around. > Yes. balance_dirty_pages() will need both dirty pages and dirty page writeout rate for the cgroup to do proper dirty throttling for it. > > But be warned! Partitioning the dirty pages always means more > > fluctuations of dirty rates (and even stalls) that's perceivable by > > the user. Which means another limiting factor for the backpressure > > based IO controller to scale well. > Sure, the smaller the memcg gets, the more noticeable these fluctuations > would be. I would not expect memcg with 200 MB of memory to behave better > (and also not much worse) than if I have a machine with that much memory... It would be much worse if it's one single flusher thread round robin over the cgroups... For a small machine with 200MB memory, its IO completion events can arrive continuously over time. 
However if it's a 2000MB box divided into 10 cgroups and the flusher is writing out dirty pages, spending 0.5s on each cgroup and then going on to the next, then for any single cgroup, its IO completion events go quiet for 9.5s and burst in the other 0.5s. It becomes really hard to control the number of dirty pages. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
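The probability argument threaded through the messages above is easy to check numerically (a sketch under the thread's own assumptions: 512 4k pages per x86_64 2MB huge page, PG_writeback pages scattered uniformly at random). Using Fengguang's corrected formula, P(reclaimable for THP) = P(non-PG_writeback)^512 = (1-x)^512, a small rise in the writeback fraction x collapses the fraction of immediately reclaimable huge-page regions, even though, as Jan notes, the expression is a polynomial in x rather than an exponential.

```python
# Probability that a 2MB huge-page region contains no PG_writeback page,
# assuming writeback pages are uniformly scattered (per the thread):
#   P(reclaimable for THP) = (1 - x) ** 512
def thp_reclaimable(x, components=512):
    """x: fraction of pages with PG_writeback set."""
    return (1.0 - x) ** components

for x in (0.0001, 0.001, 0.01):
    print(f"x = {x:.4f}  ->  P = {thp_reclaimable(x):.4f}")
# Even 1% writeback pages leaves well under 1% of huge-page
# regions immediately reclaimable.
```

The steep drop (roughly 0.95 at x = 0.0001, 0.60 at x = 0.001, under 0.01 at x = 0.01) is what makes scattered PG_writeback pages so painful for THP allocation, whatever one calls the curve.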
* Re: [RFC] writeback and cgroup @ 2012-04-23 10:24 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-23 10:24 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote: > On Fri 20-04-12 21:34:41, Wu Fengguang wrote: > > On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote: > > > > It's not uncommon for me to see filesystems sleep on PG_writeback > > > > pages during heavy writeback, within some lock or transaction, which in > > > > turn stall many tasks that try to do IO or merely dirty some page in > > > > memory. Random writes are especially susceptible to such stalls. The > > > > stable page feature also vastly increase the chances of stalls by > > > > locking the writeback pages. > > > > > > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > > > > the case of direct reclaim, it means blocking random tasks that are > > > > allocating memory in the system. > > > > > > > > PG_writeback pages are much worse than PG_dirty pages in that they are > > > > not movable. This makes a big difference for high-order page allocations. > > > > To make room for a 2MB huge page, vmscan has the option to migrate > > > > PG_dirty pages, but for PG_writeback it has no better choices than to > > > > wait for IO completion. > > > > > > > > The difficulty of THP allocation goes up *exponentially* with the > > > > number of PG_writeback pages. Assume PG_writeback pages are randomly > > > > distributed in the physical memory space. Then we have formula > > > > > > > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 > > > Well, this implicitely assumes that PG_Writeback pages are scattered > > > across memory uniformly at random. I'm not sure to which extent this is > > > true... 
> > > > Yeah, when describing the problem I was also thinking about the > > possibilities of optimization (it would be a very good general > > improvements). Or maybe Mel already has some solutions :) > > > > > Also as a nitpick, this isn't really an exponential growth since > > > the exponent is fixed (256 - actually it should be 512, right?). It's just > > > > Right, 512 4k pages to form one x86_64 2MB huge pages. > > > > > a polynomial with a big exponent. But sure, growth in number of PG_Writeback > > > pages will cause relatively steep drop in the number of available huge > > > pages. > > > > It's exponential indeed, because "1 - p(x)" here means "p(!x)". > > It's exponential for a 10x increase in x resulting in 100x drop of y. > If 'x' is the probability page has PG_Writeback set, then the probability > a huge page has a single PG_Writeback page is (as you almost correctly wrote): > (1-x)^512. This is a polynominal by the definition: It can be > expressed as $\sum_{i=0}^n a_i*x^i$ for $a_i\in R$ and $n$ finite. > > The expression decreases fast as x approaches to 1, that's for sure, but > that does not make it exponential. Sorry, my mathematical part could not > resist this terminology correction. ok, ok :-) I actually got the equation wrong above, the one used in the script is correct. The correct one is "it takes all 512 component pages to be free of PG_writeback for the huge page to be free of PG_writeback and immediately reclaimable for THP". P(reclaimable for THP) = P(non-PG_writeback)^512 > > > ... > > > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > > > kick in to add *double* throttling in block/cfq layer. Now the low > > > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > > > from its balanced state, leading to large fluctuations and program > > > > > > stalls. 
> > > > > > > > > > Just do the same 1:1 inside each cgroup. > > > > > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > > > > For example there are only 2 dd tasks doing buffered writes in the > > > > system. Now consider the mismatch that cfq is dispatching their IO > > > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > > > the dd tasks at a 1:1 equal split because it's not aware of the cgroup > > > > weights. > > > > > > > > What will happen in the end? The 1:1 ratio imposed by > > > > balance_dirty_pages() will take effect and the dd tasks will progress > > > > at the same pace. The cfq weights will be defeated because the async > > > > queue for the second dd (and cgroup) constantly runs empty. > > > Yup. This just shows that you have to have per-cgroup dirty limits. Once > > > you have those, things start working again. > > > > Right. I think Tejun was more or less aware of this. > > > > I was rather upset by this per-memcg dirty_limit idea indeed. I never > > expected it to work well when used extensively. My plan was to set the > > default memcg dirty_limit high enough, so that it's not hit in normal conditions. > > Then Tejun came and proposed to (mis-)use dirty_limit as the way to > > convert the dirty pages' backpressure into a real dirty throttling rate. > > No, that's just a crazy idea! > > > > Come on, let's not over-use memcg's dirty_limit. It's there as the > > *last resort* to keep dirty pages under control so as to maintain > > interactive performance inside the cgroup. However if used extensively > > in the system (like dozens of memcgs all hitting their dirty limits), the > > limit itself may stall random dirtiers and create interactive > > performance issues! > > > > In recent days I've come up with the idea of memcg.dirty_setpoint > > for the blkcg backpressure stuff. We can use that instead. > > > > memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. > > Imagine bdi_setpoint. 
It's all the same concepts. Why do we need this? > > Because if blkcg A and B do 10:1 weights and are both doing buffered > > writes, their dirty pages had better be maintained around a 10:1 > > ratio to avoid underrun and hopefully achieve better IO size. > > memcg.dirty_limit cannot guarantee that goal. > I agree that to avoid stalls of throttled processes we shouldn't be > hitting memcg.dirty_limit on a regular basis. When I wrote we need "per > cgroup dirty limits" I actually imagined something like you write above - > do complete throttling computations within each memcg - estimate throughput > available for it, compute appropriate dirty rates for its processes and > from its dirty limit estimate an appropriate setpoint to balance around. > Yes. balance_dirty_pages() will need both the dirty pages and the dirty page writeout rate for the cgroup to do proper dirty throttling for it. > > But be warned! Partitioning the dirty pages always means more > > fluctuations of dirty rates (and even stalls) that are perceivable by > > the user. Which means another limiting factor for the backpressure > > based IO controller to scale well. > Sure, the smaller the memcg gets, the more noticeable these fluctuations > would be. I would not expect a memcg with 200 MB of memory to behave better > (and also not much worse) than if I have a machine with that much memory... It would be much worse if it's one single flusher thread round-robining over the cgroups... For a small machine with 200MB memory, its IO completion events can arrive continuously over time. However if it's a 2000MB box divided into 10 cgroups and the flusher is writing out dirty pages, spending 0.5s on each cgroup and then going on to the next, then for any single cgroup, its IO completion events go quiet for 9.5s and burst in the other 0.5s. It becomes really hard to control the number of dirty pages. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: dont@kvack.org ^ permalink raw reply [flat|nested] 261+ messages in thread
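The round-robin starvation Fengguang describes above can be sketched with a toy timeline; `N`, `SLICE`, and the helper function are illustrative assumptions, not kernel code:

```python
# One flusher visits N cgroups in turn, spending SLICE seconds on each,
# so any single cgroup sees IO completions only during its own slice
# and silence for the remaining (N - 1) * SLICE of every cycle.
N, SLICE = 10, 0.5

def service_windows(cgroup, rounds=2):
    """Return the (start, end) times when `cgroup` receives completions."""
    period = N * SLICE
    return [(r * period + cgroup * SLICE,
             r * period + (cgroup + 1) * SLICE)
            for r in range(rounds)]

wins = service_windows(3)
quiet = wins[1][0] - wins[0][1]   # silent gap between consecutive slices
print(f"windows={wins}, quiet gap={quiet}s")
```

The bursty (N-1)*SLICE quiet gap per cgroup is what makes per-cgroup dirty-page control hard, compared with a small standalone machine whose completion events arrive continuously.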
* Re: [RFC] writeback and cgroup 2012-04-23 10:24 ` Fengguang Wu @ 2012-04-23 12:42 ` Jan Kara -1 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-23 12:42 UTC (permalink / raw) To: Fengguang Wu Cc: Jan Kara, Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman On Mon 23-04-12 18:24:20, Wu Fengguang wrote: > On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote: > > On Fri 20-04-12 21:34:41, Wu Fengguang wrote: > > > > ... > > > > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > > > > kick in to add *double* throttling in block/cfq layer. Now the low > > > > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > > > > from its balanced state, leading to large fluctuations and program > > > > > > > stalls. > > > > > > > > > > > > Just do the same 1:1 inside each cgroup. > > > > > > > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > > > > > For example there are only 2 dd tasks doing buffered writes in the > > > > > system. Now consider the mismatch that cfq is dispatching their IO > > > > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > > > > the dd tasks at 1:1 equal split because it's not aware of the cgroup > > > > > weights. > > > > > > > > > > What will happen in the end? The 1:1 ratio imposed by > > > > > balance_dirty_pages() will take effect and the dd tasks will progress > > > > > at the same pace. The cfq weights will be defeated because the async > > > > > queue for the second dd (and cgroup) constantly runs empty. > > > > Yup. This just shows that you have to have per-cgroup dirty limits. Once > > > > you have those, things start working again. > > > > > > Right. I think Tejun was more of less aware of this. 
> > > > > > I was rather upset by this per-memcg dirty_limit idea indeed. I never > > > expect it to work well when used extensively. My plan was to set the > > > default memcg dirty_limit high enough, so that it's not hit in normal. > > > Then Tejun came and proposed to (mis-)use dirty_limit as the way to > > > convert the dirty pages' backpressure into real dirty throttling rate. > > > No, that's just crazy idea! > > > > > > Come on, let's not over-use memcg's dirty_limit. It's there as the > > > *last resort* to keep dirty pages under control so as to maintain > > > interactive performance inside the cgroup. However if used extensively > > > in the system (like dozens of memcgs all hit their dirty limits), the > > > limit itself may stall random dirtiers and create interactive > > > performance issues! > > > > > > In the recent days I've come up with the idea of memcg.dirty_setpoint > > > for the blkcg backpressure stuff. We can use that instead. > > > > > > memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. > > > Imagine bdi_setpoint. It's all the same concepts. Why we need this? > > > Because if blkcg A and B does 10:1 weights and are both doing buffered > > > writes, their dirty pages should better be maintained around 10:1 > > > ratio to avoid underrun and hopefully achieve better IO size. > > > memcg.dirty_limit cannot guarantee that goal. > > I agree that to avoid stalls of throttled processes we shouldn't be > > hitting memcg.dirty_limit on a regular basis. When I wrote we need "per > > cgroup dirty limits" I actually imagined something like you write above - > > do complete throttling computations within each memcg - estimate throughput > > available for it, compute appropriate dirty rates for it's processes and > > from its dirty limit estimate appropriate setpoint to balance around. > > > > Yes. balance_dirty_pages() will need both dirty pages and dirty page > writeout rate for the cgroup to do proper dirty throttling for it. 
> > > > But be warned! Partitioning the dirty pages always means more > > > fluctuations of dirty rates (and even stalls) that's perceivable by > > > the user. Which means another limiting factor for the backpressure > > > based IO controller to scale well. > > Sure, the smaller the memcg gets, the more noticeable these fluctuations > > would be. I would not expect memcg with 200 MB of memory to behave better > > (and also not much worse) than if I have a machine with that much memory... > > It would be much worse if it's one single flusher thread round robin > over the cgroups... > > For a small machine with 200MB memory, its IO completion events can > arrive continuously over time. However if its a 2000MB box divided > into 10 cgroups and the flusher is writing out dirty pages, spending > 0.5s on each cgroup and then go on to the next, then for any single > cgroup, its IO completion events go quiet for every 9.5s and goes up > on the other 0.5s. It becomes really hard to control the number of > dirty pages. Umm, but flusher does not spend 0.5s on each cgroup. It submits 0.5s worth of IO for each cgroup. Since the throughput computed for each cgroup will be scaled down accordingly (and thus write_chunk will be scaled down as well), it should end up submitting 0.5s worth of IO for the whole system after it traverses all the cgroups, shouldn't it? Effectively we will work with smaller write_chunk which will lead to lower total throughput - that's the price of partitioning and higher fairness requirements (previously the requirement was to switch to a new inode every 0.5s, now the requirement is to switch to a new inode in each cgroup every 0.5s). In the end, we may end up increasing the write_chunk by some factor like \sqrt(number of memcgs) to get some middle ground between the guaranteed small latency and reasonable total throughput but before I'd go for such hacks, I'd wait to see real numbers - e.g. 
paying 10% of total throughput for partitioning the machine into 10 IO intensive cgroups (as in your tests with dd's) would be a reasonable cost in my opinion. Also the granularity of IO completions should depend more on the granularity of IO scheduler (CFQ) rather than the granularity of flusher thread as such so I wouldn't think that would be a problem. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 261+ messages in thread
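Jan's trade-off — scale write_chunk down per cgroup, or settle on a \sqrt(N) middle ground — can be sketched as follows; `base_chunk` and the concrete numbers are assumptions for illustration, not actual kernel tunables:

```python
import math

# With one flusher serving n cgroups, naive scaling divides the global
# write chunk by n (total IO submitted per cycle stays constant, but
# each chunk becomes small and seek-prone).  The sqrt(n) compromise
# keeps chunks larger at the cost of longer per-cgroup revisit latency.
def chunk_naive(base_chunk, n):
    return base_chunk / n

def chunk_sqrt(base_chunk, n):
    return base_chunk / math.sqrt(n)

base = 4096  # pages; stands in for "0.5s worth of IO" on one device
for n in (1, 4, 16):
    print(f"{n:2d} cgroups: naive={chunk_naive(base, n):7.1f} "
          f"sqrt={chunk_sqrt(base, n):7.1f}")
```

At 16 cgroups the naive chunk has shrunk 16x while the sqrt compromise only shrinks it 4x, which is the "middle ground between guaranteed small latency and reasonable total throughput" Jan mentions.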
* Re: [RFC] writeback and cgroup 2012-04-23 12:42 ` Jan Kara @ 2012-04-23 14:31 ` Fengguang Wu -1 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-23 14:31 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman On Mon, Apr 23, 2012 at 02:42:40PM +0200, Jan Kara wrote: > On Mon 23-04-12 18:24:20, Wu Fengguang wrote: > > On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote: > > > On Fri 20-04-12 21:34:41, Wu Fengguang wrote: > > > > But be warned! Partitioning the dirty pages always means more > > > > fluctuations of dirty rates (and even stalls) that's perceivable by > > > > the user. Which means another limiting factor for the backpressure > > > > based IO controller to scale well. > > > Sure, the smaller the memcg gets, the more noticeable these fluctuations > > > would be. I would not expect memcg with 200 MB of memory to behave better > > > (and also not much worse) than if I have a machine with that much memory... > > > > It would be much worse if it's one single flusher thread round robin > > over the cgroups... > > > > For a small machine with 200MB memory, its IO completion events can > > arrive continuously over time. However if its a 2000MB box divided > > into 10 cgroups and the flusher is writing out dirty pages, spending > > 0.5s on each cgroup and then go on to the next, then for any single > > cgroup, its IO completion events go quiet for every 9.5s and goes up > > on the other 0.5s. It becomes really hard to control the number of > > dirty pages. > Umm, but flusher does not spend 0.5s on each cgroup. It submits 0.5s > worth of IO for each cgroup. Right. 
> Since the throughput computed for each cgroup > will be scaled down accordingly (and thus write_chunk will be scaled down > as well), it should end up submitting 0.5s worth of IO for the whole system > after it traverses all the cgroups, shouldn't it? Effectively we will work > with smaller write_chunk which will lead to lower total throughput - that's > the price of partitioning and higher fairness requirements (previously the Sure you can do that. However I think we were talking about memcg dirty limits, in which case we still have a good chance of keeping the 0.5s per inode granularity by making the dirty limits high so that they won't be hit normally. Only when there are so many memory cgroups that the flusher cannot easily safeguard fairness among them should we consider decreasing the writeback chunk size. > requirement was to switch to a new inode every 0.5s, now the requirement is > to switch to a new inode in each cgroup every 0.5s). In the end, we may end > up increasing the write_chunk by some factor like \sqrt(number of memcgs) > to get some middle ground between the guaranteed small latency and > reasonable total throughput but before I'd go for such hacks, I'd wait to > see real numbers - e.g. paying 10% of total throughput for partitioning the > machine into 10 IO intensive cgroups (as in your tests with dd's) would be > a reasonable cost in my opinion. For IO cgroups, I'd always prefer to avoid partitioning the dirty pages and async IO queue so as to avoid such embarrassing tradeoffs in the first place :-) > Also the granularity of IO completions should depend more on the > granularity of IO scheduler (CFQ) rather than the granularity of flusher > thread as such so I wouldn't think that would be a problem. By avoiding the partitions, we'll eliminate the fairness problem. So the coarse granularity of the flusher won't be a problem for IO cgroups at all. 
balance_dirty_pages() will do proper throttling when dirty pages are created, based directly on the blkcg weights and ongoing IO. After that, all async IOs go as a single stream from the flusher to the storage. There is no need for page tracking. No split inode lists, and hence no granularity or shared-inode issues for the flusher. Above all, there will be no degradation of performance at all, whether it be throughput, latency or responsiveness. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
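Fengguang's scheme — apply the blkcg weights once, at balance_dirty_pages() time, and let the flusher issue one undivided async stream — could be sketched like this; the bandwidth-splitting helper is a hypothetical illustration, not an existing kernel interface:

```python
# Split the measured device writeout bandwidth among cgroups by blkcg
# weight; balance_dirty_pages() would then pace each cgroup's dirtiers
# at its share, with no per-cgroup partitioning of the async IO queue.
def dirty_rate_targets(total_bw_mbps, weights):
    """weights: {cgroup_name: blkcg weight}; returns MB/s per cgroup."""
    wsum = sum(weights.values())
    return {cg: total_bw_mbps * w / wsum for cg, w in weights.items()}

# The 2-dd example from the thread: 10:1 weights sharing a 110 MB/s disk.
targets = dirty_rate_targets(110, {"A": 10, "B": 1})
print(targets)
```

Because the throttling happens at dirtying time, the second dd's async queue need never run empty the way it does when cfq alone enforces the 10:1 split against an unaware 1:1 balance_dirty_pages().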
Effectively we will work with smaller write_chunk which will lead to lower total throughput - that's the price of partitioning and higher fairness requirements (previously the requirement was to switch to a new inode every 0.5s, now the requirement is to switch to a new inode in each cgroup every 0.5s). In the end, we may end up increasing the write_chunk by some factor like \sqrt(number of memcgs) to get some middle ground between the guaranteed small latency and reasonable total throughput but before I'd go for such hacks, I'd wait to see real numbers - e.g. paying 10% of total throughput for partitioning the machine into 10 IO intensive cgroups (as in your tests with dd's) would be a reasonable cost in my opinion. Also the granularity of IO completions should depend more on the granularity of IO scheduler (CFQ) rather than the granularity of flusher thread as such so I wouldn't think that would be a problem. Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR
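Jan's suggested middle ground, scaling write_chunk by \sqrt(number of memcgs), can be illustrated numerically (a sketch only; the 16 MB base chunk is made up, not a kernel default):

```python
import math

# Per-cgroup chunk if the global write_chunk were split N ways, then
# boosted by the sqrt(N) factor Jan proposes as a latency/throughput
# compromise. Illustrative constants, not kernel defaults.
def effective_write_chunk(base_chunk_mb, n_cgroups):
    return base_chunk_mb * math.sqrt(n_cgroups) / n_cgroups

print(effective_write_chunk(16, 1))    # 16.0 -> unpartitioned chunk
print(effective_write_chunk(16, 10))   # ~5.1 MB instead of a naive 1.6 MB
```

The point of the factor: a naive N-way split shrinks each cgroup's chunk by N, while the sqrt(N) boost limits the shrinkage to sqrt(N), trading a bounded amount of per-cgroup latency back for IO size.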
* Re: [RFC] writeback and cgroup [not found] ` <20120423091432.GC6512-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> @ 2012-04-23 10:24 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-23 10:24 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA, Mel Gorman On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote: > On Fri 20-04-12 21:34:41, Wu Fengguang wrote: > > On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote: > > > > It's not uncommon for me to see filesystems sleep on PG_writeback > > > > pages during heavy writeback, within some lock or transaction, which in > > > > turn stall many tasks that try to do IO or merely dirty some page in > > > > memory. Random writes are especially susceptible to such stalls. The > > > > stable page feature also vastly increase the chances of stalls by > > > > locking the writeback pages. > > > > > > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > > > > the case of direct reclaim, it means blocking random tasks that are > > > > allocating memory in the system. > > > > > > > > PG_writeback pages are much worse than PG_dirty pages in that they are > > > > not movable. This makes a big difference for high-order page allocations. > > > > To make room for a 2MB huge page, vmscan has the option to migrate > > > > PG_dirty pages, but for PG_writeback it has no better choices than to > > > > wait for IO completion. > > > > > > > > The difficulty of THP allocation goes up *exponentially* with the > > > > number of PG_writeback pages. 
Assume PG_writeback pages are randomly > > > > distributed in the physical memory space. Then we have formula > > > > > > > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 > > > Well, this implicitely assumes that PG_Writeback pages are scattered > > > across memory uniformly at random. I'm not sure to which extent this is > > > true... > > > > Yeah, when describing the problem I was also thinking about the > > possibilities of optimization (it would be a very good general > > improvements). Or maybe Mel already has some solutions :) > > > > > Also as a nitpick, this isn't really an exponential growth since > > > the exponent is fixed (256 - actually it should be 512, right?). It's just > > > > Right, 512 4k pages to form one x86_64 2MB huge pages. > > > > > a polynomial with a big exponent. But sure, growth in number of PG_Writeback > > > pages will cause relatively steep drop in the number of available huge > > > pages. > > > > It's exponential indeed, because "1 - p(x)" here means "p(!x)". > > It's exponential for a 10x increase in x resulting in 100x drop of y. > If 'x' is the probability page has PG_Writeback set, then the probability > a huge page has a single PG_Writeback page is (as you almost correctly wrote): > (1-x)^512. This is a polynominal by the definition: It can be > expressed as $\sum_{i=0}^n a_i*x^i$ for $a_i\in R$ and $n$ finite. > > The expression decreases fast as x approaches to 1, that's for sure, but > that does not make it exponential. Sorry, my mathematical part could not > resist this terminology correction. ok, ok :-) I actually got the equation wrong above, the one used in the script is correct. The correct one is "it takes all 512 component pages to be free of PG_writeback for the huge page to be free of PG_writeback and immediately reclaimable for THP". P(reclaimable for THP) = P(non-PG_writeback)^512 > > > ... > > > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. 
> > > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > > > kick in to add *double* throttling in block/cfq layer. Now the low > > > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > > > from its balanced state, leading to large fluctuations and program > > > > > > stalls. > > > > > > > > > > Just do the same 1:1 inside each cgroup. > > > > > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > > > > For example there are only 2 dd tasks doing buffered writes in the > > > > system. Now consider the mismatch that cfq is dispatching their IO > > > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > > > the dd tasks at 1:1 equal split because it's not aware of the cgroup > > > > weights. > > > > > > > > What will happen in the end? The 1:1 ratio imposed by > > > > balance_dirty_pages() will take effect and the dd tasks will progress > > > > at the same pace. The cfq weights will be defeated because the async > > > > queue for the second dd (and cgroup) constantly runs empty. > > > Yup. This just shows that you have to have per-cgroup dirty limits. Once > > > you have those, things start working again. > > > > Right. I think Tejun was more of less aware of this. > > > > I was rather upset by this per-memcg dirty_limit idea indeed. I never > > expect it to work well when used extensively. My plan was to set the > > default memcg dirty_limit high enough, so that it's not hit in normal. > > Then Tejun came and proposed to (mis-)use dirty_limit as the way to > > convert the dirty pages' backpressure into real dirty throttling rate. > > No, that's just crazy idea! > > > > Come on, let's not over-use memcg's dirty_limit. It's there as the > > *last resort* to keep dirty pages under control so as to maintain > > interactive performance inside the cgroup. 
However if used extensively > > in the system (like dozens of memcgs all hit their dirty limits), the > > limit itself may stall random dirtiers and create interactive > > performance issues! > > > > In the recent days I've come up with the idea of memcg.dirty_setpoint > > for the blkcg backpressure stuff. We can use that instead. > > > > memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. > > Imagine bdi_setpoint. It's all the same concepts. Why we need this? > > Because if blkcg A and B does 10:1 weights and are both doing buffered > > writes, their dirty pages should better be maintained around 10:1 > > ratio to avoid underrun and hopefully achieve better IO size. > > memcg.dirty_limit cannot guarantee that goal. > I agree that to avoid stalls of throttled processes we shouldn't be > hitting memcg.dirty_limit on a regular basis. When I wrote we need "per > cgroup dirty limits" I actually imagined something like you write above - > do complete throttling computations within each memcg - estimate throughput > available for it, compute appropriate dirty rates for it's processes and > from its dirty limit estimate appropriate setpoint to balance around. > Yes. balance_dirty_pages() will need both dirty pages and dirty page writeout rate for the cgroup to do proper dirty throttling for it. > > But be warned! Partitioning the dirty pages always means more > > fluctuations of dirty rates (and even stalls) that's perceivable by > > the user. Which means another limiting factor for the backpressure > > based IO controller to scale well. > Sure, the smaller the memcg gets, the more noticeable these fluctuations > would be. I would not expect memcg with 200 MB of memory to behave better > (and also not much worse) than if I have a machine with that much memory... It would be much worse if it's one single flusher thread round robin over the cgroups... For a small machine with 200MB memory, its IO completion events can arrive continuously over time. 
However if it's a 2000MB box divided into 10 cgroups and the flusher is writing out dirty pages, spending 0.5s on each cgroup before going on to the next, then for any single cgroup, its IO completion events go quiet for 9.5s and spike in the other 0.5s. It becomes really hard to control the number of dirty pages. Thanks, Fengguang
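The burstiness Fengguang worries about can be modeled very crudely (illustrative only; the real quiet time depends on how long a full flusher lap takes, and the mail's 9.5s figure corresponds to a 10s lap. Jan argues elsewhere in the thread that the flusher submits 0.5s *worth* of IO rather than spending 0.5s per cgroup, which shortens the lap considerably):

```python
# Quiet period between service slices for one cgroup, assuming a single
# flusher that takes lap_s to visit every cgroup once and spends slice_s
# writing for each. Purely a sketch with invented parameters.
def quiet_period(lap_s, slice_s):
    return lap_s - slice_s

print(quiet_period(0.5, 0.5))   # 0.0 -> one cgroup: continuous completions
print(quiet_period(5.0, 0.5))   # 4.5 -> e.g. 10 cgroups at 0.5s each
```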
* Re: [RFC] writeback and cgroup [not found] ` <20120419202635.GA4795-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> @ 2012-04-20 13:34 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-20 13:34 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA, Mel Gorman On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote: > On Thu 19-04-12 22:23:43, Wu Fengguang wrote: > > For one instance, splitting the request queues will give rise to > > PG_writeback pages. Those pages have been the biggest source of > > latency issues in the various parts of the system. > Well, if we allow more requests to be in flight in total then yes, number > of PG_Writeback pages can be higher as well. Exactly. > > It's not uncommon for me to see filesystems sleep on PG_writeback > > pages during heavy writeback, within some lock or transaction, which in > > turn stall many tasks that try to do IO or merely dirty some page in > > memory. Random writes are especially susceptible to such stalls. The > > stable page feature also vastly increase the chances of stalls by > > locking the writeback pages. > > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In > > the case of direct reclaim, it means blocking random tasks that are > > allocating memory in the system. > > > > PG_writeback pages are much worse than PG_dirty pages in that they are > > not movable. This makes a big difference for high-order page allocations. 
> > To make room for a 2MB huge page, vmscan has the option to migrate > > PG_dirty pages, but for PG_writeback it has no better choices than to > > wait for IO completion. > > > > The difficulty of THP allocation goes up *exponentially* with the > > number of PG_writeback pages. Assume PG_writeback pages are randomly > > distributed in the physical memory space. Then we have formula > > > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256 > Well, this implicitely assumes that PG_Writeback pages are scattered > across memory uniformly at random. I'm not sure to which extent this is > true... Yeah, when describing the problem I was also thinking about the possibilities of optimization (it would be a very good general improvements). Or maybe Mel already has some solutions :) > Also as a nitpick, this isn't really an exponential growth since > the exponent is fixed (256 - actually it should be 512, right?). It's just Right, 512 4k pages to form one x86_64 2MB huge pages. > a polynomial with a big exponent. But sure, growth in number of PG_Writeback > pages will cause relatively steep drop in the number of available huge > pages. It's exponential indeed, because "1 - p(x)" here means "p(!x)". It's exponential for a 10x increase in x resulting in 100x drop of y. > ... > > It's worth to note that running multiple flusher threads per bdi means > > not only disk seeks for spin disks, smaller IO size for SSD, but also > > lock contentions and cache bouncing for metadata heavy workloads and > > fast storage. > Well, this heavily depends on particular implementation (and chosen > data structures). But yes, we should have that in mind. The lock contentions and cache bouncing actually mainly happen in fs code due to concurrent IO submissions. Also when replying Vivek's email I realized that the disk seeks and/or smaller IO size are more fundamentally tied to the split async queues in cfq which makes it switch inodes on every async slice time (typically 40ms). > ... 
> > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > > > It's always there doing 1:1 proportional throttling. Then you try to > > > > kick in to add *double* throttling in block/cfq layer. Now the low > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > > > from its balanced state, leading to large fluctuations and program > > > > stalls. > > > > > > Just do the same 1:1 inside each cgroup. > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup. > > For example there are only 2 dd tasks doing buffered writes in the > > system. Now consider the mismatch that cfq is dispatching their IO > > requests at 10:1 weights, while balance_dirty_pages() is throttling > > the dd tasks at 1:1 equal split because it's not aware of the cgroup > > weights. > > > > What will happen in the end? The 1:1 ratio imposed by > > balance_dirty_pages() will take effect and the dd tasks will progress > > at the same pace. The cfq weights will be defeated because the async > > queue for the second dd (and cgroup) constantly runs empty. > Yup. This just shows that you have to have per-cgroup dirty limits. Once > you have those, things start working again. Right. I think Tejun was more of less aware of this. I was rather upset by this per-memcg dirty_limit idea indeed. I never expect it to work well when used extensively. My plan was to set the default memcg dirty_limit high enough, so that it's not hit in normal. Then Tejun came and proposed to (mis-)use dirty_limit as the way to convert the dirty pages' backpressure into real dirty throttling rate. No, that's just crazy idea! Come on, let's not over-use memcg's dirty_limit. It's there as the *last resort* to keep dirty pages under control so as to maintain interactive performance inside the cgroup. 
However if used extensively in the system (like dozens of memcgs all hitting their dirty limits), the limit itself may stall random dirtiers and create interactive performance issues! In recent days I've come up with the idea of memcg.dirty_setpoint for the blkcg backpressure stuff. We can use that instead. memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate. Imagine bdi_setpoint. It's all the same concepts. Why do we need this? Because if blkcg A and B have 10:1 weights and are both doing buffered writes, their dirty pages had better be maintained around a 10:1 ratio to avoid underrun and hopefully achieve better IO size. memcg.dirty_limit cannot guarantee that goal. But be warned! Partitioning the dirty pages always means more fluctuations of dirty rates (and even stalls) that are perceivable by the user. Which means another limiting factor for the backpressure based IO controller to scale well. Thanks, Fengguang
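The memcg.dirty_setpoint idea above, with setpoints scaling proportionally to blkcg.writeout_rate so that dirty page populations track the weights, could look roughly like this (hypothetical names, a sketch only):

```python
# Split a global dirty setpoint among cgroups in proportion to each
# cgroup's measured writeout rate, so 10:1 weights yield ~10:1 dirty
# page populations. All names invented for illustration.
def dirty_setpoints(global_setpoint_pages, writeout_rates):
    total = sum(writeout_rates.values())
    return {cg: global_setpoint_pages * rate / total
            for cg, rate in writeout_rates.items()}

rates = {"A": 100.0, "B": 10.0}        # MB/s per blkcg, as measured
sp = dirty_setpoints(110_000, rates)   # global setpoint: 110k dirty pages
assert sp["A"] == 10 * sp["B"]         # 100000 vs 10000 pages
```

Unlike a hard dirty_limit, a setpoint is a balance target: dirtiers are slowed smoothly as a cgroup drifts above it, rather than stalled outright when a ceiling is hit.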
* Re: [RFC] writeback and cgroup [not found] ` <20120417223854.GG19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-04-19 14:23 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-19 14:23 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA Hi Tejun, On Tue, Apr 17, 2012 at 03:38:54PM -0700, Tejun Heo wrote: > Hello, Fengguang. > > On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote: > > Fortunately, the above gap can be easily filled judging from the > > block/cfq IO controller code. By adding some direct IO accounting > > and changing several lines of my patches to make use of the collected > > stats, the semantics of the blkio.throttle.write_bps interfaces can be > > changed from "limit for direct IO" to "limit for direct+buffered IOs". > > Ditto for blkio.weight and blkio.write_iops, as long as some > > iops/device time stats are made available to balance_dirty_pages(). > > > > It would be a fairly *easy* change. :-) It's merely adding some > > accounting code and there is no need to change the block IO > > controlling algorithm at all. I'll do the work of accounting (which > > is basically independent of the IO controlling) and use the new stats > > in balance_dirty_pages(). > > I don't really understand how this can work. For hard limits, maybe, Yeah, hard limits are the easiest. > but for proportional IO, you have to know which cgroups have IOs > before assigning the proportions, so blkcg assigning IO bandwidth > without knowing async writes simply can't work. 
> > For example, let's say cgroups A and B have 2:8 split. If A has IOs > on queue and B doesn't, blkcg will assign all IO bandwidth to A. I > can't wrap my head around how writeback is gonna make use of the > resulting stats but let's say it decides it needs to put out some IOs > out for both cgroups. What happens then? Do all the async writes go > through the root cgroup controlled by and affecting the ratio between > rootcg and cgroup A and B? Or do they have to be accounted as part of > cgroups A and B? If so, what if the added bandwidth goes over the > limit? Let's say if we implement overcharge; then, I suppose we'll > have to communicate that upwards too, right? The trick is to do the throttling for buffered writes at page dirty time, when balance_dirty_pages() knows exactly what cgroup the dirtier task belongs to, the dirty rate and whether or not it's an aggressive dirtier. The cgroup's direct IO rate can also be measured. The only missing information is whether it's a non-aggressive direct writer (only cfq may know about that). Now I'm simply assuming direct writers are all aggressive. So if A and B have 2:8 split and A only submits async IO and B only submits direct IO, there will be no cfqg exist for A at all. cfq will be serving B and root cgroup interleavely. In the patch I just posted, blkcg_update_dirty_ratelimit() will transfer A's weight 2 to the root cgroup for use by the flusher. In the end the flusher gets weight 2 and B gets weight 8. Here we need to distinguish the weight assigned by user and the weight after the async/sync adjustment. The other missing information is the real cost when the dirtied pages eventually hit the disk after perhaps dozens of seconds. For that part I'm assuming simple dd at this time and balance_dirty_pages() is now splitting out the flusher's overall writeout progress to the dirtier tasks' dirty ratelimit based on bandwidth fairness. > This is still easy. What about hierarchical propio? What happens > then? 
You can't do hierarchical proportional allocation without > knowing how much IOs are pending for which group. How is that > information gonna be communicated between blkcg and writeback? Are we > gonna have two separate hierarchical proportional IO allocators? How > is that gonna work at all? If we're gonna have single allocator in > block layer, writeback would have to feed the amount of IOs it may > generate into the allocator, get the resulting allocation and then > issue IO and then block layer again will have to account these to the > originating cgroups. It's just crazy. No I have not got the idea on how to do the hierarchical proportional IO controller without physically splitting up the async IO streams. It's pretty hard and I'd better break out before it drives me crazy. So in the following discussion, let's assume cfq will move async requests from the current root cgroup to individual IO issuer's cfqgs and schedule service for the async streams there. And thus the need to create "backpressure" for balance_dirty_pages() to eventually throttle the individual dirtier tasks. That said, I still don't think we've come up with any satisfactory solutions. It's hard problem after all. > > The only problem I can see now, is that balance_dirty_pages() works > > per-bdi and blkcg works per-device. So the two ends may not match > > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where > > sdb is shared by lv0 and lv1. However it should be rare situations and > > be much more acceptable than the problems arise from the "push back" > > approach which impacts everyone. > > I don't know. What problems? AFAICS, the biggest issue is writeback > of different inodes getting mixed resulting in poor performance, but > if you think about it, that's about the frequency of switching cgroups > and a problem which can and should be dealt with from block layer > (e.g. use larger time slice if all the pending IOs are async). 
Yeah increasing time slice would help that case. In general it's not merely the frequency of switching cgroup if take hard disk' writeback cache into account. Think about some inodes with async IO: A1, A2, A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different cgroups. So when the root cgroup holds all async inodes, the cfq may schedule IO interleavely like this A1, A1, A1, A2, A1, A2, ... D1, D2, D3, D4, D5, D6, ... Now it becomes A1, A2, A3, A4, A5, A6, ... D1, D2, D3, D4, D5, D6, ... The difference is that it's now switching the async inodes each time. At cfq level, the seek costs look the same, however the disk's writeback cache may help merge the data chunks from the same inode A1. Well, it may cost some latency for spin disks. But how about SSD? It can run deeper queue and benefit from large writes. > Writeback's duty is generating stream of async writes which can be > served efficiently for the *cgroup* and keeping the buffer filled as > necessary and chaining the backpressure from there to the actual > dirtier. That's what writeback does without cgroup. Nothing > fundamental changes with cgroup. It's just finer grained. Believe me, physically partitioning the dirty pages and async IO streams comes at big costs. It won't scale well in many ways. For one instance, splitting the request queues will give rise to PG_writeback pages. Those pages have been the biggest source of latency issues in the various parts of the system. It's not uncommon for me to see filesystems sleep on PG_writeback pages during heavy writeback, within some lock or transaction, which in turn stall many tasks that try to do IO or merely dirty some page in memory. Random writes are especially susceptible to such stalls. The stable page feature also vastly increase the chances of stalls by locking the writeback pages. Page reclaim may also block on PG_writeback and/or PG_dirty pages. 
In the case of direct reclaim, it means blocking random tasks that are allocating memory in the system.

PG_writeback pages are much worse than PG_dirty pages in that they are not movable. This makes a big difference for high-order page allocations. To make room for a 2MB huge page, vmscan has the option to migrate PG_dirty pages, but for PG_writeback it has no better choices than to wait for IO completion.

The difficulty of THP allocation goes up *exponentially* with the number of PG_writeback pages. Assume PG_writeback pages are randomly distributed in the physical memory space. Then we have formula

	P(reclaimable for THP) = 1 - P(hit PG_writeback)^256

That's the probability for a contiguous range of 256 pages to be free of PG_writeback, so that it's immediately reclaimable for use by transparent huge page. This ruby script shows us the concrete numbers.

irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }

	P(hit PG_writeback)	P(reclaimable for THP)
	0.001			0.599
	0.002			0.359
	0.003			0.215
	0.004			0.128
	0.005			0.077
	0.006			0.046
	0.007			0.027
	0.008			0.016
	0.009			0.010
	0.010			0.006

The numbers show that when the PG_writeback pages go up from 0.1% to 1% of system memory, the THP reclaim success ratio drops quickly from 60% to 0.6%. It indicates that in order to use THP without constantly running into stalls, the reasonable PG_writeback ratio is <= 0.1%. Going beyond that threshold, it quickly becomes intolerable. That makes a limit of 256MB writeback pages for a mem=256GB system.
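The irb table can be reproduced with the corrected expression Fengguang settles on elsewhere in the thread, P(reclaimable for THP) = (1 - x)^512, for x86_64's 512 base pages per 2MB huge page (a quick numeric check, not kernel code):

```python
# Probability that a 512-page-aligned region is free of PG_writeback,
# assuming PG_writeback pages are scattered independently at random.
PAGES_PER_THP = 512   # 2MB / 4KB on x86_64

def p_thp_reclaimable(x):
    return (1 - x) ** PAGES_PER_THP

for permille in range(1, 11):
    x = permille / 1000.0
    print(f"{x:.3f}\t{p_thp_reclaimable(x):.3f}")
# 0.001 -> 0.599 down to 0.010 -> 0.006, matching the irb table above.

# The <= 0.1% comfort threshold applied to a mem=256GB box:
print(0.001 * 256 * 1024)   # ~262 MB of writeback pages
```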
Looking at the real vmstat:nr_writeback numbers in dd write tests:

JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback 217009
JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback 198335
JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback 306026
JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback 315099
JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback 1216058
JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback 895335

Oops, btrfs has 4GB writeback pages -- which asks for some bug fixing. Even ext4's 800MB still looks way too high, but that's ~1s worth of data per queue (or 130ms worth of data for the high performance Intel SSD, which is perhaps in danger of queue underruns?). So this system would require 512GB memory to comfortably run KVM instances with THP support.

Judging from the above numbers, we can hardly afford to split up the IO queues and proliferate writeback pages.

It's worth noting that running multiple flusher threads per bdi means not only disk seeks for spinning disks and smaller IO size for SSDs, but also lock contention and cache bouncing for metadata heavy workloads and fast storage. To give some concrete examples of how much CPU overhead can be saved by reducing multiple IO submitters, here are some summaries of the IO-less dirty throttling gains. Tests show that it yields huge benefits for reducing IO seeks as well as CPU overheads. For example, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
(by Dave Chinner) - "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code is able to call much less frequently into balance_dirty_pages() and hence access the _global_ page states) - the user space "App overhead" is reduced by 20%, by avoiding the cacheline pollution by the complex writeback code path - "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23. And for simple dd tests - "throughput for a _single_ large dd (100GB) increase from ~650MB/s to 700MB/s" - "On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s." > > > No, no, it's not about standing in my way. As Vivek said in the other > > > reply, it's that the "gap" that you filled was created *because* > > > writeback wasn't cgroup aware and now you're in turn filling that gap > > > by making writeback work around that "gap". I mean, my mind boggles. > > > Doesn't yours? I strongly believe everyone's should. > > > > Heh. It's a hard problem indeed. I felt great pains in the IO-less > > dirty throttling work. I did a lot reasoning about it, and have in > > fact kept cgroup IO controller in mind since its early days. Now I'd > > say it's hands down for it to adapt to the gap between the total IO > > limit and what's carried out by the block IO controller. > > You're not providing any valid counter arguments about the issues > being raised about the messed up design. How is anything "hands down" > here? Yeah sadly, it turns out to be not "hands down" when it comes to the proportional async/sync splits, and it's even prohibiting when comes to the hierarchical support.. > > > There's where I'm confused. How is the said split supposed to work? > > > They aren't independent. 
I mean, who gets to decide what and where > > > are those decisions enforced? > > > > Yeah, it's not independent. It's about > > > > - keep block IO cgroup untouched (in its current algorithm, for > > throttling direct IO) > > > > - let balance_dirty_pages() adapt to the throttling target > > > > buffered_write_limit = total_limit - direct_IOs > > Think about proportional allocation. You don't have a number until > you know who have pending IOs and how much. We have the IO rate. The above formula is actually working on "rates". That's good enough for calculating the ratelimit for buffered writes. We don't have to know every transient state of the pending IOs. The direct IOs are handled by cfq based on cfqg weight, and for async IOs there are plenty of dirty pages for buffering/tolerating small errors in the dirty rate control. > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > It's always there doing 1:1 proportional throttling. Then you try to > > kick in to add *double* throttling in block/cfq layer. Now the low > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > from its balanced state, leading to large fluctuations and program > > stalls. > > Just do the same 1:1 inside each cgroup. Sure. But the ratio mismatch I'm talking about is inter-cgroup. For example, there are only 2 dd tasks doing buffered writes in the system. Now consider the mismatch that cfq is dispatching their IO requests at 10:1 weights, while balance_dirty_pages() is throttling the dd tasks at a 1:1 equal split because it's not aware of the cgroup weights. What will happen in the end? The 1:1 ratio imposed by balance_dirty_pages() will take effect and the dd tasks will progress at the same pace. The cfq weights will be defeated because the async queue for the second dd (and cgroup) constantly runs empty. > > This can be avoided by telling balance_dirty_pages(): "your > > balance goal is no longer 1:1, but 10:1".
With this information > > balance_dirty_pages() will behave right. Then there is the question: > > if balance_dirty_pages() will work just well provided the information, > > why bother doing the throttling at low layer and "push back" the > > pressure all the way up? > > Because splitting a resource into two pieces arbitrarily with > different amount of consumptions on each side and then applying the > same proportion on both doesn't mean anything? Sorry, I don't quite catch your words here. > > The balance_dirty_pages() is already deeply involved in dirty throttling. > > As you can see from this patchset, the same algorithms can be extended > > trivially to work with cgroup IO limits. > > > > buffered write IO controller in balance_dirty_pages() > > https://lkml.org/lkml/2012/3/28/275 > > It is half broken thing with fundamental design flaws which can't be > corrected without complete reimplementation. I don't know what to > say. I'm fully aware of that, and so have been exploring new ways out :) > > In the "back pressure" scheme, memcg is a must because only it has all > > the infrastructure to track dirty pages upon which you can apply some > > dirty_limits. Don't tell me you want to account dirty pages in blkcg... > > For now, per-inode tracking seems good enough. There are actually two directions of information passing. 1) pass the dirtier ownership down to the bio. For this part, it's mostly enough to do lightweight per-inode tracking. 2) pass the backpressure up, from cfq (IO dispatch) to the flusher (IO submit) as well as to balance_dirty_pages() (to actually throttle the dirtying tasks). The flusher naturally works on inode granularities. However, balance_dirty_pages() is about limiting dirty pages. For this part, it needs to know the total number of dirty pages and the writeout bandwidth for each cgroup in order to do proper dirty throttling, and to maintain a proper number of dirty pages to avoid the queue underrun issue explained in the above 2-dd example.
> > What I can see is, it looks pretty simple and nature to let > > balance_dirty_pages() fill the gap towards a total solution :-) > > > > - add direct IO accounting in some convenient point of the IO path > > IO submission or completion point, either is fine. > > > > - change several lines of the buffered write IO controller to > > integrate the direct IO rate into the formula to fit the "total > > IO" limit > > > > - in future, add more accounting as well as feedback control to make > > balance_dirty_pages() work with IOPS and disk time > To me, you seem to be not addressing the issues I've been raising at > all and just repeating the same points again and again. If I'm > misunderstanding something, please point out. Hopefully the renewed patch can answer some of your questions. It's a pity that I didn't think about the hierarchical requirement at the time. Otherwise the complexity of the calculations would still look manageable. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
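The nr_writeback arithmetic quoted earlier in this message (4GB for btrfs, ~800MB for ext4, "~1s worth of data per queue") can be sanity-checked in a few lines. The 70MB/s and 500MB/s device speeds below are illustrative assumptions for a spinning disk and a fast SSD, not figures measured in the thread:

```python
# Sanity check of the nr_writeback figures above.  Pages are 4 KiB;
# per-device throughputs are assumed, not measured.
PAGE = 4096

def pages_to_gib(pages):
    """Convert a page count to GiB."""
    return pages * PAGE / 2**30

print(round(pages_to_gib(1216058), 2))   # btrfs 1dd: ~4.64 GiB ("4GB")
print(round(pages_to_gib(217009), 2))    # ext4 1dd:  ~0.83 GiB ("800MB")

# ~850 MiB spread over the 12-disk JBOD is ~70 MiB per device queue:
per_queue_mib = 217009 * PAGE / 12 / 2**20
print(round(per_queue_mib, 1))           # ~70.6 MiB per queue
print(round(per_queue_mib / 70, 2))      # ~1 s at an assumed 70 MB/s disk
print(round(per_queue_mib / 500, 2))     # ~0.14 s at an assumed 500 MB/s SSD
```

So "~1s per queue" and "130ms for the Intel SSD" both follow directly from the page counts under reasonable bandwidth assumptions.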
* Re: [RFC] writeback and cgroup 2012-04-06 9:59 ` Fengguang Wu @ 2012-04-18 6:57 ` Jan Kara -1 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-18 6:57 UTC (permalink / raw) To: Fengguang Wu Cc: Tejun Heo, Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Fri 06-04-12 02:59:34, Wu Fengguang wrote: ... > > > > Let's please keep the layering clear. IO limitations will be applied > > > > at the block layer and pressure will be formed there and then > > > > propagated upwards eventually to the originator. Sure, exposing the > > > > whole information might result in better behavior for certain > > > > workloads, but down the road, say, in three or five years, devices > > > > which can be shared without worrying too much about seeks might be > > > > commonplace and we could be swearing at a disgusting structural mess, > > > > and sadly various cgroup support seems to be a prominent source of > > > > such design failures. > > > > > > Super fast storages are coming which will make us regret to make the > > > IO path over complex. Spinning disks are not going away anytime soon. > > > I doubt Google is willing to afford the disk seek costs on its > > > millions of disks and has the patience to wait until switching all of > > > the spin disks to SSD years later (if it will ever happen). > > > > This is new. Let's keep the damn employer out of the discussion. > > While the area I work on is affected by my employment (writeback isn't > > even my area BTW), I'm not gonna do something adverse to upstream even > > if it's beneficial to google and I'm much more likely to do something > > which may hurt google a bit if it's gonna benefit upstream. > > > > As for the faster / newer storage argument, that is *exactly* why we > > want to keep the layering proper. Writeback works from the pressure > > from the IO stack. 
If IO technology changes, we update the IO stack > > and writeback still works from the pressure. It may need to be > > adjusted but the principles don't change. > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > It's always there doing 1:1 proportional throttling. Then you try to > kick in to add *double* throttling in block/cfq layer. Now the low > layer may enforce 10:1 throttling and push balance_dirty_pages() away > from its balanced state, leading to large fluctuations and program > stalls. This can be avoided by telling balance_dirty_pages(): "your > balance goal is no longer 1:1, but 10:1". With this information > balance_dirty_pages() will behave right. Then there is the question: > if balance_dirty_pages() will work just well provided the information, > why bother doing the throttling at low layer and "push back" the > pressure all the way up? Fengguang, maybe we should first agree on some basics: The two main goals of balance_dirty_pages() are (and always have been AFAIK) to limit the amount of dirty pages in memory and keep enough dirty pages in memory to allow for efficient writeback. Secondary goals are to also keep the amount of dirty pages somewhat fair among bdis and processes. Agreed? Thus a shift to trying to control *IO throughput* (or even just buffered write throughput) from balance_dirty_pages() is a fundamental shift in the goals of balance_dirty_pages(), not just some tweak (although technically, it might be relatively easy to do for buffered writes given the current implementation). ... > > Well, I tried and I hope some of it got through. I also wrote a lot > > of questions, mainly regarding how what you have in mind is supposed > > to work through what path. Maybe I'm just not seeing what you're > > seeing but I just can't see where all the IOs would go through and > > come together. Can you please elaborate more on that?
> > What I can see is, it looks pretty simple and nature to let > balance_dirty_pages() fill the gap towards a total solution :-) > > - add direct IO accounting in some convenient point of the IO path > IO submission or completion point, either is fine. > > - change several lines of the buffered write IO controller to > integrate the direct IO rate into the formula to fit the "total > IO" limit > > - in future, add more accounting as well as feedback control to make > balance_dirty_pages() work with IOPS and disk time Sorry Fengguang but I also think this is a wrong way to go. balance_dirty_pages() must primarily control the amount of dirty pages. Trying to bend it to control IO throughput by including direct IO and reads in the accounting will just make the logic even more complex than it already is. Honza ^ permalink raw reply [flat|nested] 261+ messages in thread
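Jan's two primary goals (cap the dirty page count, yet keep enough dirty pages around for efficient writeback) can be pictured as a band controller. The sketch below is a deliberately toy linear model with made-up thresholds; the kernel's actual control law uses a cubic "pos_ratio" curve plus bandwidth feedback, which this does not attempt to reproduce:

```python
# Toy model (NOT the kernel's algorithm) of the two balance_dirty_pages()
# goals: never exceed a hard dirty limit, but apply no pressure while there
# is headroom, so writeback stays efficient.  All numbers are invented.

def throttle_pause(dirty, background, limit, base_pause=0.01):
    """Sleep time (seconds) imposed on a dirtying task."""
    if dirty <= background:
        return 0.0                 # plenty of headroom: no throttling at all
    if dirty >= limit:
        return float("inf")        # hard limit: block until writeback catches up
    # pressure ramps up between the two thresholds; the real kernel uses a
    # cubic pos_ratio curve and per-task rate feedback instead of this ramp
    frac = (dirty - background) / (limit - background)
    return base_pause * frac / (1 - frac)

print(throttle_pause(100, background=200, limit=400))   # -> 0.0
print(throttle_pause(300, background=200, limit=400))   # -> 0.01
print(throttle_pause(390, background=200, limit=400))   # -> ~0.19
```

The point of the shape is Jan's argument: the controller's input is the dirty page count, not IO throughput; throughput control would need a different input signal entirely.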
* Re: [RFC] writeback and cgroup [not found] ` <20120418065720.GA21485-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> 2012-04-18 7:58 ` Fengguang Wu @ 2012-04-18 7:58 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-18 7:58 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote: > On Fri 06-04-12 02:59:34, Wu Fengguang wrote: > ... > > > > > Let's please keep the layering clear. IO limitations will be applied > > > > > at the block layer and pressure will be formed there and then > > > > > propagated upwards eventually to the originator. Sure, exposing the > > > > > whole information might result in better behavior for certain > > > > > workloads, but down the road, say, in three or five years, devices > > > > > which can be shared without worrying too much about seeks might be > > > > > commonplace and we could be swearing at a disgusting structural mess, > > > > > and sadly various cgroup support seems to be a prominent source of > > > > > such design failures. > > > > > > > > Super fast storages are coming which will make us regret to make the > > > > IO path over complex. Spinning disks are not going away anytime soon. > > > > I doubt Google is willing to afford the disk seek costs on its > > > > millions of disks and has the patience to wait until switching all of > > > > the spin disks to SSD years later (if it will ever happen). > > > > > > This is new. Let's keep the damn employer out of the discussion. > > > While the area I work on is affected by my employment (writeback isn't > > > even my area BTW), I'm not gonna do something adverse to upstream even > > > if it's beneficial to google and I'm much more likely to do something > > > which may hurt google a bit if it's gonna benefit upstream. 
> > > > > > As for the faster / newer storage argument, that is *exactly* why we > > > want to keep the layering proper. Writeback works from the pressure > > > from the IO stack. If IO technology changes, we update the IO stack > > > and writeback still works from the pressure. It may need to be > > > adjusted but the principles don't change. > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > It's always there doing 1:1 proportional throttling. Then you try to > > kick in to add *double* throttling in block/cfq layer. Now the low > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > from its balanced state, leading to large fluctuations and program > > stalls. This can be avoided by telling balance_dirty_pages(): "your > > balance goal is no longer 1:1, but 10:1". With this information > > balance_dirty_pages() will behave right. Then there is the question: > > if balance_dirty_pages() will work just well provided the information, > > why bother doing the throttling at low layer and "push back" the > > pressure all the way up? > Fengguang, maybe we should first agree on some basics: > The two main goals of balance_dirty_pages() are (and always have been > AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages > in memory to allow for efficient writeback. Secondary goals are to also > keep amount of dirty pages somewhat fair among bdis and processes. Agreed? Agreed. In fact, before the IO-less change, balance_dirty_pages() had no much explicit control over the dirty rate and fairness. > Thus shift to trying to control *IO throughput* (or even just buffered > write throughput) from balance_dirty_pages() is a fundamental shift in the > goals of balance_dirty_pages(), not just some tweak (although technically, > it might be relatively easy to do for buffered writes given the current > implementation). Yes, it has been a bit shift to the rate based dirty control. > ... 
> > > Well, I tried and I hope some of it got through. I also wrote a lot > > > of questions, mainly regarding how what you have in mind is supposed > > > to work through what path. Maybe I'm just not seeing what you're > > > seeing but I just can't see where all the IOs would go through and > > > come together. Can you please elaborate more on that? > > > > What I can see is, it looks pretty simple and nature to let > > balance_dirty_pages() fill the gap towards a total solution :-) > > > > - add direct IO accounting in some convenient point of the IO path > > IO submission or completion point, either is fine. > > > > - change several lines of the buffered write IO controller to > > integrate the direct IO rate into the formula to fit the "total > > IO" limit > > > > - in future, add more accounting as well as feedback control to make > > balance_dirty_pages() work with IOPS and disk time > Sorry Fengguang but I also think this is a wrong way to go. > balance_dirty_pages() must primarily control the amount of dirty pages. > Trying to bend it to control IO throughput by including direct IO and > reads in the accounting will just make the logic even more complex than it > already is. Right, I have been adding too much complexity to balance_dirty_pages(). The control algorithms are pretty hard to understand and get right for all cases. OK, I'll post results of my experiments up to now, answer some questions and take a comfortable break. Phooo.. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-18 7:58 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-18 7:58 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote: > On Fri 06-04-12 02:59:34, Wu Fengguang wrote: > ... > > > > > Let's please keep the layering clear. IO limitations will be applied > > > > > at the block layer and pressure will be formed there and then > > > > > propagated upwards eventually to the originator. Sure, exposing the > > > > > whole information might result in better behavior for certain > > > > > workloads, but down the road, say, in three or five years, devices > > > > > which can be shared without worrying too much about seeks might be > > > > > commonplace and we could be swearing at a disgusting structural mess, > > > > > and sadly various cgroup support seems to be a prominent source of > > > > > such design failures. > > > > > > > > Super fast storages are coming which will make us regret to make the > > > > IO path over complex. Spinning disks are not going away anytime soon. > > > > I doubt Google is willing to afford the disk seek costs on its > > > > millions of disks and has the patience to wait until switching all of > > > > the spin disks to SSD years later (if it will ever happen). > > > > > > This is new. Let's keep the damn employer out of the discussion. > > > While the area I work on is affected by my employment (writeback isn't > > > even my area BTW), I'm not gonna do something adverse to upstream even > > > if it's beneficial to google and I'm much more likely to do something > > > which may hurt google a bit if it's gonna benefit upstream. > > > > > > As for the faster / newer storage argument, that is *exactly* why we > > > want to keep the layering proper. 
Writeback works from the pressure > > > from the IO stack. If IO technology changes, we update the IO stack > > > and writeback still works from the pressure. It may need to be > > > adjusted but the principles don't change. > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > It's always there doing 1:1 proportional throttling. Then you try to > > kick in to add *double* throttling in block/cfq layer. Now the low > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > from its balanced state, leading to large fluctuations and program > > stalls. This can be avoided by telling balance_dirty_pages(): "your > > balance goal is no longer 1:1, but 10:1". With this information > > balance_dirty_pages() will behave right. Then there is the question: > > if balance_dirty_pages() will work just well provided the information, > > why bother doing the throttling at low layer and "push back" the > > pressure all the way up? > Fengguang, maybe we should first agree on some basics: > The two main goals of balance_dirty_pages() are (and always have been > AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages > in memory to allow for efficient writeback. Secondary goals are to also > keep amount of dirty pages somewhat fair among bdis and processes. Agreed? Agreed. In fact, before the IO-less change, balance_dirty_pages() had no much explicit control over the dirty rate and fairness. > Thus shift to trying to control *IO throughput* (or even just buffered > write throughput) from balance_dirty_pages() is a fundamental shift in the > goals of balance_dirty_pages(), not just some tweak (although technically, > it might be relatively easy to do for buffered writes given the current > implementation). Yes, it has been a bit shift to the rate based dirty control. > ... > > > Well, I tried and I hope some of it got through. 
I also wrote a lot > > > of questions, mainly regarding how what you have in mind is supposed > > > to work through what path. Maybe I'm just not seeing what you're > > > seeing but I just can't see where all the IOs would go through and > > > come together. Can you please elaborate more on that? > > > > What I can see is, it looks pretty simple and nature to let > > balance_dirty_pages() fill the gap towards a total solution :-) > > > > - add direct IO accounting in some convenient point of the IO path > > IO submission or completion point, either is fine. > > > > - change several lines of the buffered write IO controller to > > integrate the direct IO rate into the formula to fit the "total > > IO" limit > > > > - in future, add more accounting as well as feedback control to make > > balance_dirty_pages() work with IOPS and disk time > Sorry Fengguang but I also think this is a wrong way to go. > balance_dirty_pages() must primarily control the amount of dirty pages. > Trying to bend it to control IO throughput by including direct IO and > reads in the accounting will just make the logic even more complex than it > already is. Right, I have been adding too much complexity to balance_dirty_pages(). The control algorithms are pretty hard to understand and get right for all cases. OK, I'll post results of my experiments up to now, answer some questions and take a comfortable break. Phooo.. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup @ 2012-04-18 7:58 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-18 7:58 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, vgoyal-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k, andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote: > On Fri 06-04-12 02:59:34, Wu Fengguang wrote: > ... > > > > > Let's please keep the layering clear. IO limitations will be applied > > > > > at the block layer and pressure will be formed there and then > > > > > propagated upwards eventually to the originator. Sure, exposing the > > > > > whole information might result in better behavior for certain > > > > > workloads, but down the road, say, in three or five years, devices > > > > > which can be shared without worrying too much about seeks might be > > > > > commonplace and we could be swearing at a disgusting structural mess, > > > > > and sadly various cgroup support seems to be a prominent source of > > > > > such design failures. > > > > > > > > Super fast storages are coming which will make us regret to make the > > > > IO path over complex. Spinning disks are not going away anytime soon. > > > > I doubt Google is willing to afford the disk seek costs on its > > > > millions of disks and has the patience to wait until switching all of > > > > the spin disks to SSD years later (if it will ever happen). > > > > > > This is new. Let's keep the damn employer out of the discussion. 
> > > While the area I work on is affected by my employment (writeback isn't > > > even my area BTW), I'm not gonna do something adverse to upstream even > > > if it's beneficial to google and I'm much more likely to do something > > > which may hurt google a bit if it's gonna benefit upstream. > > > > > > As for the faster / newer storage argument, that is *exactly* why we > > > want to keep the layering proper. Writeback works from the pressure > > > from the IO stack. If IO technology changes, we update the IO stack > > > and writeback still works from the pressure. It may need to be > > > adjusted but the principles don't change. > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > It's always there doing 1:1 proportional throttling. Then you try to > > kick in to add *double* throttling in block/cfq layer. Now the low > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > from its balanced state, leading to large fluctuations and program > > stalls. This can be avoided by telling balance_dirty_pages(): "your > > balance goal is no longer 1:1, but 10:1". With this information > > balance_dirty_pages() will behave right. Then there is the question: > > if balance_dirty_pages() will work just well provided the information, > > why bother doing the throttling at low layer and "push back" the > > pressure all the way up? > Fengguang, maybe we should first agree on some basics: > The two main goals of balance_dirty_pages() are (and always have been > AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages > in memory to allow for efficient writeback. Secondary goals are to also > keep amount of dirty pages somewhat fair among bdis and processes. Agreed? Agreed. In fact, before the IO-less change, balance_dirty_pages() had no much explicit control over the dirty rate and fairness. 
> Thus shift to trying to control *IO throughput* (or even just buffered > write throughput) from balance_dirty_pages() is a fundamental shift in the > goals of balance_dirty_pages(), not just some tweak (although technically, > it might be relatively easy to do for buffered writes given the current > implementation). Yes, it has been a bit shift to the rate based dirty control. > ... > > > Well, I tried and I hope some of it got through. I also wrote a lot > > > of questions, mainly regarding how what you have in mind is supposed > > > to work through what path. Maybe I'm just not seeing what you're > > > seeing but I just can't see where all the IOs would go through and > > > come together. Can you please elaborate more on that? > > > > What I can see is, it looks pretty simple and nature to let > > balance_dirty_pages() fill the gap towards a total solution :-) > > > > - add direct IO accounting in some convenient point of the IO path > > IO submission or completion point, either is fine. > > > > - change several lines of the buffered write IO controller to > > integrate the direct IO rate into the formula to fit the "total > > IO" limit > > > > - in future, add more accounting as well as feedback control to make > > balance_dirty_pages() work with IOPS and disk time > Sorry Fengguang but I also think this is a wrong way to go. > balance_dirty_pages() must primarily control the amount of dirty pages. > Trying to bend it to control IO throughput by including direct IO and > reads in the accounting will just make the logic even more complex than it > already is. Right, I have been adding too much complexity to balance_dirty_pages(). The control algorithms are pretty hard to understand and get right for all cases. OK, I'll post results of my experiments up to now, answer some questions and take a comfortable break. Phooo.. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
[parent not found: <20120418065720.GA21485-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>]
* Re: [RFC] writeback and cgroup [not found] ` <20120418065720.GA21485-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> @ 2012-04-18 7:58 ` Fengguang Wu 0 siblings, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-18 7:58 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote: > On Fri 06-04-12 02:59:34, Wu Fengguang wrote: > ... > > > > > Let's please keep the layering clear. IO limitations will be applied > > > > > at the block layer and pressure will be formed there and then > > > > > propagated upwards eventually to the originator. Sure, exposing the > > > > > whole information might result in better behavior for certain > > > > > workloads, but down the road, say, in three or five years, devices > > > > > which can be shared without worrying too much about seeks might be > > > > > commonplace and we could be swearing at a disgusting structural mess, > > > > > and sadly various cgroup support seems to be a prominent source of > > > > > such design failures. > > > > > > > > Super fast storages are coming which will make us regret to make the > > > > IO path over complex. Spinning disks are not going away anytime soon. > > > > I doubt Google is willing to afford the disk seek costs on its > > > > millions of disks and has the patience to wait until switching all of > > > > the spin disks to SSD years later (if it will ever happen). > > > > > > This is new. Let's keep the damn employer out of the discussion. 
> > > While the area I work on is affected by my employment (writeback isn't > > > even my area BTW), I'm not gonna do something adverse to upstream even > > > if it's beneficial to google and I'm much more likely to do something > > > which may hurt google a bit if it's gonna benefit upstream. > > > > > > As for the faster / newer storage argument, that is *exactly* why we > > > want to keep the layering proper. Writeback works from the pressure > > > from the IO stack. If IO technology changes, we update the IO stack > > > and writeback still works from the pressure. It may need to be > > > adjusted but the principles don't change. > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > > It's always there doing 1:1 proportional throttling. Then you try to > > kick in to add *double* throttling in block/cfq layer. Now the low > > layer may enforce 10:1 throttling and push balance_dirty_pages() away > > from its balanced state, leading to large fluctuations and program > > stalls. This can be avoided by telling balance_dirty_pages(): "your > > balance goal is no longer 1:1, but 10:1". With this information > > balance_dirty_pages() will behave right. Then there is the question: > > if balance_dirty_pages() will work just well provided the information, > > why bother doing the throttling at low layer and "push back" the > > pressure all the way up? > Fengguang, maybe we should first agree on some basics: > The two main goals of balance_dirty_pages() are (and always have been > AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages > in memory to allow for efficient writeback. Secondary goals are to also > keep amount of dirty pages somewhat fair among bdis and processes. Agreed? Agreed. In fact, before the IO-less change, balance_dirty_pages() had little explicit control over the dirty rate and fairness. 
> Thus shift to trying to control *IO throughput* (or even just buffered > write throughput) from balance_dirty_pages() is a fundamental shift in the > goals of balance_dirty_pages(), not just some tweak (although technically, > it might be relatively easy to do for buffered writes given the current > implementation). Yes, it has been a big shift to the rate-based dirty control. > ... > > > Well, I tried and I hope some of it got through. I also wrote a lot > > > of questions, mainly regarding how what you have in mind is supposed > > > to work through what path. Maybe I'm just not seeing what you're > > > seeing but I just can't see where all the IOs would go through and > > > come together. Can you please elaborate more on that? > > > > What I can see is, it looks pretty simple and nature to let > > balance_dirty_pages() fill the gap towards a total solution :-) > > > > - add direct IO accounting in some convenient point of the IO path > > IO submission or completion point, either is fine. > > > > - change several lines of the buffered write IO controller to > > integrate the direct IO rate into the formula to fit the "total > > IO" limit > > > > - in future, add more accounting as well as feedback control to make > > balance_dirty_pages() work with IOPS and disk time > Sorry Fengguang but I also think this is a wrong way to go. > balance_dirty_pages() must primarily control the amount of dirty pages. > Trying to bend it to control IO throughput by including direct IO and > reads in the accounting will just make the logic even more complex than it > already is. Right, I have been adding too much complexity to balance_dirty_pages(). The control algorithms are pretty hard to understand and get right for all cases. OK, I'll post results of my experiments up to now, answer some questions and take a comfortable break. Phooo.. Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-06 9:59 ` Fengguang Wu ` (4 preceding siblings ...) (?) @ 2012-04-18 6:57 ` Jan Kara -1 siblings, 0 replies; 261+ messages in thread From: Jan Kara @ 2012-04-18 6:57 UTC (permalink / raw) To: Fengguang Wu Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA On Fri 06-04-12 02:59:34, Wu Fengguang wrote: ... > > > > Let's please keep the layering clear. IO limitations will be applied > > > > at the block layer and pressure will be formed there and then > > > > propagated upwards eventually to the originator. Sure, exposing the > > > > whole information might result in better behavior for certain > > > > workloads, but down the road, say, in three or five years, devices > > > > which can be shared without worrying too much about seeks might be > > > > commonplace and we could be swearing at a disgusting structural mess, > > > > and sadly various cgroup support seems to be a prominent source of > > > > such design failures. > > > > > > Super fast storages are coming which will make us regret to make the > > > IO path over complex. Spinning disks are not going away anytime soon. > > > I doubt Google is willing to afford the disk seek costs on its > > > millions of disks and has the patience to wait until switching all of > > > the spin disks to SSD years later (if it will ever happen). > > > > This is new. Let's keep the damn employer out of the discussion. 
> > While the area I work on is affected by my employment (writeback isn't > > even my area BTW), I'm not gonna do something adverse to upstream even > > if it's beneficial to google and I'm much more likely to do something > > which may hurt google a bit if it's gonna benefit upstream. > > > > As for the faster / newer storage argument, that is *exactly* why we > > want to keep the layering proper. Writeback works from the pressure > > from the IO stack. If IO technology changes, we update the IO stack > > and writeback still works from the pressure. It may need to be > > adjusted but the principles don't change. > > To me, balance_dirty_pages() is *the* proper layer for buffered writes. > It's always there doing 1:1 proportional throttling. Then you try to > kick in to add *double* throttling in block/cfq layer. Now the low > layer may enforce 10:1 throttling and push balance_dirty_pages() away > from its balanced state, leading to large fluctuations and program > stalls. This can be avoided by telling balance_dirty_pages(): "your > balance goal is no longer 1:1, but 10:1". With this information > balance_dirty_pages() will behave right. Then there is the question: > if balance_dirty_pages() will work just well provided the information, > why bother doing the throttling at low layer and "push back" the > pressure all the way up? Fengguang, maybe we should first agree on some basics: The two main goals of balance_dirty_pages() are (and always have been AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages in memory to allow for efficient writeback. Secondary goals are to also keep amount of dirty pages somewhat fair among bdis and processes. Agreed? 
Thus the shift to trying to control *IO throughput* (or even just buffered write throughput) from balance_dirty_pages() is a fundamental shift in the goals of balance_dirty_pages(), not just some tweak (although technically, it might be relatively easy to do for buffered writes given the current implementation). ... > > Well, I tried and I hope some of it got through. I also wrote a lot > > of questions, mainly regarding how what you have in mind is supposed > > to work through what path. Maybe I'm just not seeing what you're > > seeing but I just can't see where all the IOs would go through and > > come together. Can you please elaborate more on that? > > What I can see is, it looks pretty simple and nature to let > balance_dirty_pages() fill the gap towards a total solution :-) > > - add direct IO accounting in some convenient point of the IO path > IO submission or completion point, either is fine. > > - change several lines of the buffered write IO controller to > integrate the direct IO rate into the formula to fit the "total > IO" limit > > - in future, add more accounting as well as feedback control to make > balance_dirty_pages() work with IOPS and disk time Sorry Fengguang but I also think this is a wrong way to go. balance_dirty_pages() must primarily control the amount of dirty pages. Trying to bend it to control IO throughput by including direct IO and reads in the accounting will just make the logic even more complex than it already is. Honza ^ permalink raw reply [flat|nested] 261+ messages in thread
[parent not found: <20120404193355.GD29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>]
* Re: [RFC] writeback and cgroup 2012-04-04 19:33 ` Tejun Heo (?) @ 2012-04-04 20:18 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-04 20:18 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara, rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote: > Hey, Fengguang. > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > > Yeah it should be trivial to apply the balance_dirty_pages() > > throttling algorithm to the read/direct IOs. However up to now I don't > > see much added value to *duplicate* the current block IO controller > > functionalities, assuming the current users and developers are happy > > with it. > > Heh, trust me. It's half broken and people ain't happy. I get that > your algorithm can be updatd to consider all IOs and I believe that > but what I don't get is how would such information get to writeback > and in turn how writeback would enforce the result on reads and direct > IOs. Through what path? Will all reads and direct IOs travel through > balance_dirty_pages() even direct IOs on raw block devices? Or would > the writeback algorithm take the configuration from cfq, apply the > algorithm and give back the limits to enforce to cfq? If the latter, > isn't that at least somewhat messed up? I think he wanted to get the configuration with the help of the blkcg interface and just implement those policies up there without any further interaction with CFQ or lower layers. [..] 
> > The sweet split point would be for balance_dirty_pages() to do cgroup > > aware buffered write throttling and leave other IOs to the current > > blkcg. For this to work well as a total solution for end users, I hope > > we can cooperate and figure out ways for the two throttling entities > > to work well with each other. > > There's where I'm confused. How is the said split supposed to work? > They aren't independent. I mean, who gets to decide what and where > are those decisions enforced? As you said, the split is just a temporary gap filler in the absence of a good solution for throttling buffered writes (which is often a source of problems for sync IO latencies). So with this solution one could independently control the buffered write rate of a cgroup. Lower layers will not throttle that traffic again as it would show up in the root cgroup. Hence blkcg and writeback need not communicate much as such, except for configuration knobs and possibly for some stats. [..] > > - running concurrent flusher threads for cgroups, which adds back the > > disk seeks and lock contentions. And still has problems with sync > > and shared inodes. > Or, export the notion of per group per bdi congestion, and the flusher does not try to submit IO from an inode if the device is congested. That way the flusher will not get blocked and we don't have to create one flusher thread per cgroup, and can be happy with one flusher per bdi. And with the compromise of one inode belonging to one cgroup, we will still dispatch a bunch of IO from one inode and then move to the next. Depending on the size of the chunk we can reduce the seeks a bit. The size of the quantum will decide the tradeoff between seeks and fairness of writes from inodes. [..] > > - the mess of metadata handling > > Does throttling from writeback actually solve this problem? What > about fsync()? Does that already go through balance_dirty_pages()? 
By throttling the process at the time of dirtying memory, you admit only as much IO from the process as the limits allow. Now fsync() has to send only those pages to the disk and does not have to be throttled again. So throttling the process while you are admitting IO avoids these issues with filesystem metadata. But at the same time it does not feel right to throttle reads and AIO synchronously. The current behavior of the kernel queuing up the bio and throttling it asynchronously is desirable. Only buffered write is a special case, as we anyway throttle it actively based on the amount of dirty memory. [..] > > > - unnecessarily coupled with memcg, in order to take advantage of the > > per-memcg dirty limits for balance_dirty_pages() to actually convert > > the "pushed back" dirty pages pressure into lowered dirty rate. Why > > the hell the users *have to* setup memcg (suffering from all the > > inconvenience and overheads) in order to do IO throttling? Please, > > this is really ugly! And the "back pressure" may constantly push the > > memcg dirty pages to the limits. I'm not going to support *miss use* > > of per-memcg dirty limits like this! > > Writeback sits between blkcg and memcg and it indeed can be hairy to > consider both sides especially given the current sorry complex state > of cgroup and I can see why it would seem tempting to add a separate > controller or at least knobs to support that. That said, I *think* > given that memcg controls all other memory parameters it probably > would make most sense giving that parameter to memcg too. I don't > think this is really relevant to this discussion tho. Who owns > dirty_limits is a separate issue. I agree that dirty_limit control is more closely related to memcg than to blkcg, as it is all about writing to memory and that's the resource controlled by memcg. I think Fengguang wanted to keep those knobs in blkcg as he thinks that in the writeback logic he can actively throttle readers and direct IO too. 
But that sounds a little messy to me too. Hey, how about reconsidering my other proposal, for which I had posted the patches? That is, keep throttling at the device level. Reads and direct IO get throttled asynchronously but buffered writes get throttled synchronously. Advantages of this scheme. - There are no separate knobs. - All the IO (read, direct IO and buffered write) is controlled using the same set of knobs and goes into the queue of the same cgroup. - Writeback logic has no knowledge of throttling. It just invokes a hook into the throttling logic of the device queue. I guess this is a hybrid of active writeback throttling and the back pressure mechanism. But it still does not solve the NFS issue, and for direct IO filesystems can still get serialized, so the metadata issue still needs to be resolved. So one can argue: why not go for the full "back pressure" method, despite it being more complex? Here is the link, just to refresh the memory. Something to keep in mind while assessing alternatives. https://lkml.org/lkml/2011/6/28/243 Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-04 20:18 ` Vivek Goyal @ 2012-04-05 16:31 ` Tejun Heo -1 siblings, 0 replies; 261+ messages in thread From: Tejun Heo @ 2012-04-05 16:31 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf Hey, Vivek. On Wed, Apr 04, 2012 at 04:18:16PM -0400, Vivek Goyal wrote: > Hey how about reconsidering my other proposal for which I had posted > the patches. And that is keep throttling still at device level. Reads > and direct IO get throttled asynchronously but buffered writes get > throttled synchronously. > > Advantages of this scheme. > > - There are no separate knobs. > > - All the IO (read, direct IO and buffered write) is controlled using > same set of knobs and goes in queue of same cgroup. > > - Writeback logic has no knowledge of throttling. It just invokes a > hook into throttling logic of device queue. > > I guess this is a hybrid of active writeback throttling and back pressure > mechanism. > > But it still does not solve the NFS issue as well as for direct IO, > filesystems still can get serialized, so metadata issue still needs to > be resolved. So one can argue that why not go for full "back pressure" > method, despite it being more complex. > > Here is the link, just to refresh the memory. Something to keep in mind > while assessing alternatives. > > https://lkml.org/lkml/2011/6/28/243 Hmmm... so, this only works for blk-throttle and not with the weight. How do you manage interaction between buffered writes and direct writes for the same cgroup? Thanks. -- tejun ^ permalink raw reply [flat|nested] 261+ messages in thread
* Re: [RFC] writeback and cgroup 2012-04-05 16:31 ` Tejun Heo @ 2012-04-05 17:09 ` Vivek Goyal -1 siblings, 0 replies; 261+ messages in thread From: Vivek Goyal @ 2012-04-05 17:09 UTC (permalink / raw) To: Tejun Heo Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni, lsf On Thu, Apr 05, 2012 at 09:31:13AM -0700, Tejun Heo wrote: > Hey, Vivek. > > On Wed, Apr 04, 2012 at 04:18:16PM -0400, Vivek Goyal wrote: > > Hey how about reconsidering my other proposal for which I had posted > > the patches. And that is keep throttling still at device level. Reads > > and direct IO get throttled asynchronously but buffered writes get > > throttled synchronously. > > > > Advantages of this scheme. > > > > - There are no separate knobs. > > > > - All the IO (read, direct IO and buffered write) is controlled using > > same set of knobs and goes in queue of same cgroup. > > > > - Writeback logic has no knowledge of throttling. It just invokes a > > hook into throttling logic of device queue. > > > > I guess this is a hybrid of active writeback throttling and back pressure > > mechanism. > > > > But it still does not solve the NFS issue as well as for direct IO, > > filesystems still can get serialized, so metadata issue still needs to > > be resolved. So one can argue that why not go for full "back pressure" > > method, despite it being more complex. > > > > Here is the link, just to refresh the memory. Something to keep in mind > > while assessing alternatives. > > > > https://lkml.org/lkml/2011/6/28/243 > > Hmmm... so, this only works for blk-throttle and not with the weight. > How do you manage interaction between buffered writes and direct > writes for the same cgroup? > Yes, it is only for blk-throttle. We just account for buffered write in balance_dirty_pages() instead of when they are actually submitted to device by flusher thread. IIRC, I just had two queues. 
One queue held bios and the other held tasks along with information about how much memory they were dirtying. Dispatch then round-robined between the two queues according to the throttling rate: dispatch a bio from the direct IO queue, then look at the other queue, see how much IO the waiting task wanted to do, and once sufficient time had passed at the throttling rate, remove that task from the wait queue and wake it up. That way it becomes equivalent to two IO paths (direct IO + buffered write) doing IO into a single pipe with one throttling limit. Both kinds of IO are subjected to the same common limit (no split); we just round-robin between the two types of IO and try to divide the available bandwidth equally (this could of course be made tunable). Thanks Vivek ^ permalink raw reply [flat|nested] 261+ messages in thread
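[Editor's sketch] The two-queue scheme Vivek describes can be modeled as a toy userspace Python simulation. This is not kernel code: the class name `ThrottleGroup` and the bookkeeping details are invented for illustration (the actual patches queued bios and sleeping tasks inside blk-throttle), but it shows the core idea — direct-IO bios and buffered-write dirtiers drawing round-robin from one shared bytes-per-second budget.

```python
import collections

class ThrottleGroup:
    """Toy model of the two-queue scheme: one shared bps limit per
    cgroup, a queue of direct-IO bios and a queue of buffered-write
    dirtiers, served round-robin against the same budget."""

    def __init__(self, bps_limit):
        self.bps_limit = bps_limit            # bytes per second, shared
        self.bio_q = collections.deque()      # (name, nbytes) direct-IO bios
        self.dirty_q = collections.deque()    # (task, nbytes) waiting dirtiers
        self.clock = 0.0                      # virtual time, seconds

    def submit_bio(self, name, nbytes):
        self.bio_q.append((name, nbytes))

    def dirty_pages(self, task, nbytes):
        # Buffered writers park here (balance_dirty_pages() would put
        # the task to sleep) with the amount they want to dirty.
        self.dirty_q.append((task, nbytes))

    def _charge(self, nbytes):
        # Advance virtual time so the total dispatched bytes never exceed
        # bps_limit * elapsed time: one common limit, no per-type split.
        self.clock += nbytes / self.bps_limit

    def run(self):
        """Round-robin dispatch; returns a (who, nbytes, finish_time) log."""
        log = []
        while self.bio_q or self.dirty_q:
            for q in (self.bio_q, self.dirty_q):
                if q:
                    who, nbytes = q.popleft()
                    self._charge(nbytes)
                    log.append((who, nbytes, round(self.clock, 3)))
        return log
```

With a 1 MB/s limit, a 512 KB direct bio and a 512 KB buffered dirtier finish at t=0.5s and t=1.0s respectively — both paths are "doing IO to a single pipe", exactly the equivalence claimed above.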
* Re: [RFC] writeback and cgroup [not found] ` <20120404193355.GD29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org> 2012-04-04 20:18 ` Vivek Goyal @ 2012-04-06 9:59 ` Fengguang Wu 1 sibling, 0 replies; 261+ messages in thread From: Fengguang Wu @ 2012-04-06 9:59 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers, linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups, vgoyal Hi Tejun, On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote: > Hey, Fengguang. > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > > Yeah it should be trivial to apply the balance_dirty_pages() > > throttling algorithm to the read/direct IOs. However up to now I don't > > see much added value to *duplicate* the current block IO controller > > functionalities, assuming the current users and developers are happy > > with it. > > Heh, trust me. It's half broken and people ain't happy. I get that Yeah, although the balance_dirty_pages() IO controller for buffered writes looks perfect in itself, it's not enough to meet user demands. The user expectation should be: hey, please throttle *all* IOs from this cgroup to this amount, either in absolute bps/iops limits or in some proportional weight value (or both, with whichever is lower taking effect). And if necessary, he may request further limits/weights for each type of IO inside the cgroup. Now the blkio cgroup supports direct IO and the balance_dirty_pages() IO controller supports buffered writes. They provide limits/weights for either direct IO or buffered writes, which is fine if the workload is pure direct IO or pure buffered write. 
For the common mixed IO workloads, it's obviously not enough. Fortunately, judging from the block/cfq IO controller code, the above gap can be filled easily. By adding some direct IO accounting and changing several lines of my patches to make use of the collected stats, the semantics of the blkio.throttle.write_bps interfaces can be changed from "limit for direct IO" to "limit for direct+buffered IOs". Ditto for blkio.weight and blkio.write_iops, as long as some iops/device-time stats are made available to balance_dirty_pages(). It would be a fairly *easy* change. :-) It's merely adding some accounting code; there is no need to change the block IO controlling algorithm at all. I'll do the work of accounting (which is basically independent of the IO controlling) and use the new stats in balance_dirty_pages(). The only problem I can see now is that balance_dirty_pages() works per-bdi while blkcg works per-device, so the two ends may not match nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc, where sdb is shared by lv0 and lv1. However, such situations should be rare, and much more acceptable than the problems arising from the "push back" approach, which impacts everyone. > your algorithm can be updated to consider all IOs and I believe that > but what I don't get is how would such information get to writeback > and in turn how writeback would enforce the result on reads and direct > IOs. Through what path? Will all reads and direct IOs travel through > balance_dirty_pages(), even direct IOs on raw block devices? Or would > the writeback algorithm take the configuration from cfq, apply the > algorithm and give back the limits to enforce to cfq? If the latter, > isn't that at least somewhat messed up? cfq is working well and doesn't need any modifications. Let's just make balance_dirty_pages() cgroup aware and fill the gap in the current block IO controller. 
If the balance_dirty_pages() throttling algorithms are ever applied to reads and direct IOs, it will be for NFS, CIFS etc. Even for them, there may be better throttling choices. For example, Trond mentioned the RPC layer to me during the summit. > > I did the buffered write IO controller mainly to fill the gap. If I > > happen to stand in your way, sorry that's not my initial intention. > > No, no, it's not about standing in my way. As Vivek said in the other > reply, it's that the "gap" that you filled was created *because* > writeback wasn't cgroup aware and now you're in turn filling that gap > by making writeback work around that "gap". I mean, my mind boggles. > Doesn't yours? I strongly believe everyone's should. Heh. It's a hard problem indeed. I felt great pains in the IO-less dirty throttling work. I did a lot of reasoning about it, and have in fact kept the cgroup IO controller in mind since its early days. Now I'd say it is hands down the right place to adapt to the gap between the total IO limit and what's carried out by the block IO controller. > > It's a pity and surprise that Google as a big user does not buy in > > this simple solution. You may prefer more comprehensive controls which > > may not be easily achievable with the simple scheme. However the > > complexities and overheads involved in throttling the flusher IOs > > really upsets me. > > Heh, believe it or not, I'm not really wearing google hat on this > subject and google's writeback people may have completely different > opinions on the subject than mine. In fact, I'm not even sure how > much "work" time I'll be able to assign to this. :( OK, understood. > > The sweet split point would be for balance_dirty_pages() to do cgroup > > aware buffered write throttling and leave other IOs to the current > > blkcg. For this to work well as a total solution for end users, I hope > > we can cooperate and figure out ways for the two throttling entities > > to work well with each other. 
> > There's where I'm confused. How is the said split supposed to work? > They aren't independent. I mean, who gets to decide what and where > are those decisions enforced? Yeah it's not independent. It's about - keep block IO cgroup untouched (in its current algorithm, for throttling direct IO) - let balance_dirty_pages() adapt to the throttling target buffered_write_limit = total_limit - direct_IOs > > What I'm interested is, what's Google and other users' use schemes in > > practice. What's their desired interfaces. Whether and how the > > combined bdp+blkcg throttling can fulfill the goals. > > I'm not too privy of mm and writeback in google and even if so I > probably shouldn't talk too much about it. Confidentiality and all. > That said, I have the general feeling that goog already figured out > how to at least work around the existing implementation and would be > able to continue no matter how upstream development fans out. > > That said, wearing the cgroup maintainer and general kernel > contributor hat, I'd really like to avoid another design mess up. To me it looks a pretty clean split and find it to be an easy solution (after sorting it out the hard way). I'll show the code and test results after some time. > > > Let's please keep the layering clear. IO limitations will be applied > > > at the block layer and pressure will be formed there and then > > > propagated upwards eventually to the originator. Sure, exposing the > > > whole information might result in better behavior for certain > > > workloads, but down the road, say, in three or five years, devices > > > which can be shared without worrying too much about seeks might be > > > commonplace and we could be swearing at a disgusting structural mess, > > > and sadly various cgroup support seems to be a prominent source of > > > such design failures. > > > > Super fast storages are coming which will make us regret to make the > > IO path over complex. Spinning disks are not going away anytime soon. 
> > I doubt Google is willing to afford the disk seek costs on its > > millions of disks and has the patience to wait until switching all of > > the spin disks to SSD years later (if it will ever happen). > > This is new. Let's keep the damn employer out of the discussion. > While the area I work on is affected by my employment (writeback isn't > even my area BTW), I'm not gonna do something adverse to upstream even > if it's beneficial to google and I'm much more likely to do something > which may hurt google a bit if it's gonna benefit upstream. > > As for the faster / newer storage argument, that is *exactly* why we > want to keep the layering proper. Writeback works from the pressure > from the IO stack. If IO technology changes, we update the IO stack > and writeback still works from the pressure. It may need to be > adjusted but the principles don't change. To me, balance_dirty_pages() is *the* proper layer for buffered writes. It's always there doing 1:1 proportional throttling. Then you try to kick in to add *double* throttling in block/cfq layer. Now the low layer may enforce 10:1 throttling and push balance_dirty_pages() away from its balanced state, leading to large fluctuations and program stalls. This can be avoided by telling balance_dirty_pages(): "your balance goal is no longer 1:1, but 10:1". With this information balance_dirty_pages() will behave right. Then there is the question: if balance_dirty_pages() will work just well provided the information, why bother doing the throttling at low layer and "push back" the pressure all the way up? > > It's obvious that your below proposal involves a lot of complexities, > > overheads, and will hurt performance. It basically involves > > Hmmm... that's not the impression I got from the discussion. > According to Jan, applying the current writeback logic to cgroup'fied > bdi shouldn't be too complex, no? 
In the sense of "avoidable" complexity :-) > > - running concurrent flusher threads for cgroups, which adds back the > > disk seeks and lock contentions. And still has problems with sync > > and shared inodes. > > I agree this is an actual concern but if the user wants to split one > spindle to multiple resource domains, there's gonna be considerable > amount of overhead no matter what. If you want to improve how block > layer handles the split, you're welcome to dive into the block layer, > where the split is made, and improve it. > > > - splitting device queue for cgroups, possibly scaling up the pool of > > writeback pages (and locked pages in the case of stable pages) which > > could stall random processes in the system > > Sure, it'll take up more buffering and memory but that's the overhead > of the cgroup business. I want it to be less intrusive at the cost of > somewhat more resource consumption. ie. I don't want writeback logic > itself deeply involved in block IO cgroup enforcement even if that > means somewhat less efficient resource usage. The balance_dirty_pages() is already deeply involved in dirty throttling. As you can see from this patchset, the same algorithms can be extended trivially to work with cgroup IO limits. buffered write IO controller in balance_dirty_pages() https://lkml.org/lkml/2012/3/28/275 It does not require forking off the flusher threads and splitting up the IO queue at all. > > - the mess of metadata handling > > Does throttling from writeback actually solve this problem? What > about fsync()? Does that already go through balance_dirty_pages()? balance_dirty_pages() does throttling at safe points outside of fs transactions/locks. fsync() only submits IO for already dirtied pages and won't be throttled by balance_dirty_pages(). Throttling happens at earlier times when the task is dirtying the pages. 
> > - unnecessarily coupled with memcg, in order to take advantage of the > > per-memcg dirty limits for balance_dirty_pages() to actually convert > > the "pushed back" dirty pages pressure into lowered dirty rate. Why > > the hell do the users *have to* set up memcg (suffering from all the > > inconvenience and overheads) in order to do IO throttling? Please, > > this is really ugly! And the "back pressure" may constantly push the > > memcg dirty pages to the limits. I'm not going to support *misuse* > > of per-memcg dirty limits like this! > > Writeback sits between blkcg and memcg and it indeed can be hairy to > consider both sides, especially given the current sorry complex state > of cgroup, and I can see why it would seem tempting to add a separate > controller or at least knobs to support that. That said, I *think*, > given that memcg controls all other memory parameters, it probably > would make most sense giving that parameter to memcg too. I don't > think this is really relevant to this discussion tho. Who owns > dirty_limits is a separate issue. In the "back pressure" scheme, memcg is a must because only it has all the infrastructure to track dirty pages, upon which you can apply some dirty_limits. Don't tell me you want to account dirty pages in blkcg... > > I cannot believe you would keep overlooking all the problems without > > good reasons. Please do tell us the reasons that matter. > > Well, I tried and I hope some of it got through. I also wrote a lot > of questions, mainly regarding how what you have in mind is supposed > to work through what path. Maybe I'm just not seeing what you're > seeing but I just can't see where all the IOs would go through and > come together. Can you please elaborate more on that? What I can see is, it looks pretty simple and natural to let balance_dirty_pages() fill the gap towards a total solution :-) - add direct IO accounting at some convenient point of the IO path (IO submission or completion point, either is fine) 
- change several lines of the buffered write IO controller to integrate the direct IO rate into the formula, to fit the "total IO" limit

- in future, add more accounting as well as feedback control to make balance_dirty_pages() work with IOPS and disk time

Thanks, Fengguang ^ permalink raw reply [flat|nested] 261+ messages in thread
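[Editor's sketch] Fengguang's proposed split — buffered_write_limit = total_limit - direct_IOs — can be reduced to two small Python helpers. This is a deliberate simplification: the function names are invented, and the real balance_dirty_pages() uses smoothed bandwidth estimates and position control rather than a direct division, but the arithmetic of the "fill the gap" idea is this.

```python
def buffered_write_target(total_limit_bps, direct_io_bps):
    """Fengguang's split: buffered_write_limit = total_limit - direct_IOs.
    Whatever bandwidth direct IO does not consume becomes the budget
    that balance_dirty_pages() throttles buffered writers toward."""
    return max(0, total_limit_bps - direct_io_bps)

def dirtier_pause(dirtied_bytes, buffered_bps):
    """Seconds a task should sleep in balance_dirty_pages() so that its
    long-run dirtying rate matches the buffered-write budget."""
    if buffered_bps <= 0:
        return float("inf")   # direct IO ate the whole budget: block
    return dirtied_bytes / buffered_bps
```

For example, a cgroup limited to 10 MB/s total whose direct IO is measured at 4 MB/s leaves a 6 MB/s buffered-write target, and a task that just dirtied 3 MB would be paused for 0.5 s — no change to the block IO controlling algorithm, only new accounting feeding the existing throttle.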
end of thread, other threads:[~2012-04-25 15:47 UTC | newest] Thread overview: 261+ messages (download: mbox.gz / follow: Atom feed)
found] ` <20120404201816.GL12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-04-05 16:31 ` Tejun Heo 2012-04-06 9:59 ` Fengguang Wu
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.