* Re: [RFC] Making memcg track ownership per address_space or anon_vma
@ 2015-02-04 23:51 ` Greg Thelen
0 siblings, 0 replies; 74+ messages in thread
From: Greg Thelen @ 2015-02-04 23:51 UTC (permalink / raw)
To: Tejun Heo
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
On Wed, Feb 04 2015, Tejun Heo wrote:
> Hello,
>
> On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
>> If a machine has several top level memcg trying to get some form of
>> isolation (using low, min, soft limit) then a shared libc will be
>> moved to the root memcg where it's not protected from global memory
>> pressure. At least with the current per page accounting such shared
>> pages often land into some protected memcg.
>
> Yes, it becomes interesting with the low limit as the pressure
> direction is reversed; but at the same time, overcommitting low limits
> doesn't lead to a sane setup to begin with, as it's asking for global
> OOMs anyway. That means things like libc would end up competing at
> least fairly with other pages under global pressure and should stay in
> memory under most circumstances, which may or may not be sufficient.
I agree. Clarification... I don't plan to overcommit low or min limits.
On machines without overcommitted min limits, the existing system offers
some protection for shared libs from global reclaim. Pushing them to
root doesn't.
> Hmm.... need to think more about it but this only becomes a problem
> with the root cgroup because it doesn't have a min setting, which is
> expected to be inclusive of all descendants, right? Maybe the right
> thing to do here is treating the inodes which get pushed to the root
> as a special case and we can implement a mechanism where the root is
> effectively borrowing from the mins of its children which doesn't have
> to be completely correct - e.g. just charge it against all children
> repeatedly and if any has min protection, put it under min protection.
> IOW, make it the baseload for all of them.
I think the linux-next low (and the TBD min) limits also have the
problem for more than just the root memcg. I'm thinking of a 2M file
shared between C and D below. The file will be charged to common parent
B.
A
+-B (usage=2M lim=3M min=2M)
+-C (usage=0 lim=2M min=1M shared_usage=2M)
+-D (usage=0 lim=2M min=1M shared_usage=2M)
\-E (usage=0 lim=2M min=0)
The problem arises if A/B/E allocates more than 1M of private
reclaimable file data. This pushes A/B into reclaim which will reclaim
both the shared file from A/B and private file from A/B/E. In contrast,
the current per-page memcg would've protected the shared file in either
C or D leaving A/B reclaim to only attack A/B/E.
Pinning the shared file to either C or D, using a TBD policy such as a
mount option, would solve this for tightly shared files. But for a
wide-fanout file (libc) the admin would need to assign a global bucket,
and this would be a pain to size due to varying job requirements.
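The arithmetic behind this worry can be sketched in a few lines of Python (a toy model, not kernel code; the reclaim rule here, that anything not covered by its holder's own min is fair game for B-internal reclaim, is a deliberate simplification, and all names are invented for illustration):

```python
MB = 1 << 20

def exposed(charges, mins):
    """Toy reclaim rule: within B, anything a holder's own min
    protection doesn't cover is fair game for B-internal reclaim."""
    return {cg: max(0, charges[cg] - mins.get(cg, 0)) for cg in charges}

SHARED = 2 * MB   # the file shared by C and D
E_DATA = 2 * MB   # private reclaimable file data in A/B/E (more than 1M)

# RFC model: the shared file is charged to the common parent B itself,
# so no child's min covers any of it.
rfc = exposed({"B": SHARED, "E": E_DATA}, {})

# Per-page model: the file landed in (say) C, whose min=1M covers part of it.
per_page = exposed({"C": SHARED, "E": E_DATA}, {"C": 1 * MB})
```

Under the RFC model both the 2M shared file and E's private data are fully exposed to B's reclaim; under per-page charging at least 1M of the shared file stays protected by C's min, so reclaim leans on A/B/E first.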
>> If two cgroups collude they can use more memory than their limit and
>> oom the entire machine. Admittedly the current per-page system isn't
>> perfect because deleting a memcg which contains mlocked memory
>> (referenced by a remote memcg) moves the mlocked memory to root
>> resulting in the same issue. But I'd argue this is more likely with
>
> Hmmm... why does it do that? Can you point me to where it's
> happening?
My mistake, I was thinking of older kernels which reparent memory.
Though I can't say v3.19-rc7 handles this collusion any better. Instead
of reparenting the mlocked memory, it's left in an invisible (offline)
memcg. Unlike in older kernels, the memory doesn't appear in
root/memory.stat[unevictable]; instead it's buried in
root/memory.stat[total_unevictable], which includes mlocked memory in
visible (online) and invisible (offline) children.
>> the RFC because it doesn't involve the cgroup deletion/reparenting. A
>
> One approach could be expanding on the aforementioned scheme and making
> all sharing cgroups get charged for the shared inodes they're using,
> which should render such collusions entirely pointless.
> e.g. let's say we start with the following.
>
> A (usage=48M)
> +-B (usage=16M)
> \-C (usage=32M)
>
> And let's say, C starts accessing an inode which is 8M and currently
> associated with B.
>
> A (usage=48M, hosted= 8M)
> +-B (usage= 8M, shared= 8M)
> \-C (usage=32M, shared= 8M)
>
> The only extra charging that we'd be doing is charging C with an extra
> 8M. Let's say another cgroup D gets created and uses 8M.
>
> A (usage=56M, hosted= 8M)
> +-B (usage= 8M, shared= 8M)
> +-C (usage=32M, shared= 8M)
> \-D (usage= 8M)
>
> and it also accesses the inode.
>
> A (usage=56M, hosted= 8M)
> +-B (usage= 8M, shared= 8M)
> +-C (usage=32M, shared= 8M)
> \-D (usage= 8M, shared= 8M)
>
> We'd need to track the shared charges separately as they should count
> only once in the parent but that shouldn't be too hard. The problem
> here is that we'd need to track which inodes are being accessed by
> which children, which can get painful for things like libc. Maybe we
> can limit it to be level-by-level - track sharing only from the
> immediate children and always move a shared inode at one level at a
> time. That would lose some ability to track the sharing beyond the
> immediate children but it should be enough to solve the root case and
> allow us to adapt to changing usage pattern over time. Given that
> sharing is mostly a corner case, this could be good enough.
>
> Now, say D accesses a 4M area of the inode which hasn't been accessed
> by others yet. We'd want it to look like the following.
>
> A (usage=64M, hosted=16M)
> +-B (usage= 8M, shared=16M)
> +-C (usage=32M, shared=16M)
> \-D (usage= 8M, shared=16M)
>
> But charging it to B, C at the same time prolly wouldn't be
> particularly convenient. We can prolly just do D -> A charging and
> let B and C sort themselves out later. Note that such charging would
> still maintain the overall integrity of memory limits. The only thing
> which may overflow is the pseudo shared charges to keep sharing in
> check and dealing with them later when B and C try to create further
> charges should be completely fine.
>
> Note that we can also try to split the shared charge across the users;
> however, charging the full amount seems like the better approach to
> me. We don't have any way to tell how the usage is distributed
> anyway. For use cases where this sort of sharing is expected, I think
> it's perfectly reasonable to provision the sharing children to have
> enough to accommodate the possible full size of the shared resource.
>
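The quoted charging scheme can be sketched as a toy model (plain Python, not kernel code; the `Cgroup` class and its field names are invented for illustration) that reproduces the usage numbers in the diagrams above:

```python
MB = 1 << 20

class Cgroup:
    def __init__(self, name):
        self.name = name
        self.usage = 0   # private pages charged directly here
        self.shared = 0  # pseudo-charge: full size of shared inodes used here
        self.hosted = 0  # shared inodes hosted here on behalf of children
        self.children = []

    def hier_usage(self):
        # a hosted inode counts once (here), not once per sharing child
        return self.usage + self.hosted + sum(c.hier_usage() for c in self.children)

# Tejun's starting point: A contains B (16M) and C (32M).
A, B, C = Cgroup("A"), Cgroup("B"), Cgroup("C")
A.children = [B, C]
B.usage, C.usage = 16 * MB, 32 * MB

# C starts accessing an 8M inode that lived in B: push the inode up to A
# as "hosted" and charge the full size to each sharer as "shared".
inode = 8 * MB
B.usage -= inode
A.hosted += inode
for sharer in (B, C):
    sharer.shared += inode

# Overall usage is unchanged; the sharers only picked up pseudo-charges.
assert A.hier_usage() == 48 * MB
assert (B.usage, B.shared, C.shared) == (8 * MB, 8 * MB, 8 * MB)

# A new cgroup D with 8M of its own also accesses the inode.
D = Cgroup("D")
D.usage = 8 * MB
A.children.append(D)
D.shared += inode
assert A.hier_usage() == 56 * MB  # only D's private 8M was added
```

The key property the asserts check is the one Tejun calls out: the shared pseudo-charges make collusion pointless (every sharer is charged the full size) while the inode still counts only once in the parent's real usage.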
>> possible tweak to shore up the current system is to move such mlocked
>> pages to the memcg of the surviving locker. When the machine is oom
>> it's often nice to examine memcg state to determine which container is
>> using the memory. Tracking down who's contributing to a shared
>> container is non-trivial.
>>
>> I actually have a set of patches which add a memcg=M mount option to
>> memory backed file systems. I was planning on proposing them,
>> regardless of this RFC, and this discussion makes them even more
>> appealing. If we go in this direction, then we'd need a similar
>> notion for disk based filesystems. As Konstantin suggested, it'd be
>> really nice to specify charge policy on a per file, or directory, or
>> bind mount basis. This allows shared files to be deterministically
>
> I'm not too sure about that. We might add that later if absolutely
> justifiable but designing assuming that level of intervention from
> userland may not be such a good idea.
>
>> When there's large incidental sharing, then things get sticky. A
>> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
>> a small container would pull all pages to the root memcg where they
>> are exposed to root pressure which breaks isolation. This is
>> concerning. Perhaps such accesses could be decorated with something
>> like O_NO_MOVEMEM.
>
> If such a thing is really necessary, FADV_NOREUSE would be a better
> indicator; however, yes, such incidental sharing is easier to handle
> with a per-page scheme, as such a scanner can be limited in the number
> of pages it can carry throughout its operation regardless of which
> cgroup it's looking at. It still has the nasty corner case where
> random target cgroups can latch onto pages faulted in by the scanner
> and keep accessing them though, so, even now, FADV_NOREUSE would be a good
> idea. Note that such scanning, if repeated on cgroups under high
> memory pressure, is *likely* to accumulate residue escaped pages and
> if such a management cgroup is transient, those escaped pages will
> accumulate over time outside any limit in a way which is unpredictable
> and invisible.
>
>> So this RFC change will introduce significant change to user space
>> machine managers and perturb isolation. Is the resulting system
>> better? It's not clear; it's the devil known vs. the devil unknown.
>> Maybe it'd be easier if the memcgs I'm talking about were not allowed to
>> share page cache (aka copy-on-read) even for files which are jointly
>> visible. That would provide today's interface while avoiding the
>> problematic sharing.
>
> Yeah, compatibility would be the stickiest part.
>
> Thanks.
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-04 23:51 ` Greg Thelen
@ 2015-02-05 13:15 ` Tejun Heo
-1 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-05 13:15 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
Hello, Greg.
On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote:
> I think the linux-next low (and the TBD min) limits also have the
> problem for more than just the root memcg. I'm thinking of a 2M file
> shared between C and D below. The file will be charged to common parent
> B.
>
> A
> +-B (usage=2M lim=3M min=2M)
> +-C (usage=0 lim=2M min=1M shared_usage=2M)
> +-D (usage=0 lim=2M min=1M shared_usage=2M)
> \-E (usage=0 lim=2M min=0)
>
> The problem arises if A/B/E allocates more than 1M of private
> reclaimable file data. This pushes A/B into reclaim which will reclaim
> both the shared file from A/B and private file from A/B/E. In contrast,
> the current per-page memcg would've protected the shared file in either
> C or D leaving A/B reclaim to only attack A/B/E.
>
> Pinning the shared file to either C or D, using TBD policy such as mount
> option, would solve this for tightly shared files. But for wide fanout
> file (libc) the admin would need to assign a global bucket and this
> would be a pain to size due to various job requirements.
Shouldn't we be able to handle it the same way as I proposed for
handling sharing? The above would look like
A
+-B (usage=2M lim=3M min=2M hosted_usage=2M)
+-C (usage=0 lim=2M min=1M shared_usage=2M)
+-D (usage=0 lim=2M min=1M shared_usage=2M)
\-E (usage=0 lim=2M min=0)
Now, we don't wanna use B's min verbatim on the hosted inodes shared
by children. But we're unconditionally charging the shared amount to
all sharing children, which means we're eating into the min settings
of all participating children; so, we should be able to use the sum of
all sharing children's min-covered amounts as the inode's min, which
of course is to be contained inside the min of the parent.
Above, we're charging 2M to C and D, each of which has 1M min which is
being consumed by the shared charge (the shared part won't get
reclaimed from the internal pressure of children, so we're really
taking that part away from it). Summing them up, the shared inode
would have 2M protection which is honored as long as B as a whole is
under its 3M limit. This is similar to creating a dedicated child for
each shared resource for low limits. The downside is that we end up
guarding the shared inodes more than non-shared ones, but, after all,
we're charging it to everybody who's using it.
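In toy form, the min computation described above might look like this (a sketch under the stated assumptions, not kernel code; `hosted_inode_min` is an invented name, and "min-covered" is modeled as the part of a child's min eaten by its shared charge):

```python
MB = 1 << 20

def hosted_inode_min(host_min, sharers):
    """sharers: list of (child_min, shared_usage) pairs.
    Each sharer contributes the part of its own min consumed by the
    shared charge; the total is capped by the hosting parent's min."""
    covered = sum(min(child_min, shared) for child_min, shared in sharers)
    return min(covered, host_min)

# B (min=2M) hosts a 2M inode shared by C and D (min=1M each): the two
# 1M min-covered slices sum to the full 2M of protection.
protection = hosted_inode_min(2 * MB, [(1 * MB, 2 * MB), (1 * MB, 2 * MB)])
assert protection == 2 * MB
```

With this rule, children that set no min contribute no protection to the hosted inode, matching the intent that the inode is guarded only to the extent the sharers have paid for it.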
Would something like this work?
Thanks.
--
tejun
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-05 13:15 ` Tejun Heo
@ 2015-02-05 22:05 ` Greg Thelen
-1 siblings, 0 replies; 74+ messages in thread
From: Greg Thelen @ 2015-02-05 22:05 UTC (permalink / raw)
To: Tejun Heo
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
On Thu, Feb 05 2015, Tejun Heo wrote:
> Hello, Greg.
>
> On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote:
>> I think the linux-next low (and the TBD min) limits also have the
>> problem for more than just the root memcg. I'm thinking of a 2M file
>> shared between C and D below. The file will be charged to common parent
>> B.
>>
>> A
>> +-B (usage=2M lim=3M min=2M)
>> +-C (usage=0 lim=2M min=1M shared_usage=2M)
>> +-D (usage=0 lim=2M min=1M shared_usage=2M)
>> \-E (usage=0 lim=2M min=0)
>>
>> The problem arises if A/B/E allocates more than 1M of private
>> reclaimable file data. This pushes A/B into reclaim which will reclaim
>> both the shared file from A/B and private file from A/B/E. In contrast,
>> the current per-page memcg would've protected the shared file in either
>> C or D leaving A/B reclaim to only attack A/B/E.
>>
>> Pinning the shared file to either C or D, using TBD policy such as mount
>> option, would solve this for tightly shared files. But for wide fanout
>> file (libc) the admin would need to assign a global bucket and this
>> would be a pain to size due to various job requirements.
>
> Shouldn't we be able to handle it the same way as I proposed for
> handling sharing? The above would look like
>
> A
> +-B (usage=2M lim=3M min=2M hosted_usage=2M)
> +-C (usage=0 lim=2M min=1M shared_usage=2M)
> +-D (usage=0 lim=2M min=1M shared_usage=2M)
> \-E (usage=0 lim=2M min=0)
>
> Now, we don't wanna use B's min verbatim on the hosted inodes shared
> by children but we're unconditionally charging the shared amount to
> all sharing children, which means that we're eating into the min
> settings of all participating children, so, we should be able to use
> sum of all sharing children's min-covered amount as the inode's min,
> which of course is to be contained inside the min of the parent.
>
> Above, we're charging 2M to C and D, each of which has 1M min which is
> being consumed by the shared charge (the shared part won't get
> reclaimed from the internal pressure of children, so we're really
> taking that part away from it). Summing them up, the shared inode
> would have 2M protection which is honored as long as B as a whole is
> under its 3M limit. This is similar to creating a dedicated child for
> each shared resource for low limits. The downside is that we end up
> guarding the shared inodes more than non-shared ones, but, after all,
> we're charging it to everybody who's using it.
>
> Would something like this work?
Maybe, but I want to understand more about how pressure works in the
child. As C (or D) allocates non-shared memory, does it perform reclaim
to ensure that C.usage + C.shared_usage < C.lim? Given that C's
shared_usage is linked into B's LRU, it wouldn't be naturally
reclaimable by C. Are you thinking that charge failures on cgroups with
non-zero shared_usage would, as needed, induce reclaim of the parent's
hosted_usage?
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-05 22:05 ` Greg Thelen
(?)
@ 2015-02-05 22:25 ` Tejun Heo
-1 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-05 22:25 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
Hey,
On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote:
> > A
> > +-B (usage=2M lim=3M min=2M hosted_usage=2M)
> > +-C (usage=0 lim=2M min=1M shared_usage=2M)
> > +-D (usage=0 lim=2M min=1M shared_usage=2M)
> > \-E (usage=0 lim=2M min=0)
...
> Maybe, but I want to understand more about how pressure works in the
> child. As C (or D) allocates non-shared memory, does it perform reclaim
> to ensure that (C.usage + C.shared_usage) < C.lim? Given that C's
Yes.
> shared_usage is linked into B's LRU, it wouldn't be naturally reclaimable
> by C. Are you thinking that charge failures on cgroups with non-zero
> shared_usage would, as needed, induce reclaim of the parent's hosted_usage?
Hmmm.... I'm not really sure but why not? If we properly account for
the low protection when pushing inodes to the parent, I don't think
it'd break anything. IOW, allow the amount beyond the sum of low
limits to be reclaimed when one of the sharers is under pressure.
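The rule sketched above ("allow the amount beyond the sum of low limits to
be reclaimed") can be written down as a small, hypothetical helper; the
function name and shape are illustrative only, not an existing kernel
interface:

```python
MB = 1024 * 1024

def reclaimable_hosted(hosted_usage, child_lows, shared):
    """How much of a hosted inode a sharer under pressure may reclaim.

    Each sharing child protects at most its own low setting and at most
    the shared charge it carries; only the excess over that summed
    protection is fair game.
    """
    protected = sum(min(low, shared) for low in child_lows)
    return max(0, hosted_usage - protected)

# B hosts a 2M inode shared by C (low=1M) and D (low=1M): fully protected.
print(reclaimable_hosted(2 * MB, [1 * MB, 1 * MB], 2 * MB))  # -> 0

# If only C (low=1M) shared it, 1M would be reclaimable under pressure.
print(reclaimable_hosted(2 * MB, [1 * MB], 2 * MB) // MB)  # -> 1
```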
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-05 22:25 ` Tejun Heo
@ 2015-02-06 0:03 ` Greg Thelen
-1 siblings, 0 replies; 74+ messages in thread
From: Greg Thelen @ 2015-02-06 0:03 UTC (permalink / raw)
To: Tejun Heo
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
On Thu, Feb 05 2015, Tejun Heo wrote:
> Hey,
>
> On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote:
>> > A
>> > +-B (usage=2M lim=3M min=2M hosted_usage=2M)
>> > +-C (usage=0 lim=2M min=1M shared_usage=2M)
>> > +-D (usage=0 lim=2M min=1M shared_usage=2M)
>> > \-E (usage=0 lim=2M min=0)
> ...
>> Maybe, but I want to understand more about how pressure works in the
>> child. As C (or D) allocates non shared memory does it perform reclaim
>> to ensure that its (C.usage + C.shared_usage < C.lim). Given C's
>
> Yes.
>
>> shared_usage is linked into B.LRU it wouldn't be naturally reclaimable
>> by C. Are you thinking that charge failures on cgroups with non zero
>> shared_usage would, as needed, induce reclaim of parent's hosted_usage?
>
> Hmmm.... I'm not really sure but why not? If we properly account for
> the low protection when pushing inodes to the parent, I don't think
> it'd break anything. IOW, allow the amount beyond the sum of low
> limits to be reclaimed when one of the sharers is under pressure.
>
> Thanks.
I'm not saying that it'd break anything. I think it's required that
children perform reclaim on shared data hosted in the parent. The child
is limited by shared_usage, so it needs the ability to reclaim it. So I
think we're in agreement: the child will reclaim the parent's
hosted_usage when the child is charged for shared_usage. Ideally the
only parental memory reclaimed in this situation would be shared. But I
think (though I can't claim to have followed the new memcg philosophy
discussions) that internal nodes in the cgroup tree (i.e. parents) do
not have any resources charged directly to them. All resources are
charged to leaf cgroups, which linger until those resources are
uncharged. Thus the LRUs of the parent will only contain hosted (shared)
memory. This thankfully focuses parental reclaim on shared pages. Child
pressure will, unfortunately, reclaim shared pages used by any
container. But if shared pages were charged to all sharing containers,
then reclaiming them will help relieve pressure in the caller.
So this is a system which charges all cgroups using a shared inode
(recharge on read) for all resident pages of that shared inode. There's
only one copy of the page in memory on just one LRU, but the page may be
charged to multiple containers' (shared_)usage.
Perhaps I missed it, but what happens when a child's limit is
insufficient to accept all pages shared by its siblings? Example
starting with 2M cached of a shared file:
A
+-B (usage=2M lim=3M hosted_usage=2M)
+-C (usage=0 lim=2M shared_usage=2M)
+-D (usage=0 lim=2M shared_usage=2M)
\-E (usage=0 lim=1M shared_usage=0)
If E faults in a new 4K page within the shared file, then E is a sharing
participant, so it'd be charged the 2M+4K, which pushes E over its
limit.
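A quick toy calculation (illustrative names only, not kernel code) of the
scenario above: the moment E touches one page it is charged the inode's
full resident size, which overshoots its 1M limit:

```python
MB, KB = 1024 * 1024, 1024

def charge_on_share(lim, own_usage, inode_resident, new_page=4 * KB):
    """Charge a new sharer the whole resident inode plus its new page.

    Returns the resulting shared_usage and how far over its limit the
    cgroup lands (0 if it still fits).
    """
    shared_usage = inode_resident + new_page   # charged the whole inode
    over = own_usage + shared_usage - lim
    return shared_usage, max(0, over)

# E: usage=0, lim=1M, faulting one 4K page of a 2M-resident shared file.
shared, over = charge_on_share(lim=1 * MB, own_usage=0,
                               inode_resident=2 * MB)
print(shared, over)  # E overshoots its limit by 1M + 4K
```

This is exactly the case the question raises: no amount of reclaim inside
E can satisfy the charge, since the shared pages live on B's LRU.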
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-06 0:03 ` Greg Thelen
@ 2015-02-06 14:17 ` Tejun Heo
-1 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-06 14:17 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
Hello, Greg.
On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote:
> So this is a system which charges all cgroups using a shared inode
> (recharge on read) for all resident pages of that shared inode. There's
> only one copy of the page in memory on just one LRU, but the page may be
> charged to multiple container's (shared_)usage.
Yeap.
> Perhaps I missed it, but what happens when a child's limit is
> insufficient to accept all pages shared by its siblings? Example
> starting with 2M cached of a shared file:
>
> A
> +-B (usage=2M lim=3M hosted_usage=2M)
> +-C (usage=0 lim=2M shared_usage=2M)
> +-D (usage=0 lim=2M shared_usage=2M)
> \-E (usage=0 lim=1M shared_usage=0)
>
> If E faults in a new 4K page within the shared file, then E is a sharing
> participant so it'd be charged the 2M+4K, which pushes E over it's
> limit.
OOM? It shouldn't be participating in sharing of an inode if it can't
match others' protection on the inode, I think. What we're doing now
w/ page based charging is kinda unfair, because in situations like the
above the one under pressure can end up siphoning off of the larger
cgroups' protection if they actually use overlapping areas; however,
for disjoint areas, per-page charging would behave correctly.
So, this part comes down to the same question - whether multiple
cgroups accessing disjoint areas of a single inode is an important
enough use case. If we say yes to that, we better make writeback
support that too.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-06 14:17 ` Tejun Heo
@ 2015-02-06 23:43 ` Greg Thelen
-1 siblings, 0 replies; 74+ messages in thread
From: Greg Thelen @ 2015-02-06 23:43 UTC (permalink / raw)
To: Tejun Heo
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
On Fri, Feb 6, 2015 at 6:17 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Greg.
>
> On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote:
>> So this is a system which charges all cgroups using a shared inode
>> (recharge on read) for all resident pages of that shared inode. There's
>> only one copy of the page in memory on just one LRU, but the page may be
>> charged to multiple container's (shared_)usage.
>
> Yeap.
>
>> Perhaps I missed it, but what happens when a child's limit is
>> insufficient to accept all pages shared by its siblings? Example
>> starting with 2M cached of a shared file:
>>
>> A
>> +-B (usage=2M lim=3M hosted_usage=2M)
>> +-C (usage=0 lim=2M shared_usage=2M)
>> +-D (usage=0 lim=2M shared_usage=2M)
>> \-E (usage=0 lim=1M shared_usage=0)
>>
>> If E faults in a new 4K page within the shared file, then E is a sharing
>> participant so it'd be charged the 2M+4K, which pushes E over it's
>> limit.
>
> OOM? It shouldn't be participating in sharing of an inode if it can't
> match others' protection on the inode, I think. What we're doing now
> w/ page based charging is kinda unfair because in the situations like
> above the one under pressure can end up siphoning off of the larger
> cgroups' protection if they actually use overlapping areas; however,
> for disjoint areas, per-page charging would behave correctly.
>
> So, this part comes down to the same question - whether multiple
> cgroups accessing disjoint areas of a single inode is an important
> enough use case. If we say yes to that, we better make writeback
> support that too.
If cgroups are about isolation then writing to shared files should be
rare, so I'm willing to say that we don't need to handle shared
writers well. Shared readers seem like a more valuable use case
(thin provisioning). I'm getting overwhelmed with the thought
exercise of automatically moving inodes to common ancestors and
back-charging the sharers for shared_usage. I haven't wrapped my head
around how these shared data pages will get protected. It seems like
they'd no longer be protected by child min watermarks.
So I know this thread opened with the claim "both memcg and blkcg must
be looking at the same picture. Deviating them is highly likely to
lead to long-term issues forcing us to look at this again anyway, only
with far more baggage." But I'm still wondering if the following is
simpler:
(1) leave memcg as a per page controller.
(2) maintain a per inode i_memcg which is set to the common dirtying
ancestor. If not shared then it'll point to the memcg that the page
was charged to.
(3) when memcg dirty page pressure is seen, walk up the cgroup tree
writing dirty inodes; this will write shared inodes using the blkcg
priority of the respective levels.
(4) background limit wb_check_background_flush() and time based
wb_check_old_data_flush() can feel free to attack shared inodes to
hopefully restore them to non-shared state.
For non-shared inodes, this should behave the same. For shared inodes
it should only affect those in the hierarchy which is sharing.
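Step (2) of the proposal above can be sketched as follows. This is a
hedged toy model: `Memcg`, `Inode`, and the ancestor walk are Python
stand-ins for the kernel's css parent pointers and `i_memcg` field, not
real kernel APIs.

```python
class Memcg:
    """Toy memcg node with a parent pointer, like a css in the hierarchy."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

    def path_to_root(self):
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

def common_ancestor(a, b):
    # Lowest common ancestor of two memcgs; the root is always shared.
    ancestors = {id(m) for m in a.path_to_root()}
    for m in b.path_to_root():
        if id(m) in ancestors:
            return m
    return None

class Inode:
    def __init__(self):
        self.i_memcg = None   # the "common dirtying ancestor" field

    def dirty(self, memcg):
        # First dirtier owns the inode; a dirtier from another memcg
        # pushes i_memcg up to the common dirtying ancestor.
        if self.i_memcg is None:
            self.i_memcg = memcg
        else:
            self.i_memcg = common_ancestor(self.i_memcg, memcg)

root = Memcg("root")
B = Memcg("B", root)
C, D = Memcg("C", B), Memcg("D", B)

ino = Inode()
ino.dirty(C)             # non-shared: i_memcg points at C
ino.dirty(D)             # shared: moves up to B, not all the way to root
print(ino.i_memcg.name)  # -> B
```

Note how this keeps the sharing contained: writeback driven from B's
pressure only affects the subtree actually sharing the inode, which is
the property claimed in the last paragraph above.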
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-06 23:43 ` Greg Thelen
(?)
@ 2015-02-07 14:38 ` Tejun Heo
-1 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-07 14:38 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
Hello, Greg.
On Fri, Feb 06, 2015 at 03:43:11PM -0800, Greg Thelen wrote:
> If cgroups are about isolation then writing to shared files should be
> rare, so I'm willing to say that we don't need to handle shared
> writers well. Shared readers seem like a more valuable use cases
> (thin provisioning). I'm getting overwhelmed with the thought
> exercise of automatically moving inodes to common ancestors and back
> charging the sharers for shared_usage. I haven't wrapped my head
> around how these shared data pages will get protected. It seems like
> they'd no longer be protected by child min watermarks.
Yes, this is challenging. My current thought is to take the maximum of
the low settings of the sharing children, but I need to think more
about it. One problem is that the shared inodes will preemptively take
away the shared amount from the children's low protection. They won't
compete fairly with other inodes or anons, but they can't really, as
they don't belong to any single sharer.
> So I know this thread opened with the claim "both memcg and blkcg must
> be looking at the same picture. Deviating them is highly likely to
> lead to long-term issues forcing us to look at this again anyway, only
> with far more baggage." But I'm still wondering if the following is
> simpler:
> (1) leave memcg as a per page controller.
> (2) maintain a per inode i_memcg which is set to the common dirtying
> ancestor. If not shared then it'll point to the memcg that the page
> was charged to.
> (3) when memcg dirtying page pressure is seen, walk up the cgroup tree
> writing dirty inodes, this will write shared inodes using blkcg
> priority of the respective levels.
> (4) background limit wb_check_background_flush() and time based
> wb_check_old_data_flush() can feel free to attack shared inodes to
> hopefully restore them to non-shared state.
> For non-shared inodes, this should behave the same. For shared inodes
> it should only affect those in the hierarchy which is sharing.
The thing which breaks when you de-couple what memcg sees from the
rest of the stack is that the amount of memory which may be available
to a given cgroup and how much of that is dirty is the main linkage
propagating IO pressure to actual dirtying tasks. If you decouple the
two worldviews, you lose the ability to propagate IO pressure to
dirtiers in a controlled manner, and that's why anything inside a memcg
currently always triggers the direct reclaim path instead of being
properly dirty throttled.
You can argue that an inode being actively dirtied from multiple
cgroups is a rare case which we can sweep under the rug, and that
*might* be the case, but I have a nagging feeling that that would be a
decision made merely out of immediate convenience. I would much prefer
having a well-defined model of sharing inodes and anons across cgroups,
so that the behaviors shown in those cases aren't mere accidental
consequences without any innate meaning.
If we can argue that memcg and blkcg having different views is
meaningful and characterize and justify the behaviors stemming from
the deviation, sure, that'd be fine, but I don't think we have that as
of now.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
@ 2015-02-07 14:38 ` Tejun Heo
0 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-07 14:38 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jan Kara, Dave Chinner,
Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins
Hello, Greg.
On Fri, Feb 06, 2015 at 03:43:11PM -0800, Greg Thelen wrote:
> If cgroups are about isolation then writing to shared files should be
> rare, so I'm willing to say that we don't need to handle shared
> writers well. Shared readers seem like a more valuable use cases
> (thin provisioning). I'm getting overwhelmed with the thought
> exercise of automatically moving inodes to common ancestors and back
> charging the sharers for shared_usage. I haven't wrapped my head
> around how these shared data pages will get protected. It seems like
> they'd no longer be protected by child min watermarks.
Yes, this is challenging and what my current thought is around taking
the maximum of the low settings of the sharing children but I need to
think more about it. One problem is that the shared inodes will
preemptively take away the amount shared from the children's low
protection. They won't compete fairly with other inodes or anons but
they can't really as they don't really belong to any single sharer.
> So I know this thread opened with the claim "both memcg and blkcg must
> be looking at the same picture. Deviating them is highly likely to
> lead to long-term issues forcing us to look at this again anyway, only
> with far more baggage." But I'm still wondering if the following is
> simpler:
> (1) leave memcg as a per page controller.
> (2) maintain a per inode i_memcg which is set to the common dirtying
> ancestor. If not shared then it'll point to the memcg that the page
> was charged to.
> (3) when memcg dirtying page pressure is seen, walk up the cgroup tree
> writing dirty inodes, this will write shared inodes using blkcg
> priority of the respective levels.
> (4) background limit wb_check_background_flush() and time based
> wb_check_old_data_flush() can feel free to attack shared inodes to
> hopefully restore them to non-shared state.
> For non-shared inodes, this should behave the same. For shared inodes
> it should only affect those in the hierarchy which is sharing.
The thing which breaks when you de-couple what memcg sees from the
rest of the stack is that the amount of memory which may be available
to a given cgroup and how much of that is dirty is the main linkage
propagating IO pressure to actual dirtying tasks. If you decouple the
two worldviews, you lose the ability to propagate IO pressure to
dirtiers in a controlled manner and that's why anything inside a memcg
currently is always triggering direct reclaim path instead of being
properly dirty throttled.
You can argue that an inode being actively dirtied from multiple
cgroups is a rare case which we can sweep under the rug and that
*might* be the case but I have a nagging feeling that that would be a
decision which is made merely out of immediate convenience and would
much prefer having a well defined model of sharing inodes and anons
across cgroups so that the behaviors shown in thoses cases aren't mere
accidental consequences without any innate meaning.
If we can argue that memcg and blkcg having different views is
meaningful and characterize and justify the behaviors stemming from
the deviation, sure, that'd be fine, but I don't think we have that as
of now.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
@ 2015-02-07 14:38 ` Tejun Heo
0 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-07 14:38 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
Hello, Greg.
On Fri, Feb 06, 2015 at 03:43:11PM -0800, Greg Thelen wrote:
> If cgroups are about isolation then writing to shared files should be
> rare, so I'm willing to say that we don't need to handle shared
> writers well. Shared readers seem like a more valuable use cases
> (thin provisioning). I'm getting overwhelmed with the thought
> exercise of automatically moving inodes to common ancestors and back
> charging the sharers for shared_usage. I haven't wrapped my head
> around how these shared data pages will get protected. It seems like
> they'd no longer be protected by child min watermarks.
Yes, this is challenging and what my current thought is around taking
the maximum of the low settings of the sharing children but I need to
think more about it. One problem is that the shared inodes will
preemptively take away the amount shared from the children's low
protection. They won't compete fairly with other inodes or anons but
they can't really as they don't really belong to any single sharer.
> So I know this thread opened with the claim "both memcg and blkcg must
> be looking at the same picture. Deviating them is highly likely to
> lead to long-term issues forcing us to look at this again anyway, only
> with far more baggage." But I'm still wondering if the following is
> simpler:
> (1) leave memcg as a per page controller.
> (2) maintain a per inode i_memcg which is set to the common dirtying
> ancestor. If not shared then it'll point to the memcg that the page
> was charged to.
> (3) when memcg dirtying page pressure is seen, walk up the cgroup tree
> writing dirty inodes, this will write shared inodes using blkcg
> priority of the respective levels.
> (4) background limit wb_check_background_flush() and time based
> wb_check_old_data_flush() can feel free to attack shared inodes to
> hopefully restore them to non-shared state.
> For non-shared inodes, this should behave the same. For shared inodes
> it should only affect those in the hierarchy which is sharing.
The thing which breaks when you de-couple what memcg sees from the
rest of the stack is that the amount of memory which may be available
to a given cgroup and how much of that is dirty is the main linkage
propagating IO pressure to actual dirtying tasks. If you decouple the
two worldviews, you lose the ability to propagate IO pressure to
dirtiers in a controlled manner and that's why anything inside a memcg
currently always triggers the direct reclaim path instead of being
properly dirty throttled.
You can argue that an inode being actively dirtied from multiple
cgroups is a rare case which we can sweep under the rug and that
*might* be the case but I have a nagging feeling that that would be a
decision which is made merely out of immediate convenience and would
much prefer having a well defined model of sharing inodes and anons
across cgroups so that the behaviors shown in those cases aren't mere
accidental consequences without any innate meaning.
If we can argue that memcg and blkcg having different views is
meaningful and characterize and justify the behaviors stemming from
the deviation, sure, that'd be fine, but I don't think we have that as
of now.
Thanks.
--
tejun
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-07 14:38 ` Tejun Heo
@ 2015-02-11 2:19 ` Tejun Heo
-1 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-11 2:19 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
Hello, again.
On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
> If we can argue that memcg and blkcg having different views is
> meaningful and characterize and justify the behaviors stemming from
> the deviation, sure, that'd be fine, but I don't think we have that as
> of now.
If we assume that memcg and blkcg having different views is something
which represents an acceptable compromise considering the use cases
and implementation convenience - IOW, if we assume that read-sharing
is something which can happen regularly while write sharing is a
corner case and that while not completely correct the existing
self-corrective behavior from tracking ownership per-page at the point
of instantiation is good enough (as a memcg under pressure is likely
to give up shared pages to be re-instantiated by another sharer w/
more budget), we need to do the impedance matching between memcg and
blkcg at the writeback layer.
The main issue there is that the last chain of IO pressure propagation
is realized by making individual dirtying tasks converge on a common
target dirty ratio point, which naturally depends on those tasks
seeing the same picture in terms of the current write bandwidth
and available memory and how much of it is dirty. Tasks dirtying
pages belonging to the same memcg while some of them are mostly being
written out by a different blkcg would wreck the mechanism. It won't
be difficult for one subset to make the other consider itself under
severe IO pressure when there actually isn't any in that group,
possibly stalling and starving those tasks unduly. At a more basic
level, it's just wrong for one group to be writing out a significant
amount for another.
These issues can persist indefinitely if we follow the same
instantiator-owns rule for inode writebacks. Even if we reset the
ownership when an inode becomes clean, it wouldn't work as it can be
dirtied over and over again while under writeback, and when things
like this happen, the behavior may become extremely difficult to
understand or characterize. We don't have visibility into how
individual pages of an inode get distributed across multiple cgroups,
who's currently responsible for writing back a specific inode or how
dirty ratio mechanism is behaving in the face of the unexpected
combination of parameters.
Even if we assume that write sharing is a fringe case, we need
something better than first-whatever rule when choosing which blkcg is
responsible for writing a shared inode out. There needs to be a
constant corrective pressure so that incidental and temporary sharings
don't end up screwing up the mechanism for an extended period of time.
Greg mentioned choosing the closest ancestor of the sharers, which
basically pushes inode sharing policy implementation down to writeback
from memcg. This could work but we end up with the same collusion
problem as when this is used for memcg and it's even more difficult to
solve this at writeback layer - we'd have to communicate the shared
state all the way down to block layer and then implement a mechanism
there to take corrective measures and even after that we're likely to
end up with prolonged state where dirty ratio propagation is
essentially broken as the dirtier and writer would be seeing different
pictures.
So, based on the assumption that write sharings are mostly incidental
and temporary (ie. we're basically declaring that we don't support
persistent write sharing), how about something like the following?
1. memcg continues per-page tracking.
2. Each inode is associated with a single blkcg at a given time and
written out by that blkcg.
3. While writing back, if the number of pages from foreign memcgs is
higher than a certain ratio of total written pages, the inode is
marked as disowned and the writeback instance is optionally
terminated early. e.g. if the ratio of foreign pages is over 50%
after writing out the number of pages matching 5s worth of write
bandwidth for the bdi, mark the inode as disowned.
4. On the following dirtying of the inode, the inode is associated
with the matching blkcg of the dirtied page. Note that this could
be the next cycle as the inode could already have been marked dirty
by the time the above condition triggered. In that case, the
following writeback would be terminated early too.
This should provide sufficient corrective pressure so that incidental
and temporary sharing of an inode doesn't become a persistent issue
while keeping the complexity necessary for implementing such pressure
fairly minimal and self-contained. Also, the changes necessary for
individual filesystems would be minimal.
I think this should work well enough as long as the aforementioned
assumptions are true - IOW, if we maintain that write sharing is
unsupported.
What do you think?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 2:19 ` Tejun Heo
@ 2015-02-11 7:32 ` Jan Kara
-1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2015-02-11 7:32 UTC (permalink / raw)
To: Tejun Heo
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
Hello Tejun,
On Tue 10-02-15 21:19:06, Tejun Heo wrote:
> On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
> > If we can argue that memcg and blkcg having different views is
> > meaningful and characterize and justify the behaviors stemming from
> > the deviation, sure, that'd be fine, but I don't think we have that as
> > of now.
...
> So, based on the assumption that write sharings are mostly incidental
> and temporary (ie. we're basically declaring that we don't support
> persistent write sharing), how about something like the following?
>
> 1. memcg continues per-page tracking.
>
> 2. Each inode is associated with a single blkcg at a given time and
> written out by that blkcg.
>
> 3. While writing back, if the number of pages from foreign memcg's is
> higher than certain ratio of total written pages, the inode is
> marked as disowned and the writeback instance is optionally
> terminated early. e.g. if the ratio of foreign pages is over 50%
> after writing out the number of pages matching 5s worth of write
> bandwidth for the bdi, mark the inode as disowned.
>
> 4. On the following dirtying of the inode, the inode is associated
> with the matching blkcg of the dirtied page. Note that this could
> be the next cycle as the inode could already have been marked dirty
> by the time the above condition triggered. In that case, the
> following writeback would be terminated early too.
>
> This should provide sufficient corrective pressure so that incidental
> and temporary sharing of an inode doesn't become a persistent issue
> while keeping the complexity necessary for implementing such pressure
> fairly minimal and self-contained. Also, the changes necessary for
> individual filesystems would be minimal.
I like this proposal. It looks simple enough and when inodes aren't
permanently write-shared it converges to the blkcg that is currently
writing to the inode. So ack from me.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 2:19 ` Tejun Heo
@ 2015-02-11 18:28 ` Greg Thelen
-1 siblings, 0 replies; 74+ messages in thread
From: Greg Thelen @ 2015-02-11 18:28 UTC (permalink / raw)
To: Tejun Heo
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
On Tue, Feb 10, 2015 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, again.
>
> On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
>> If we can argue that memcg and blkcg having different views is
>> meaningful and characterize and justify the behaviors stemming from
>> the deviation, sure, that'd be fine, but I don't think we have that as
>> of now.
>
> If we assume that memcg and blkcg having different views is something
> which represents an acceptable compromise considering the use cases
> and implementation convenience - IOW, if we assume that read-sharing
> is something which can happen regularly while write sharing is a
> corner case and that while not completely correct the existing
> self-corrective behavior from tracking ownership per-page at the point
> of instantiation is good enough (as a memcg under pressure is likely
> to give up shared pages to be re-instantiated by another sharer w/
> more budget), we need to do the impedance matching between memcg and
> blkcg at the writeback layer.
>
> The main issue there is that the last chain of IO pressure propagation
> is realized by making individual dirtying tasks to converge on a
> common target dirty ratio point which naturally depending on those
> tasks seeing the same picture in terms of the current write bandwidth
> and available memory and how much of it is dirty. Tasks dirtying
> pages belonging to the same memcg while some of them are mostly being
> written out by a different blkcg would wreck the mechanism. It won't
> be difficult for one subset to make the other to consider themselves
> under severe IO pressure when there actually isn't one in that group
> possibly stalling and starving those tasks unduly. At more basic
> level, it's just wrong for one group to be writing out significant
> amount for another.
>
> These issues can persist indefinitely if we follow the same
> instantiator-owns rule for inode writebacks. Even if we reset the
> ownership when an inode becomes clean, it wouldn't work as it can be
> dirtied over and over again while under writeback, and when things
> like this happen, the behavior may become extremely difficult to
> understand or characterize. We don't have visibility into how
> individual pages of an inode get distributed across multiple cgroups,
> who's currently responsible for writing back a specific inode or how
> dirty ratio mechanism is behaving in the face of the unexpected
> combination of parameters.
>
> Even if we assume that write sharing is a fringe case, we need
> something better than first-whatever rule when choosing which blkcg is
> responsible for writing a shared inode out. There needs to be a
> constant corrective pressure so that incidental and temporary sharings
> don't end up screwing up the mechanism for an extended period of time.
>
> Greg mentioned choosing the closest ancestor of the sharers, which
> basically pushes inode sharing policy implementation down to writeback
> from memcg. This could work but we end up with the same collusion
> problem as when this is used for memcg and it's even more difficult to
> solve this at writeback layer - we'd have to communicate the shared
> state all the way down to block layer and then implement a mechanism
> there to take corrective measures and even after that we're likely to
> end up with prolonged state where dirty ratio propagation is
> essentially broken as the dirtier and writer would be seeing different
> pictures.
>
> So, based on the assumption that write sharings are mostly incidental
> and temporary (ie. we're basically declaring that we don't support
> persistent write sharing), how about something like the following?
>
> 1. memcg continues per-page tracking.
>
> 2. Each inode is associated with a single blkcg at a given time and
> written out by that blkcg.
>
> 3. While writing back, if the number of pages from foreign memcg's is
> higher than certain ratio of total written pages, the inode is
> marked as disowned and the writeback instance is optionally
> terminated early. e.g. if the ratio of foreign pages is over 50%
> after writing out the number of pages matching 5s worth of write
> bandwidth for the bdi, mark the inode as disowned.
>
> 4. On the following dirtying of the inode, the inode is associated
> with the matching blkcg of the dirtied page. Note that this could
> be the next cycle as the inode could already have been marked dirty
> by the time the above condition triggered. In that case, the
> following writeback would be terminated early too.
>
> This should provide sufficient corrective pressure so that incidental
> and temporary sharing of an inode doesn't become a persistent issue
> while keeping the complexity necessary for implementing such pressure
> fairly minimal and self-contained. Also, the changes necessary for
> individual filesystems would be minimal.
>
> I think this should work well enough as long as the aforementioned
> assumptions are true - IOW, if we maintain that write sharing is
> unsupported.
>
> What do you think?
>
> Thanks.
>
> --
> tejun
This seems good. I assume that blkcg writeback would query the
corresponding memcg for its dirty page count to determine if it's over
the background limit. And balance_dirty_pages() would query the memcg's
dirty
page count to throttle based on blkcg's bandwidth. Note: memcg
doesn't yet have dirty page counts, but several of us have made
attempts at adding the counters. And it shouldn't be hard to get them
merged.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 18:28 ` Greg Thelen
@ 2015-02-11 20:33 ` Tejun Heo
-1 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-11 20:33 UTC (permalink / raw)
To: Greg Thelen
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
Hello, Greg.
On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote:
> This seems good. I assume that blkcg writeback would query
> corresponding memcg for dirty page count to determine if over
> background limit. And balance_dirty_pages() would query memcg's dirty
Yeah, the memory available to the matching memcg and the number of
dirty pages in it. It's gonna work the same way as the global case,
just scoped to the cgroup.
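[Editor's note: the "global case scoped to the cgroup" check might look like the following userspace sketch; the helper name and parameters are illustrative, not a kernel API.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The global background dirty check scoped to one memcg: consider
 * the memcg over its background limit when its dirty pages exceed
 * bg_ratio_pct percent of the memory available to that memcg.
 */
static bool memcg_over_bg_limit(unsigned long memcg_dirty_pages,
				unsigned long memcg_avail_pages,
				unsigned int bg_ratio_pct)
{
	return memcg_dirty_pages * 100 >
	       memcg_avail_pages * bg_ratio_pct;
}
```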
> page count to throttle based on blkcg's bandwidth. Note: memcg
> doesn't yet have dirty page counts, but several of us have made
> attempts at adding the counters. And it shouldn't be hard to get them
> merged.
Can you please post those?
So, cool, we're in agreement. Working on it. It shouldn't take too
long, hopefully.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 20:33 ` Tejun Heo
@ 2015-02-11 21:22 ` Konstantin Khlebnikov
-1 siblings, 0 replies; 74+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-11 21:22 UTC (permalink / raw)
To: Tejun Heo
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
On Wed, Feb 11, 2015 at 11:33 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Greg.
>
> On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote:
>> This seems good. I assume that blkcg writeback would query
>> corresponding memcg for dirty page count to determine if over
>> background limit. And balance_dirty_pages() would query memcg's dirty
>
> Yeah, available memory to the matching memcg and the number of dirty
> pages in it. It's gonna work the same way as the global case just
> scoped to the cgroup.
That might be a problem: all dirty pages accounted to a cgroup must be
reachable by its own writeback, or balance_dirty_pages() will be
unable to satisfy memcg dirty memory thresholds. I've done accounting
per-inode owner, but there is another option: shared inodes might be
handled differently and made available to all (or related) cgroup
writebacks.
Another issue is that the reclaimer now (mostly?) never triggers pageout.
The memcg reclaimer should do something if it finds a shared dirty page:
either move it into the right cgroup or make that inode reachable for
memcg writeback. I've sent a patch which marks shared dirty inodes
with a flag, I_DIRTY_SHARED or so.
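[Editor's note: the shared-inode marking idea can be modeled minimally as below; the struct, helper, and field names are illustrative only, with `dirty_shared` standing in for the proposed I_DIRTY_SHARED flag.]

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/*
 * Toy model of an inode that remembers which memcg first dirtied it
 * and flags itself shared when a different memcg dirties it, so that
 * all (or related) cgroup writebacks may then reach it.
 */
struct model_inode {
	int  owner_memcg;	/* id of first dirtier, NO_OWNER if clean */
	bool dirty_shared;	/* analogue of the proposed I_DIRTY_SHARED */
};

static void model_dirty_page(struct model_inode *ino, int memcg)
{
	if (ino->owner_memcg == NO_OWNER)
		ino->owner_memcg = memcg;
	else if (ino->owner_memcg != memcg)
		ino->dirty_shared = true;
}
```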
>
>> page count to throttle based on blkcg's bandwidth. Note: memcg
>> doesn't yet have dirty page counts, but several of us have made
>> attempts at adding the counters. And it shouldn't be hard to get them
>> merged.
>
> Can you please post those?
>
> So, cool, we're in agreement. Working on it. It shouldn't take too
> long, hopefully.
Good. As I see it, this design is almost identical to my proposal,
except maybe for that dumb first-owns-all-until-the-end rule.
>
> Thanks.
>
> --
> tejun
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 21:22 ` Konstantin Khlebnikov
@ 2015-02-11 21:46 ` Tejun Heo
-1 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-11 21:46 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
Hello,
On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
> > Yeah, available memory to the matching memcg and the number of dirty
> > pages in it. It's gonna work the same way as the global case just
> > scoped to the cgroup.
>
> That might be a problem: all dirty pages accounted to a cgroup must be
> reachable by its own writeback, or balance_dirty_pages() will be
> unable to satisfy memcg dirty memory thresholds. I've done accounting
Yeah, it would. Why wouldn't it?
> for per-inode owner, but there is another option: shared inodes might be
> handled differently and will be available for all (or related) cgroup
> writebacks.
I'm not following you at all. The only reason this scheme can work is
because we exclude persistent shared write cases. As the whole thing
is based on that assumption, special casing shared inodes doesn't make
any sense. Doing things like allowing all cgroups to write shared
inodes without getting memcg on-board almost immediately breaks
pressure propagation while making shared writes a lot more attractive
and increasing implementation complexity substantially. Am I missing
something?
> Another issue is that the reclaimer now (mostly?) never triggers pageout.
> The memcg reclaimer should do something if it finds a shared dirty page:
> either move it into the right cgroup or make that inode reachable for
> memcg writeback. I've sent a patch which marks shared dirty inodes
> with a flag, I_DIRTY_SHARED or so.
It *might* make sense for memcg to drop pages being dirtied which
don't match the currently associated blkcg of the inode; however,
again, as we're basically declaring that shared writes aren't
supported, I'm skeptical about the usefulness.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 21:46 ` Tejun Heo
@ 2015-02-11 21:57 ` Konstantin Khlebnikov
-1 siblings, 0 replies; 74+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-11 21:57 UTC (permalink / raw)
To: Tejun Heo
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
>> > Yeah, available memory to the matching memcg and the number of dirty
>> > pages in it. It's gonna work the same way as the global case just
>> > scoped to the cgroup.
>>
>> That might be a problem: all dirty pages accounted to a cgroup must be
>> reachable by its own writeback, or balance_dirty_pages() will be
>> unable to satisfy memcg dirty memory thresholds. I've done accounting
>
> Yeah, it would. Why wouldn't it?
How do you plan to do per-memcg/blkcg writeback for
balance_dirty_pages()? Or are you thinking only about separating the
writeback flow into blkio cgroups, without actual inode filtering? I
mean delaying inode writeback and keeping dirty pages as long as
possible if their cgroups are far from the threshold.
>
>> for per-inode owner, but there is another option: shared inodes might be
>> handled differently and will be available for all (or related) cgroup
>> writebacks.
>
> I'm not following you at all. The only reason this scheme can work is
> because we exclude persistent shared write cases. As the whole thing
> is based on that assumption, special casing shared inodes doesn't make
> any sense. Doing things like allowing all cgroups to write shared
> inodes without getting memcg on-board almost immediately breaks
> pressure propagation while making shared writes a lot more attractive
> and increasing implementation complexity substantially. Am I missing
> something?
>
>> Another issue is that the reclaimer now (mostly?) never triggers pageout.
>> The memcg reclaimer should do something if it finds a shared dirty page:
>> either move it into the right cgroup or make that inode reachable for
>> memcg writeback. I've sent a patch which marks shared dirty inodes
>> with a flag, I_DIRTY_SHARED or so.
>
> It *might* make sense for memcg to drop pages being dirtied which
> don't match the currently associated blkcg of the inode; however,
> again, as we're basically declaring that shared writes aren't
> supported, I'm skeptical about the usefulness.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
@ 2015-02-11 21:57 ` Konstantin Khlebnikov
0 siblings, 0 replies; 74+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-11 21:57 UTC (permalink / raw)
To: Tejun Heo
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
>> > Yeah, available memory to the matching memcg and the number of dirty
>> > pages in it. It's gonna work the same way as the global case just
>> > scoped to the cgroup.
>>
>> That might be a problem: all dirty pages accounted to cgroup must be
>> reachable for its own personal writeback or balance-dirty-pages will be
>> unable to satisfy memcg dirty memory thresholds. I've done accounting
>
> Yeah, it would. Why wouldn't it?
How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages?
Or are you thinking only about separating the writeback flow into blkio cgroups
without actual inode filtering? I mean delaying inode writeback and keeping
dirty pages as long as possible while their cgroups are far from the threshold.
>
>> for per-inode owner, but there is another option: shared inodes might be
>> handled differently and will be available for all (or related) cgroup
>> writebacks.
>
> I'm not following you at all. The only reason this scheme can work is
> because we exclude persistent shared write cases. As the whole thing
> is based on that assumption, special casing shared inodes doesn't make
> any sense. Doing things like allowing all cgroups to write shared
> inodes without getting memcg on-board almost immediately breaks
> pressure propagation while making shared writes a lot more attractive
> and increasing implementation complexity substantially. Am I missing
> something?
>
>> Another side is that the reclaimer now (mostly?) never triggers pageout.
>> The memcg reclaimer should do something if it finds a shared dirty page:
>> either move it into the right cgroup or make that inode reachable for
>> memcg writeback. I've sent a patch which marks shared dirty inodes
>> with a flag, I_DIRTY_SHARED or so.
>
> It *might* make sense for memcg to drop pages being dirtied which
> don't match the currently associated blkcg of the inode; however,
> again, as we're basically declaring that shared writes aren't
> supported, I'm skeptical about the usefulness.
>
> Thanks.
>
> --
> tejun
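For context, balance_dirty_pages() throttles a writer once dirty memory crosses a threshold; the memcg-scoped variant under discussion applies the same check against the cgroup's own limit and dirty count. A rough userspace model (class and field names, and the 20% default, are illustrative assumptions, not the kernel code):

```python
class Memcg:
    """Toy model of a memory cgroup with a scoped dirty threshold."""

    def __init__(self, limit_bytes, dirty_ratio=0.2):
        self.limit = limit_bytes          # like memory.limit_in_bytes
        self.dirty_ratio = dirty_ratio    # fraction of the limit allowed dirty
        self.dirty = 0                    # bytes of dirty page cache charged here

    def dirty_threshold(self):
        return int(self.limit * self.dirty_ratio)

    def should_throttle(self):
        # balance_dirty_pages() would block or slow the writer here.
        return self.dirty >= self.dirty_threshold()

cg = Memcg(limit_bytes=100 << 20)    # 100 MiB memcg -> 20 MiB dirty threshold
cg.dirty = 10 << 20
assert not cg.should_throttle()      # under the threshold: writer proceeds
cg.dirty = 25 << 20
assert cg.should_throttle()          # over the threshold: writer is throttled
```

Konstantin's point above is that this check only ever unblocks if writeback can actually reach the memcg's own dirty inodes and bring `dirty` back down.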
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 21:57 ` Konstantin Khlebnikov
@ 2015-02-11 22:05 ` Tejun Heo
0 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-11 22:05 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote:
> On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote:
> > Hello,
> >
> > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
> >> > Yeah, available memory to the matching memcg and the number of dirty
> >> > pages in it. It's gonna work the same way as the global case just
> >> > scoped to the cgroup.
> >>
> >> That might be a problem: all dirty pages accounted to cgroup must be
> >> reachable for its own personal writeback or balance-dirty-pages will be
> >> unable to satisfy memcg dirty memory thresholds. I've done accounting
> >
> > Yeah, it would. Why wouldn't it?
>
> How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages?
> Or you're thinking only about separating writeback flow into blkio cgroups
> without actual inode filtering? I mean delaying inode writeback and keeping
> dirty pages as long as possible if their cgroups are far from threshold.
What? The code was already in the previous patchset. I'm just gonna
rip out the code to handle an inode being dirtied on multiple wb's.
--
tejun
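The single-owner association Tejun describes, each inode dirtied against exactly one wb with the multi-wb handling removed, might be sketched like this (hypothetical names; the real patchset works on struct inode and bdi_writeback in C):

```python
class Inode:
    """Toy model: an inode is owned by at most one wb (writeback domain)."""

    def __init__(self):
        self.wb = None            # the single owning wb, set on first dirtying

    def mark_dirty(self, writer_wb):
        if self.wb is None:
            self.wb = writer_wb   # first dirtier claims the inode
            return "claimed"
        if writer_wb is not self.wb:
            # Shared write from another cgroup: the inode stays with its
            # owner; there is no per-page splitting across wb's.
            return "foreign"
        return "local"

wb_a, wb_b = object(), object()
inode = Inode()
assert inode.mark_dirty(wb_a) == "claimed"
assert inode.mark_dirty(wb_a) == "local"
assert inode.mark_dirty(wb_b) == "foreign"   # noted, but ownership unchanged
assert inode.wb is wb_a
```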
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 22:05 ` Tejun Heo
@ 2015-02-11 22:15 ` Konstantin Khlebnikov
0 siblings, 0 replies; 74+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-11 22:15 UTC (permalink / raw)
To: Tejun Heo
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
On Thu, Feb 12, 2015 at 1:05 AM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote:
>> On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote:
>> > Hello,
>> >
>> > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
>> >> > Yeah, available memory to the matching memcg and the number of dirty
>> >> > pages in it. It's gonna work the same way as the global case just
>> >> > scoped to the cgroup.
>> >>
>> >> That might be a problem: all dirty pages accounted to cgroup must be
>> >> reachable for its own personal writeback or balance-dirty-pages will be
>> >> unable to satisfy memcg dirty memory thresholds. I've done accounting
>> >
>> > Yeah, it would. Why wouldn't it?
>>
>> How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages?
>> Or you're thinking only about separating writeback flow into blkio cgroups
>> without actual inode filtering? I mean delaying inode writeback and keeping
>> dirty pages as long as possible if their cgroups are far from threshold.
>
> What? The code was already in the previous patchset. I'm just gonna
> rip out the code to handle inode being dirtied on multiple wb's.
Well, ok. Even if shared writes are rare they should be handled somehow
without relying on kupdate-like writeback. If a memcg has a lot of dirty pages
but their inodes accidentally belong to the wrong wb queues, then tasks in
that memcg shouldn't get stuck in balance-dirty-pages until somebody outside
accidentally writes this data. That's all I wanted to say.
>
> --
> tejun
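The stall Konstantin is worried about can be made concrete: dirty pages are charged to the memcg that dirtied them but cleaned via the wb that owns the inode, so if the two differ, flushing the memcg's own wb makes no progress (an illustrative model with made-up names, not kernel code):

```python
# memcg A dirtied 30 MiB through an inode whose wb belongs to cgroup B.
dirty = {"A": 30, "B": 0}    # MiB of dirty pages charged per memcg
inode_wb = "B"               # the shared inode's owning wb
inode_dirtier = "A"          # who the inode's dirty pages are charged to

def flush(wb):
    """Write back the inode only if this wb owns it; credit the dirtier."""
    if wb == inode_wb:
        dirty[inode_dirtier] = 0

flush("A")                 # A flushes its own wb: the inode isn't on it
assert dirty["A"] == 30    # still over threshold -> A stays throttled
flush("B")                 # only B's wb (or periodic writeback) can clean it
assert dirty["A"] == 0
```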
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 22:15 ` Konstantin Khlebnikov
@ 2015-02-11 22:30 ` Tejun Heo
0 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2015-02-11 22:30 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
Hugh Dickins
Hello,
On Thu, Feb 12, 2015 at 02:15:29AM +0400, Konstantin Khlebnikov wrote:
> Well, ok. Even if shared writes are rare they should be handled somehow
> without relying on kupdate-like writeback. If a memcg has a lot of dirty pages
This only works iff we consider those cases to be marginal enough to
handle them in a pretty ghetto way.
> but their inodes accidentally belong to the wrong wb queues, then tasks in
> that memcg shouldn't get stuck in balance-dirty-pages until somebody outside
> accidentally writes this data. That's all I wanted to say.
But, right, yeah, corner cases around this could be nasty if writeout
interval is set really high. I don't think it matters for the default
5s interval at all. Maybe what we need is queueing a delayed per-wb
work w/ the default writeout interval when dirtying a foreign inode.
I'll think more about it.
Thanks.
--
tejun
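Tejun's proposed mitigation, queueing a delayed flush on the owning wb whenever a foreign cgroup dirties one of its inodes, bounds such stalls to the writeout interval. In outline (hypothetical names; the kernel would presumably use a delayed_work, not a heap):

```python
import heapq

WRITEOUT_INTERVAL = 5.0   # seconds; the default periodic writeback interval

pending = []              # (deadline, wb) min-heap of queued flushes
queued = set()            # wb's that already have a delayed flush queued

def dirty_inode(inode_wb, writer_wb, now):
    if writer_wb != inode_wb and inode_wb not in queued:
        # Foreign dirtying: make sure the owning wb flushes soon, even if
        # its own cgroup never crosses its dirty threshold.
        heapq.heappush(pending, (now + WRITEOUT_INTERVAL, inode_wb))
        queued.add(inode_wb)

dirty_inode("wb_B", "wb_A", now=0.0)   # A dirties an inode owned by B
dirty_inode("wb_B", "wb_A", now=1.0)   # no duplicate work is queued
assert pending == [(5.0, "wb_B")]
```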
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
2015-02-11 20:33 ` Tejun Heo
@ 2015-02-12 2:10 ` Greg Thelen
0 siblings, 0 replies; 74+ messages in thread
From: Greg Thelen @ 2015-02-12 2:10 UTC (permalink / raw)
To: Tejun Heo
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
Christoph Hellwig, Li Zefan, Hugh Dickins
On Wed, Feb 11, 2015 at 12:33 PM, Tejun Heo <tj@kernel.org> wrote:
[...]
>> page count to throttle based on blkcg's bandwidth. Note: memcg
>> doesn't yet have dirty page counts, but several of us have made
>> attempts at adding the counters. And it shouldn't be hard to get them
>> merged.
>
> Can you please post those?
Will do. Rebasing and testing needed, so it won't be today.
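The counters in question would be incremented when a page is dirtied and decremented when it is cleaned, per memcg, mirroring the global NR_FILE_DIRTY / NR_WRITTEN accounting. A minimal model (hypothetical helper names; the real patches would hook the existing dirtying and writeback-completion paths):

```python
from collections import Counter

stats = Counter()   # per-(memcg, state) page counters

def account_page_dirtied(memcg):
    stats[(memcg, "dirty")] += 1

def account_page_cleaned(memcg):
    stats[(memcg, "dirty")] -= 1
    stats[(memcg, "written")] += 1

for _ in range(3):
    account_page_dirtied("A")   # memcg A dirties three pages
account_page_cleaned("A")       # writeback completes on one of them
assert stats[("A", "dirty")] == 2
assert stats[("A", "written")] == 1
```

With such counters, the memcg-scoped balance-dirty-pages check discussed earlier in the thread has something to compare against its threshold.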