* [RFC] Making memcg track ownership per address_space or anon_vma
@ 2015-01-30  4:43 Tejun Heo
  2015-01-30  5:55 ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-01-30  4:43 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner,
      Jens Axboe, Christoph Hellwig, Li Zefan, gthelen, hughd,
      Konstantin Khlebnikov

Hello,

Since the cgroup writeback patchset[1] has been posted, several people
brought up concerns about whether the complexity of allowing an inode
to be dirtied against multiple cgroups is necessary for the purpose of
writeback, and it is true that a significant amount of complexity
(note that bdi still needs to be split, so it's still not trivial) can
be removed if we assume that an inode always belongs to one cgroup for
the purpose of writeback.

However, as mentioned before, this issue is directly linked to whether
memcg needs to track memory ownership per-page.  If there are valid
use cases where the pages of an inode must be tracked as owned by
different cgroups, cgroup writeback must be able to handle that
situation properly.  If there are no such cases, the cgroup writeback
support can be simplified, but then we should put memcg on the same
cadence and enforce per-inode (or per-anon_vma) ownership from the
beginning.  The conclusion can go either way - per-page or per-inode -
but both memcg and blkcg must be looking at the same picture.
Deviating them is highly likely to lead to long-term issues forcing us
to look at this again anyway, only with far more baggage.

One thing to note is that the per-page tracking currently employed by
memcg seems to have been born more out of convenience than out of
requirements from any actual use cases.  Per-page ownership makes
sense iff pages of an inode have to be associated with different
cgroups - IOW, when an inode is accessed by multiple cgroups; however,
currently, memcg assigns a page to its instantiating memcg and leaves
it at that till the page is released.  This means that if a page is
instantiated by one cgroup and then subsequently accessed only by a
different cgroup, whether the page's charge gets moved to the cgroup
which is actively using it is purely incidental.  If the page gets
reclaimed and released at some point, it'll be moved.  If not, it
won't.

AFAICS, the only case where the current per-page accounting works
properly is when disjoint sections of an inode are used by different
cgroups, and the whole thing hinges on whether this use case justifies
all the added overhead, including the page->mem_cgroup pointer and the
extra complexity in the writeback layer.  FWIW, I'm doubtful.
Johannes, Michal, Greg, what do you guys think?

If the above use case - a huge file being actively accessed disjointly
by multiple cgroups - isn't significant enough and there aren't other
use cases that I missed which can benefit from the per-page tracking
that's currently implemented, it'd be logical to switch to per-inode
(or per-anon_vma or per-slab) ownership tracking.  For the short term,
even just adding extra ownership information to those containing
objects and inheriting it into page->mem_cgroup could work, although
it'd definitely be beneficial to eventually get rid of
page->mem_cgroup.

As with per-page, when the ownership terminates is debatable w/
per-inode tracking.  Also, supporting some form of shared accounting
across different cgroups may be useful (e.g. a shared library's memory
being split equally among everyone who accesses it); however, such
cases aren't likely to be major, and trying to do something smart may
affect other use cases adversely, so it'd probably be best to just
keep it dumb and clear the ownership when the inode loses all pages (a
cgroup can disown such an inode through FADV_DONTNEED if necessary).

What do you guys think?  If making memcg track ownership at the
per-inode level, even just for the unified hierarchy, is the direction
we can take, I'll go ahead and simplify the cgroup writeback patchset.

Thanks.

--
tejun

[1] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email-tj@kernel.org

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30  4:43 [RFC] Making memcg track ownership per address_space or anon_vma Tejun Heo
@ 2015-01-30  5:55 ` Greg Thelen
  2015-01-30  6:27   ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2015-01-30  5:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
      Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
      hughd, Konstantin Khlebnikov

On Thu, Jan 29 2015, Tejun Heo wrote:

> Hello,
>
> Since the cgroup writeback patchset[1] has been posted, several
> people brought up concerns about whether the complexity of allowing
> an inode to be dirtied against multiple cgroups is necessary for the
> purpose of writeback, and it is true that a significant amount of
> complexity (note that bdi still needs to be split, so it's still not
> trivial) can be removed if we assume that an inode always belongs to
> one cgroup for the purpose of writeback.
>
> However, as mentioned before, this issue is directly linked to
> whether memcg needs to track memory ownership per-page.  If there are
> valid use cases where the pages of an inode must be tracked as owned
> by different cgroups, cgroup writeback must be able to handle that
> situation properly.  If there are no such cases, the cgroup writeback
> support can be simplified, but then we should put memcg on the same
> cadence and enforce per-inode (or per-anon_vma) ownership from the
> beginning.  The conclusion can go either way - per-page or per-inode
> - but both memcg and blkcg must be looking at the same picture.
> Deviating them is highly likely to lead to long-term issues forcing
> us to look at this again anyway, only with far more baggage.
>
> One thing to note is that the per-page tracking currently employed by
> memcg seems to have been born more out of convenience than out of
> requirements from any actual use cases.  Per-page ownership makes
> sense iff pages of an inode have to be associated with different
> cgroups - IOW, when an inode is accessed by multiple cgroups;
> however, currently, memcg assigns a page to its instantiating memcg
> and leaves it at that till the page is released.  This means that if
> a page is instantiated by one cgroup and then subsequently accessed
> only by a different cgroup, whether the page's charge gets moved to
> the cgroup which is actively using it is purely incidental.  If the
> page gets reclaimed and released at some point, it'll be moved.  If
> not, it won't.
>
> AFAICS, the only case where the current per-page accounting works
> properly is when disjoint sections of an inode are used by different
> cgroups, and the whole thing hinges on whether this use case
> justifies all the added overhead, including the page->mem_cgroup
> pointer and the extra complexity in the writeback layer.  FWIW, I'm
> doubtful.  Johannes, Michal, Greg, what do you guys think?
>
> If the above use case - a huge file being actively accessed
> disjointly by multiple cgroups - isn't significant enough and there
> aren't other use cases that I missed which can benefit from the
> per-page tracking that's currently implemented, it'd be logical to
> switch to per-inode (or per-anon_vma or per-slab) ownership tracking.
> For the short term, even just adding extra ownership information to
> those containing objects and inheriting it into page->mem_cgroup
> could work, although it'd definitely be beneficial to eventually get
> rid of page->mem_cgroup.
>
> As with per-page, when the ownership terminates is debatable w/
> per-inode tracking.  Also, supporting some form of shared accounting
> across different cgroups may be useful (e.g. a shared library's
> memory being split equally among everyone who accesses it); however,
> such cases aren't likely to be major, and trying to do something
> smart may affect other use cases adversely, so it'd probably be best
> to just keep it dumb and clear the ownership when the inode loses all
> pages (a cgroup can disown such an inode through FADV_DONTNEED if
> necessary).
>
> What do you guys think?  If making memcg track ownership at the
> per-inode level, even just for the unified hierarchy, is the
> direction we can take, I'll go ahead and simplify the cgroup
> writeback patchset.
>
> Thanks.

I find simplification appealing.  But I'm not sure it will fly, if for
no other reason than the shared accounting.  I'm ignoring intentional
sharing, used by carefully crafted apps, and just thinking about
incidental sharing (e.g. libc).

Example:

  $ mkdir small
  $ echo 1M > small/memory.limit_in_bytes
  $ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &

  $ mkdir big
  $ echo 10G > big/memory.limit_in_bytes
  $ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &

Assuming big/mlockall_database mlocks all of libc, then it will
oom-kill the small memcg because libc is owned by small due to it
having touched it first.  It'd be hard to figure out what small did
wrong to deserve the oom kill.

FWIW we've been using memcg writeback where inodes have a memcg
writeback owner.  Once multiple memcgs write to an inode, the inode
becomes writeback-shared, which makes it more likely to be written.
Once cleaned, the inode is then again able to be privately owned:
https://lkml.org/lkml/2011/8/17/200

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30  5:55 ` Greg Thelen
@ 2015-01-30  6:27   ` Tejun Heo
  2015-01-30 16:07     ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-01-30  6:27 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
      Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
      hughd, Konstantin Khlebnikov

Hello, Greg.

On Thu, Jan 29, 2015 at 09:55:53PM -0800, Greg Thelen wrote:
> I find simplification appealing.  But I'm not sure it will fly, if
> for no other reason than the shared accounting.  I'm ignoring
> intentional sharing, used by carefully crafted apps, and just
> thinking about incidental sharing (e.g. libc).
>
> Example:
>
>   $ mkdir small
>   $ echo 1M > small/memory.limit_in_bytes
>   $ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &
>
>   $ mkdir big
>   $ echo 10G > big/memory.limit_in_bytes
>   $ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &
>
> Assuming big/mlockall_database mlocks all of libc, then it will
> oom-kill the small memcg because libc is owned by small due to it
> having touched it first.  It'd be hard to figure out what small did
> wrong to deserve the oom kill.

The previous behavior was pretty unpredictable in terms of shared file
ownership too.  I wonder whether the better thing to do here is either
charging cases like this to the common ancestor or splitting the
charge equally among the accessors, which might be doable for ro
files.

> FWIW we've been using memcg writeback where inodes have a memcg
> writeback owner.  Once multiple memcgs write to an inode, the inode
> becomes writeback-shared, which makes it more likely to be written.
> Once cleaned, the inode is then again able to be privately owned:
> https://lkml.org/lkml/2011/8/17/200

The problem is that it introduces deviations between memcg and
writeback / blkcg which will mess up pressure propagation.  Writeback
pressure can't be determined without its associated memcg, and neither
can dirty balancing.  We sure can simplify things by trading off
accuracy in places, but let's please try to do that throughout the
stack, not at a midpoint, so that we can say "if you do this, it'll
behave this way and you can see it showing up there".  The thing is,
if we leave it half-way, in time some will try to actively exploit
memcg's page granularity and we'll have to deal with writeback
behavior that is difficult to even characterize.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30  6:27   ` Tejun Heo
@ 2015-01-30 16:07     ` Tejun Heo
  2015-02-02 19:26       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-01-30 16:07 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
      Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
      hughd, Konstantin Khlebnikov

Hey, again.

On Fri, Jan 30, 2015 at 01:27:37AM -0500, Tejun Heo wrote:
> The previous behavior was pretty unpredictable in terms of shared
> file ownership too.  I wonder whether the better thing to do here is
> either charging cases like this to the common ancestor or splitting
> the charge equally among the accessors, which might be doable for ro
> files.

I've been thinking more about this.  It's true that doing per-page
association allows for avoiding confronting the worst side effects of
inode sharing head-on, but it is a tradeoff with fairly weak
justifications.  The only thing we're gaining is side-stepping the
brunt of the problem in an awkward manner, and the loss of clarity in
taking this compromised position has nasty ramifications when we try
to connect it with the rest of the world.

I could be missing something major but the more I think about it, it
looks to me that the right thing to do here is accounting per-inode
and charging shared inodes to the nearest common ancestor.  The
resulting behavior would be way more logical and predictable than the
current one, which would make it straightforward to integrate memcg
with blkcg and writeback.

One of the problems that I can think of off the top of my head is that
it'd involve more regular use of charge moving; however, this is an
operation which is per-inode rather than per-page and still gonna be
fairly infrequent.  Another one is that if we move memcg over to this
behavior, it's likely to affect the behavior on the traditional
hierarchies too, as we sure as hell don't want to switch between the
two major behaviors dynamically; but given that behaviors on inode
sharing aren't very well supported yet, this can be an acceptable
change.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30 16:07     ` Tejun Heo
@ 2015-02-02 19:26       ` Konstantin Khlebnikov
  2015-02-02 19:46         ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-02 19:26 UTC (permalink / raw)
  To: Tejun Heo, Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
      Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
      hughd

On 30.01.2015 19:07, Tejun Heo wrote:
> Hey, again.
>
> On Fri, Jan 30, 2015 at 01:27:37AM -0500, Tejun Heo wrote:
>> The previous behavior was pretty unpredictable in terms of shared
>> file ownership too.  I wonder whether the better thing to do here
>> is either charging cases like this to the common ancestor or
>> splitting the charge equally among the accessors, which might be
>> doable for ro files.
>
> I've been thinking more about this.  It's true that doing per-page
> association allows for avoiding confronting the worst side effects
> of inode sharing head-on, but it is a tradeoff with fairly weak
> justifications.  The only thing we're gaining is side-stepping the
> brunt of the problem in an awkward manner, and the loss of clarity
> in taking this compromised position has nasty ramifications when we
> try to connect it with the rest of the world.
>
> I could be missing something major but the more I think about it, it
> looks to me that the right thing to do here is accounting per-inode
> and charging shared inodes to the nearest common ancestor.  The
> resulting behavior would be way more logical and predictable than
> the current one, which would make it straightforward to integrate
> memcg with blkcg and writeback.
>
> One of the problems that I can think of off the top of my head is
> that it'd involve more regular use of charge moving; however, this
> is an operation which is per-inode rather than per-page and still
> gonna be fairly infrequent.  Another one is that if we move memcg
> over to this behavior, it's likely to affect the behavior on the
> traditional hierarchies too, as we sure as hell don't want to switch
> between the two major behaviors dynamically; but given that
> behaviors on inode sharing aren't very well supported yet, this can
> be an acceptable change.
>
> Thanks.
>

Well... that might work.  A per-inode/anon_vma memcg will be much more
predictable for sure.

In some cases the memory cgroup for an inode might be assigned
statically.  For example, database files might be pinned to a special
cgroup and protected with a low limit (soft guarantee or whatever it's
called nowadays).  For overlay-fs-like containers it might be
reasonable to keep the shared template area in a separate memory
cgroup (keep the cgroup mark at the bind-mount vfsmount?).

Removing the memcg pointer from struct page might be tricky.  It's not
clear what to do with truncated pages: either link them into the lru
differently or remove them from the lru right at truncate.  Swap cache
pages have the same problem.

The process of moving inodes from memcg to memcg is more or less
doable.  Possible solution: keep two pointers to memcg at the inode,
"old" and "new".  Each page is accounted (and linked into the
corresponding lru) to one of them.  Separation into "old" and "new"
pages could be done by a flag on struct page or by a border page index
stored in the inode: pages whose index < border are accounted to the
new memcg, the rest to the old.

Keeping shared inodes in the common ancestor is reasonable.  We could
schedule asynchronous moving when somebody opens or mmaps an inode
from outside of its current cgroup.  But it's not clear when an inode
should be moved in the opposite direction: when an inode should become
private again, and how to detect that it's no longer shared.

For example, each inode could keep yet another pointer to a memcg
where it tracks the subtree of cgroups where it was accessed in the
past 5 minutes or so.  And sometimes that information goes to the
moving thread.

Actually I don't see other options except that time-based estimation:
tracking all cgroups for each inode is too expensive, and moving pages
from one lru to another is expensive too.  So moving inodes back and
forth at each access from the outside world is not an option.  That
should be a rare operation which runs in the background or in the
reclaimer.

--
Konstantin

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-02 19:26       ` Konstantin Khlebnikov
@ 2015-02-02 19:46         ` Tejun Heo
  2015-02-03 23:30           ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-02 19:46 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Greg Thelen, Johannes Weiner, Michal Hocko, cgroups, linux-mm,
      linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
      Christoph Hellwig, Li Zefan, hughd

Hey,

On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
> Removing the memcg pointer from struct page might be tricky.  It's
> not clear what to do with truncated pages: either link them into the
> lru differently or remove them from the lru right at truncate.  Swap
> cache pages have the same problem.

Hmmm... idk, maybe play another trick with the low bits of
page->mapping and make it point to the cgroup after truncation?  Do we
even care tho?  Can't we just push them to the root and forget about
them?  They are pretty transient after all, no?

> The process of moving inodes from memcg to memcg is more or less
> doable.  Possible solution: keep two pointers to memcg at the inode,
> "old" and "new".  Each page is accounted (and linked into the
> corresponding lru) to one of them.  Separation into "old" and "new"
> pages could be done by a flag on struct page or by a border page
> index stored in the inode: pages whose index < border are accounted
> to the new memcg, the rest to the old.

Yeah, pretty much the same scheme that the per-page cgroup writeback
is using with the lower bits of page->mem_cgroup should work with the
bits moved to page->flags.

> Keeping shared inodes in the common ancestor is reasonable.  We
> could schedule asynchronous moving when somebody opens or mmaps an
> inode from outside of its current cgroup.  But it's not clear when
> an inode should be moved in the opposite direction: when an inode
> should become private again, and how to detect that it's no longer
> shared.
>
> For example, each inode could keep yet another pointer to a memcg
> where it tracks the subtree of cgroups where it was accessed in the
> past 5 minutes or so.  And sometimes that information goes to the
> moving thread.
>
> Actually I don't see other options except that time-based
> estimation: tracking all cgroups for each inode is too expensive,
> and moving pages from one lru to another is expensive too.  So
> moving inodes back and forth at each access from the outside world
> is not an option.  That should be a rare operation which runs in the
> background or in the reclaimer.

Right, what strategy to use for migration is up for debate, even for
moving to the common ancestor, e.g. should we do that on the first
access?  In the other direction, it gets more interesting.  Let's say
we decide to move an inode back to a descendant: what if that triggers
an OOM condition?  Do we still go through with it and cause OOM in the
target?  Do we even want automatic moving in this direction?

For explicit cases, userland can do FADV_DONTNEED, I suppose.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-02 19:46         ` Tejun Heo
@ 2015-02-03 23:30           ` Greg Thelen
  2015-02-04 10:49             ` Konstantin Khlebnikov
  2015-02-04 17:06             ` Tejun Heo
  1 sibling, 2 replies; 31+ messages in thread
From: Greg Thelen @ 2015-02-03 23:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
      linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
      Christoph Hellwig, Li Zefan, Hugh Dickins

On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo <tj@kernel.org> wrote:
> Hey,
>
> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
>
>> Keeping shared inodes in the common ancestor is reasonable.  We
>> could schedule asynchronous moving when somebody opens or mmaps an
>> inode from outside of its current cgroup.  But it's not clear when
>> an inode should be moved in the opposite direction: when an inode
>> should become private again, and how to detect that it's no longer
>> shared.
>>
>> For example, each inode could keep yet another pointer to a memcg
>> where it tracks the subtree of cgroups where it was accessed in the
>> past 5 minutes or so.  And sometimes that information goes to the
>> moving thread.
>>
>> Actually I don't see other options except that time-based
>> estimation: tracking all cgroups for each inode is too expensive,
>> and moving pages from one lru to another is expensive too.  So
>> moving inodes back and forth at each access from the outside world
>> is not an option.  That should be a rare operation which runs in
>> the background or in the reclaimer.
>
> Right, what strategy to use for migration is up for debate, even for
> moving to the common ancestor, e.g. should we do that on the first
> access?  In the other direction, it gets more interesting.  Let's
> say we decide to move an inode back to a descendant: what if that
> triggers an OOM condition?  Do we still go through with it and cause
> OOM in the target?  Do we even want automatic moving in this
> direction?
>
> For explicit cases, userland can do FADV_DONTNEED, I suppose.
>
> Thanks.
>
> --
> tejun

I don't have any killer objections; most of my worries are isolation
concerns.

If a machine has several top-level memcgs trying to get some form of
isolation (using low, min, soft limit), then a shared libc will be
moved to the root memcg, where it's not protected from global memory
pressure.  At least with the current per-page accounting such shared
pages often land in some protected memcg.

If two cgroups collude they can use more memory than their limit and
oom the entire machine.  Admittedly the current per-page system isn't
perfect, because deleting a memcg which contains mlocked memory
(referenced by a remote memcg) moves the mlocked memory to root,
resulting in the same issue.  But I'd argue this is more likely with
the RFC because it doesn't involve cgroup deletion/reparenting.  A
possible tweak to shore up the current system is to move such mlocked
pages to the memcg of the surviving locker.  When the machine is oom
it's often nice to examine memcg state to determine which container is
using the memory.  Tracking down who's contributing to a shared
container is non-trivial.

I actually have a set of patches which add a memcg=M mount option to
memory-backed file systems.  I was planning on proposing them
regardless of this RFC, and this discussion makes them even more
appealing.  If we go in this direction, then we'd need a similar
notion for disk-based filesystems.  As Konstantin suggested, it'd be
really nice to specify charge policy on a per-file, per-directory, or
per-bind-mount basis.  This allows shared files to be
deterministically charged to a known container.  We'd need to flesh
out the policies: e.g. if two bind mounts each specify different
charge targets for the same inode, I guess we just pick one.  Though
the nature of this catch-all shared container is strange.  Presumably
a machine manager would need to create it as an unlimited container
(or at least as big as the sum of all shared files) so that any app
which decides it wants to mlock all shared files has a way to do so
without ooming the shared container.  In the current per-page approach
it's possible to lock shared libs.  But the machine manager would need
to decide how much system ram to set aside for this catch-all shared
container.

When there's large incidental sharing, then things get sticky.  A
periodic filesystem scanner (e.g. a virus scanner, or grep foo -r /)
in a small container would pull all pages to the root memcg, where
they are exposed to root pressure, which breaks isolation.  This is
concerning.  Perhaps such accesses could be decorated with
(O_NO_MOVEMEM).

So this RFC change will introduce significant change to user space
machine managers and perturb isolation.  Is the resulting system
better?  It's not clear; it's the devil known vs the devil unknown.
Maybe it'd be easier if the memcgs I'm talking about were not allowed
to share page cache (aka copy-on-read) even for files which are
jointly visible.  That would provide today's interface while avoiding
the problematic sharing.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-03 23:30 ` Greg Thelen @ 2015-02-04 10:49 ` Konstantin Khlebnikov 2015-02-04 17:15 ` Tejun Heo 2015-02-04 17:06 ` Tejun Heo 1 sibling, 1 reply; 31+ messages in thread From: Konstantin Khlebnikov @ 2015-02-04 10:49 UTC (permalink / raw) To: Greg Thelen, Tejun Heo Cc: Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin On 04.02.2015 02:30, Greg Thelen wrote: > On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo <tj@kernel.org> wrote: >> Hey, >> >> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote: >> >>> Keeping shared inodes in common ancestor is reasonable. >>> We could schedule asynchronous moving when somebody opens or mmaps >>> inode from outside of its current cgroup. But it's not clear when >>> inode should be moved into opposite direction: when inode should >>> become private and how detect if it's no longer shared. >>> >>> For example each inode could keep yet another pointer to memcg where >>> it will track subtree of cgroups where it was accessed in past 5 >>> minutes or so. And sometimes that informations goes into moving thread. >>> >>> Actually I don't see other options except that time-based estimation: >>> tracking all cgroups for each inode is too expensive, moving pages >>> from one lru to another is expensive too. So, moving inodes back and >>> forth at each access from the outside world is not an option. >>> That should be rare operation which runs in background or in reclaimer. >> >> Right, what strategy to use for migration is up for debate, even for >> moving to the common ancestor. e.g. should we do that on the first >> access? In the other direction, it get more interesting. Let's say >> if we decide to move back an inode to a descendant, what if that >> triggers OOM condition? Do we still go through it and cause OOM in >> the target? 
Do we even want automatic moving in this direction? >> >> For explicit cases, userland can do FADV_DONTNEED, I suppose. >> >> Thanks. >> >> -- >> tejun > > I don't have any killer objections, most of my worries are isolation concerns. > > If a machine has several top level memcg trying to get some form of > isolation (using low, min, soft limit) then a shared libc will be > moved to the root memcg where it's not protected from global memory > pressure. At least with the current per page accounting such shared > pages often land into some protected memcg. > > If two cgroups collude they can use more memory than their limit and > oom the entire machine. Admittedly the current per-page system isn't > perfect because deleting a memcg which contains mlocked memory > (referenced by a remote memcg) moves the mlocked memory to root > resulting in the same issue. But I'd argue this is more likely with > the RFC because it doesn't involve the cgroup deletion/reparenting. A > possible tweak to shore up the current system is to move such mlocked > pages to the memcg of the surviving locker. When the machine is oom > it's often nice to examine memcg state to determine which container is > using the memory. Tracking down who's contributing to a shared > container is non-trivial. > > I actually have a set of patches which add a memcg=M mount option to > memory backed file systems. I was planning on proposing them, > regardless of this RFC, and this discussion makes them even more > appealing. If we go in this direction, then we'd need a similar > notion for disk based filesystems. As Konstantin suggested, it'd be > really nice to specify charge policy on a per file, or directory, or > bind mount basis. This allows shared files to be deterministically > charged to a known container. We'd need to flesh out the policies: > e.g. if two bind mound each specify different charge targets for the > same inode, I guess we just pick one. 
Though the nature of this > catch-all shared container is strange. Presumably a machine manager > would need to create it as an unlimited container (or at least as big > as the sum of all shared files) so that any app which decided it wants > to mlock all shared files has a way to without ooming the shared > container. In the current per-page approach it's possible to lock > shared libs. But the machine manager would need to decide how much > system ram to set aside for this catch-all shared container. > > When there's large incidental sharing, then things get sticky. A > periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in > a small container would pull all pages to the root memcg where they > are exposed to root pressure which breaks isolation. This is > concerning. Perhaps the such accesses could be decorated with > (O_NO_MOVEMEM). > > So this RFC change will introduce significant change to user space > machine managers and perturb isolation. Is the resulting system > better? It's not clear, it's the devil know vs devil unknown. Maybe > it'd be easier if the memcg's I'm talking about were not allowed to > share page cache (aka copy-on-read) even for files which are jointly > visible. That would provide today's interface while avoiding the > problematic sharing. > I think important shared data must be handled and protected explicitly. That 'catch-all' shared container could be separated into several memory cgroups depending on importance of files: glibc protected with soft guarantee, less important stuff is placed into another cgroup and cannot push top-priority libraries out of ram. If shared files are free for use then that 'shared' container must be ready to keep them in memory. Otherwise this need to be fixed at the container side: we could ignore mlock for shared inodes or amount of such vmas might be limited in per-container basis. 
But sharing responsibility for a shared file is a vague concept: the memory usage and limit of a container must depend only on its own behavior, not on its neighbors on the same machine. Generally, incidental sharing could be handled as temporary sharing: the default policy (if the inode isn't pinned to a memory cgroup) should detect after some time that the inode is no longer shared and migrate it back into the original cgroup. Of course a task could provide a hint: O_NO_MOVEMEM, or the memory cgroup where it runs could be marked as a "scanner" which shouldn't disturb memory classification. BTW, the same algorithm which determines who has used an inode recently could tell who has used a shared inode even if it's pinned to the shared container. Another option which could fix false sharing after scanning is FADV_NOREUSE, which would keep the page-cache pages used for reading and writing via this file descriptor off the LRU and remove them from the inode when the file descriptor closes. Something like a private per-struct-file page cache. Probably somebody has already tried that? I've missed an obvious solution for controlling the memory cgroup for files: project id. This is a persistent integer id stored in the file system. For now it's implemented only for xfs and used for quota, which is orthogonal to user/group quotas. We could map some project ids to memory cgroups. That is more flexible than a per-superblock mark and has no conflicts like a mark on a bind mount. -- Konstantin ^ permalink raw reply [flat|nested] 31+ messages in thread
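Konstantin's "temporary sharing" idea above (watch who touches an inode and migrate it back once it is no longer shared) can be sketched as a small model. This is an editor's illustration in Python, not code from the thread; the class name, the time window, and the "scanner" flag are all assumptions:

```python
class InodeOwnerTracker:
    """Toy model of the 'temporary sharing' policy described above:
    remember which cgroups touched an inode recently; once only one
    cgroup remains active within the window, the inode migrates back
    to that cgroup instead of staying in the shared/parent bucket."""

    def __init__(self, window=60.0):
        self.window = window   # seconds of history considered "recent"
        self.last_access = {}  # cgroup name -> timestamp of last touch

    def record_access(self, cgroup, now, scanner=False):
        # A cgroup marked as a "scanner" (Konstantin's idea) is ignored
        # so that a one-off full-filesystem scan cannot reclassify inodes.
        if not scanner:
            self.last_access[cgroup] = now

    def current_owner(self, now):
        recent = [cg for cg, t in self.last_access.items()
                  if now - t <= self.window]
        if len(recent) == 1:
            return recent[0]   # no longer shared: migrate back here
        return "shared" if recent else None
```

A one-off scan by a cgroup flagged as a scanner never disturbs the classification, which matches the behavior Konstantin asks for.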
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-04 10:49 ` Konstantin Khlebnikov @ 2015-02-04 17:15 ` Tejun Heo 2015-02-04 17:58 ` Konstantin Khlebnikov 0 siblings, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-04 17:15 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Greg Thelen, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin Hello, On Wed, Feb 04, 2015 at 01:49:08PM +0300, Konstantin Khlebnikov wrote: > I think important shared data must be handled and protected explicitly. > That 'catch-all' shared container could be separated into several I kinda disagree. That'd be a major pain in the ass to use and you wouldn't know when you got something wrong unless it actually goes wrong and you know enough about the inner workings to look for that. Doesn't sound like a sound design to me. > memory cgroups depending on importance of files: glibc protected > with soft guarantee, less important stuff is placed into another > cgroup and cannot push top-priority libraries out of ram. That sounds extremely painful. > If shared files are free for use then that 'shared' container must be > ready to keep them in memory. Otherwise this need to be fixed at the > container side: we could ignore mlock for shared inodes or amount of > such vmas might be limited in per-container basis. > > But sharing responsibility for shared file is vague concept: memory > usage and limit of container must depends only on its own behavior not > on neighbors at the same machine. > > > Generally incidental sharing could be handled as temporary sharing: > default policy (if inode isn't pinned to memory cgroup) after some > time should detect that inode is no longer shared and migrate it into > original cgroup. 
Of course task could provide hit: O_NO_MOVEMEM or > even while memory cgroup where it runs could be marked as "scanner" > which shouldn't disturb memory classification. Ditto for annotating each file individually. Let's please try to stay away from things like that. That's mostly a cop-out which is unlikely to actually benefit the majority of users. > I've missed obvious solution for controlling memory cgroup for files: > project id. This persistent integer id stored in file system. For now > it's implemented only for xfs and used for quota which is orthogonal > to user/group quotas. We could map some of project id to memory cgroup. > That is more flexible than per-superblock mark, has no conflicts like > mark on bind-mount. Again, hell, no. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-04 17:15 ` Tejun Heo @ 2015-02-04 17:58 ` Konstantin Khlebnikov 2015-02-04 18:28 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Konstantin Khlebnikov @ 2015-02-04 17:58 UTC (permalink / raw) To: Tejun Heo Cc: Greg Thelen, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin On 04.02.2015 20:15, Tejun Heo wrote: > Hello, > > On Wed, Feb 04, 2015 at 01:49:08PM +0300, Konstantin Khlebnikov wrote: >> I think important shared data must be handled and protected explicitly. >> That 'catch-all' shared container could be separated into several > > I kinda disagree. That'd be a major pain in the ass to use and you > wouldn't know when you got something wrong unless it actually goes > wrong and you know enough about the innerworkings to look for that. > Doesn't sound like a sound design to me. > >> memory cgroups depending on importance of files: glibc protected >> with soft guarantee, less important stuff is placed into another >> cgroup and cannot push top-priority libraries out of ram. > > That sounds extremely painful. I mean this thing _could_ be controlled more precisely. Even if the default policy works for 99% of users, a manual override is still required for the other 1% or when something goes wrong. > >> If shared files are free for use then that 'shared' container must be >> ready to keep them in memory. Otherwise this need to be fixed at the >> container side: we could ignore mlock for shared inodes or amount of >> such vmas might be limited in per-container basis. >> >> But sharing responsibility for shared file is vague concept: memory >> usage and limit of container must depends only on its own behavior not >> on neighbors at the same machine. 
>> >> >> Generally incidental sharing could be handled as temporary sharing: >> default policy (if inode isn't pinned to memory cgroup) after some >> time should detect that inode is no longer shared and migrate it into >> original cgroup. Of course task could provide hit: O_NO_MOVEMEM or >> even while memory cgroup where it runs could be marked as "scanner" >> which shouldn't disturb memory classification. > > Ditto for annotating each file individually. Let's please try to stay > away from things like that. That's mostly a cop-out which is unlikely > to actually benefit the majority of users. A process which scans all files once isn't such a rare use case. Linux still cannot handle this pattern well sometimes. > >> I've missed obvious solution for controlling memory cgroup for files: >> project id. This persistent integer id stored in file system. For now >> it's implemented only for xfs and used for quota which is orthogonal >> to user/group quotas. We could map some of project id to memory cgroup. >> That is more flexible than per-superblock mark, has no conflicts like >> mark on bind-mount. > > Again, hell, no. > > Thanks. > -- Konstantin ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-04 17:58 ` Konstantin Khlebnikov @ 2015-02-04 18:28 ` Tejun Heo 0 siblings, 0 replies; 31+ messages in thread From: Tejun Heo @ 2015-02-04 18:28 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Greg Thelen, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin On Wed, Feb 04, 2015 at 08:58:21PM +0300, Konstantin Khlebnikov wrote: > >>Generally incidental sharing could be handled as temporary sharing: > >>default policy (if inode isn't pinned to memory cgroup) after some > >>time should detect that inode is no longer shared and migrate it into > >>original cgroup. Of course task could provide hit: O_NO_MOVEMEM or > >>even while memory cgroup where it runs could be marked as "scanner" > >>which shouldn't disturb memory classification. > > > >Ditto for annotating each file individually. Let's please try to stay > >away from things like that. That's mostly a cop-out which is unlikely > >to actually benefit the majority of users. > > Process which scans all files once isn't so rare use case. > Linux still cannot handle this pattern sometimes. Yeah, sure, tagging usages with m/fadvise's is fine. We can just look at the policy and ignore them for the purpose of determining who's using the inode, but let's stay away from tagging the files on filesystem if at all possible. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
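Tejun's point above, that tagging usages with m/fadvise is acceptable, can be made concrete. The sketch below is an editor's illustration of what a well-behaved scanner might do; note, as an assumption about kernel behavior, that POSIX_FADV_NOREUSE has historically been close to a no-op on Linux, so the POSIX_FADV_DONTNEED call after reading is what actually drops the cache:

```python
import os

def scan_file_without_polluting(path, chunk=1 << 20):
    """Read a whole file the way the 'scanner' discussed above might,
    advising the kernel that the pages won't be reused, then dropping
    the page cache so the scan doesn't drag the file's pages into (or
    pin them in) the scanner's memcg. Returns the number of bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Declare up front that we won't reuse the data (historically a
        # no-op on many Linux kernels, but harmless and self-documenting).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
        total = 0
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
        # Drop the cache pages this scan populated.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return total
    finally:
        os.close(fd)
```

This is the userland side of the compromise Tejun accepts: advise per usage rather than tag files persistently on the filesystem.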
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-03 23:30 ` Greg Thelen 2015-02-04 10:49 ` Konstantin Khlebnikov @ 2015-02-04 17:06 ` Tejun Heo 2015-02-04 23:51 ` Greg Thelen 1 sibling, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-04 17:06 UTC (permalink / raw) To: Greg Thelen Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote: > If a machine has several top level memcg trying to get some form of > isolation (using low, min, soft limit) then a shared libc will be > moved to the root memcg where it's not protected from global memory > pressure. At least with the current per page accounting such shared > pages often land into some protected memcg. Yes, it becomes interesting with the low limit as the pressure direction is reversed but at the same time overcommitting low limits doesn't lead to a sane setup to begin with as it's asking for global OOMs anyway, which means that things like libc would end up competing at least fairly with other pages for global pressure and should stay in memory under most circumstances, which may or may not be sufficient. Hmm.... need to think more about it but this only becomes a problem with the root cgroup because it doesn't have min setting which is expected to be inclusive of all descendants, right? Maybe the right thing to do here is treating the inodes which get pushed to the root as a special case and we can implement a mechanism where the root is effectively borrowing from the mins of its children which doesn't have to be completely correct - e.g. just charge it against all children repeatedly and if any has min protection, put it under min protection. IOW, make it the baseload for all of them. > If two cgroups collude they can use more memory than their limit and > oom the entire machine. 
Admittedly the current per-page system isn't > perfect because deleting a memcg which contains mlocked memory > (referenced by a remote memcg) moves the mlocked memory to root > resulting in the same issue. But I'd argue this is more likely with Hmmm... why does it do that? Can you point me to where it's happening? > the RFC because it doesn't involve the cgroup deletion/reparenting. A One approach could be expanding on the aforementioned scheme and making all sharing cgroups get charged for the shared inodes they're using, which should render such collusions entirely pointless. e.g. let's say we start with the following.

A (usage=48M)
+-B (usage=16M)
\-C (usage=32M)

And let's say, C starts accessing an inode which is 8M and currently associated with B.

A (usage=48M, hosted= 8M)
+-B (usage= 8M, shared= 8M)
\-C (usage=32M, shared= 8M)

The only extra charging that we'd be doing is charging C with an extra 8M. Let's say another cgroup D gets created and uses 4M.

A (usage=56M, hosted= 8M)
+-B (usage= 8M, shared= 8M)
+-C (usage=32M, shared= 8M)
\-D (usage= 8M)

and it also accesses the inode.

A (usage=56M, hosted= 8M)
+-B (usage= 8M, shared= 8M)
+-C (usage=32M, shared= 8M)
\-D (usage= 8M, shared= 8M)

We'd need to track the shared charges separately as they should count only once in the parent but that shouldn't be too hard. The problem here is that we'd need to track which inodes are being accessed by which children, which can get painful for things like libc. Maybe we can limit it to be level-by-level - track sharing only from the immediate children and always move a shared inode at one level at a time. That would lose some ability to track the sharing beyond the immediate children but it should be enough to solve the root case and allow us to adapt to changing usage patterns over time. Given that sharing is mostly a corner case, this could be good enough. Now, if D accesses a 4M area of the inode which hasn't been accessed by others yet. 
We'd want it to look like the following.

A (usage=64M, hosted=16M)
+-B (usage= 8M, shared=16M)
+-C (usage=32M, shared=16M)
\-D (usage= 8M, shared=16M)

But charging it to B, C at the same time prolly wouldn't be particularly convenient. We can prolly just do D -> A charging and let B and C sort themselves out later. Note that such charging would still maintain the overall integrity of memory limits. The only thing which may overflow is the pseudo shared charges to keep sharing in check, and dealing with them later when B and C try to create further charges should be completely fine. Note that we can also try to split the shared charge across the users; however, charging the full amount seems like the better approach to me. We don't have any way to tell how the usage is distributed anyway. For use cases where this sort of sharing is expected, I think it's perfectly reasonable to provision the sharing children to have enough to accommodate the possible full size of the shared resource. > possible tweak to shore up the current system is to move such mlocked > pages to the memcg of the surviving locker. When the machine is oom > it's often nice to examine memcg state to determine which container is > using the memory. Tracking down who's contributing to a shared > container is non-trivial. > > I actually have a set of patches which add a memcg=M mount option to > memory backed file systems. I was planning on proposing them, > regardless of this RFC, and this discussion makes them even more > appealing. If we go in this direction, then we'd need a similar > notion for disk based filesystems. As Konstantin suggested, it'd be > really nice to specify charge policy on a per file, or directory, or > bind mount basis. This allows shared files to be deterministically I'm not too sure about that. We might add that later if absolutely justifiable but designing assuming that level of intervention from userland may not be such a good idea. 
> When there's large incidental sharing, then things get sticky. A > periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in > a small container would pull all pages to the root memcg where they > are exposed to root pressure which breaks isolation. This is > concerning. Perhaps the such accesses could be decorated with > (O_NO_MOVEMEM). If such a thing is really necessary, FADV_NOREUSE would be a better indicator; however, yes, such incidental sharing is easier to handle with a per-page scheme, as such a scanner can be limited in the number of pages it can carry throughout its operation regardless of which cgroup it's looking at. It still has the nasty corner case where random target cgroups can latch onto pages faulted in by the scanner and keep accessing them tho, so, even now, FADV_NOREUSE would be a good idea. Note that such scanning, if repeated on cgroups under high memory pressure, is *likely* to accumulate residue escaped pages and if such a management cgroup is transient, those escaped pages will accumulate over time outside any limit in a way which is unpredictable and invisible. > So this RFC change will introduce significant change to user space > machine managers and perturb isolation. Is the resulting system > better? It's not clear, it's the devil know vs devil unknown. Maybe > it'd be easier if the memcg's I'm talking about were not allowed to > share page cache (aka copy-on-read) even for files which are jointly > visible. That would provide today's interface while avoiding the > problematic sharing. Yeah, compatibility would be the stickiest part. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
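The charging scheme Tejun sketches in the message above can be replayed as a toy model. This is an editor's sketch; the Cgroup/SharedInode classes and field names mirror his ASCII diagrams, not any real kernel API:

```python
class Cgroup:
    """Node in the toy hierarchy; sizes are in MB as in the diagrams."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.usage = 0    # private charges
        self.hosted = 0   # shared inodes hosted on behalf of children
        self.shared = 0   # pseudo-charge for shared inodes this cgroup uses

class SharedInode:
    def __init__(self, size, owner):
        self.size = size
        self.owner = owner        # cgroup whose usage currently holds the inode
        self.users = {owner}

    def access(self, cg):
        """On the first access from a second cgroup, move the inode's
        charge up to the common parent as 'hosted' and give every sharer
        a 'shared' pseudo-charge (counted only once, via the parent)."""
        if cg in self.users:
            return
        if self.owner is not None:            # first sharing event
            self.owner.usage -= self.size     # uncharge the old owner...
            self.owner.shared += self.size    # ...which becomes a sharer
            self.owner.parent.hosted += self.size
            self.owner = None
        self.users.add(cg)
        cg.shared += self.size

def hierarchical_usage(cg, children):
    # The parent's usage covers its own private charges, the inodes it
    # hosts, and the children's private charges; 'shared' pseudo-charges
    # are deliberately excluded so the inode counts only once.
    return cg.usage + cg.hosted + sum(c.usage for c in children)
```

Replaying the numbers from the message: B starts with 16M including an 8M inode; after C touches it, B drops to 8M private, both B and C carry an 8M shared pseudo-charge, and A hosts 8M, matching the second diagram above.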
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-04 17:06 ` Tejun Heo @ 2015-02-04 23:51 ` Greg Thelen 2015-02-05 13:15 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Greg Thelen @ 2015-02-04 23:51 UTC (permalink / raw) To: Tejun Heo Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Wed, Feb 04 2015, Tejun Heo wrote: > Hello, > > On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote: >> If a machine has several top level memcg trying to get some form of >> isolation (using low, min, soft limit) then a shared libc will be >> moved to the root memcg where it's not protected from global memory >> pressure. At least with the current per page accounting such shared >> pages often land into some protected memcg. > > Yes, it becomes interesting with the low limit as the pressure > direction is reversed but at the same time overcommitting low limits > doesn't lead to a sane setup to begin with as it's asking for global > OOMs anyway, which means that things like libc would end up competing > at least fairly with other pages for global pressure and should stay > in memory under most circumstances, which may or may not be > sufficient. I agree. Clarification... I don't plan to overcommit low or min limits. On machines without overcommited min limits the existing system offers some protection for shared libs from global reclaim. Pushing them to root doesn't. > Hmm.... need to think more about it but this only becomes a problem > with the root cgroup because it doesn't have min setting which is > expected to be inclusive of all descendants, right? Maybe the right > thing to do here is treating the inodes which get pushed to the root > as a special case and we can implement a mechanism where the root is > effectively borrowing from the mins of its children which doesn't have > to be completely correct - e.g. 
just charge it against all children > repeatedly and if any has min protection, put it under min protection. > IOW, make it the baseload for all of them. I think the linux-next low (and the TBD min) limits also have the problem for more than just the root memcg. I'm thinking of a 2M file shared between C and D below. The file will be charged to the common parent B.

A
+-B (usage=2M lim=3M min=2M)
  +-C (usage=0 lim=2M min=1M shared_usage=2M)
  +-D (usage=0 lim=2M min=1M shared_usage=2M)
  \-E (usage=0 lim=2M min=0)

The problem arises if A/B/E allocates more than 1M of private reclaimable file data. This pushes A/B into reclaim which will reclaim both the shared file from A/B and the private file from A/B/E. In contrast, the current per-page memcg would've protected the shared file in either C or D, leaving A/B reclaim to only attack A/B/E. Pinning the shared file to either C or D, using a TBD policy such as a mount option, would solve this for tightly shared files. But for a wide-fanout file (libc) the admin would need to assign a global bucket and this would be a pain to size due to various job requirements. >> If two cgroups collude they can use more memory than their limit and >> oom the entire machine. Admittedly the current per-page system isn't >> perfect because deleting a memcg which contains mlocked memory >> (referenced by a remote memcg) moves the mlocked memory to root >> resulting in the same issue. But I'd argue this is more likely with > > Hmmm... why does it do that? Can you point me to where it's > happening? My mistake, I was thinking of older kernels which reparent memory. Though I can't say v3.19-rc7 handles this collusion any better. Instead of reparenting the mlocked memory, it's left in an invisible (offline) memcg. Unlike older kernels the memory doesn't appear in root/memory.stat[unevictable]; instead it's buried in root/memory.stat[total_unevictable], which includes mlocked memory in visible (online) and invisible (offline) children. 
>> the RFC because it doesn't involve the cgroup deletion/reparenting. A > > One approach could be expanding on the forementioned scheme and make > all sharing cgroups to get charged for the shared inodes they're > using, which should render such collusions entirely pointless. > e.g. let's say we start with the following. > > A (usage=48M) > +-B (usage=16M) > \-C (usage=32M) > > And let's say, C starts accessing an inode which is 8M and currently > associated with B. > > A (usage=48M, hosted= 8M) > +-B (usage= 8M, shared= 8M) > \-C (usage=32M, shared= 8M) > > The only extra charging that we'd be doing is charing C with extra > 8M. Let's say another cgroup D gets created and uses 4M. > > A (usage=56M, hosted= 8M) > +-B (usage= 8M, shared= 8M) > +-C (usage=32M, shared= 8M) > \-D (usage= 8M) > > and it also accesses the inode. > > A (usage=56M, hosted= 8M) > +-B (usage= 8M, shared= 8M) > +-C (usage=32M, shared= 8M) > \-D (usage= 8M, shared= 8M) > > We'd need to track the shared charges separately as they should count > only once in the parent but that shouldn't be too hard. The problem > here is that we'd need to track which inodes are being accessed by > which children, which can get painful for things like libc. Maybe we > can limit it to be level-by-level - track sharing only from the > immediate children and always move a shared inode at one level at a > time. That would lose some ability to track the sharing beyond the > immediate children but it should be enough to solve the root case and > allow us to adapt to changing usage pattern over time. Given that > sharing is mostly a corner case, this could be good enough. > > Now, if D accesses 4M area of the inode which hasn't been accessed by > others yet. We'd want it to look like the following. > > A (usage=64M, hosted=16M) > +-B (usage= 8M, shared=16M) > +-C (usage=32M, shared=16M) > \-D (usage= 8M, shared=16M) > > But charging it to B, C at the same time prolly wouldn't be > particularly convenient. 
We can prolly just do D -> A charging and > let B and C sort themselves out later. Note that such charging would > still maintain the overall integrity of memory limits. The only thing > which may overflow is the pseudo shared charges to keep sharing in > check and dealing with them later when B and C try to create further > charges should be completely fine. > > Note that we can also try to split the shared charge across the users; > however, charging the full amount seems like the better approach to > me. We don't have any way to tell how the usage is distributed > anyway. For use cases where this sort of sharing is expected, I think > it's perfectly reasonable to provision the sharing children to have > enough to accomodate the possible full size of the shared resource. > >> possible tweak to shore up the current system is to move such mlocked >> pages to the memcg of the surviving locker. When the machine is oom >> it's often nice to examine memcg state to determine which container is >> using the memory. Tracking down who's contributing to a shared >> container is non-trivial. >> >> I actually have a set of patches which add a memcg=M mount option to >> memory backed file systems. I was planning on proposing them, >> regardless of this RFC, and this discussion makes them even more >> appealing. If we go in this direction, then we'd need a similar >> notion for disk based filesystems. As Konstantin suggested, it'd be >> really nice to specify charge policy on a per file, or directory, or >> bind mount basis. This allows shared files to be deterministically > > I'm not too sure about that. We might add that later if absolutely > justifiable but designing assuming that level of intervention from > userland may not be such a good idea. > >> When there's large incidental sharing, then things get sticky. A >> periodic filesystem scanner (e.g. 
virus scanner, or grep foo -r /) in >> a small container would pull all pages to the root memcg where they >> are exposed to root pressure which breaks isolation. This is >> concerning. Perhaps the such accesses could be decorated with >> (O_NO_MOVEMEM). > > If such thing is really necessary, FADV_NOREUSE would be a better > indicator; however, yes, such incidental sharing is easier to handle > with per-page scheme as such scanner can be limited in the number of > pages it can carry throughout its operation regardless of which cgroup > it's looking at. It still has the nasty corner case where random > target cgroups can latch onto pages faulted in by the scanner and > keeping accessing them tho, so, even now, FADV_NOREUSE would be a good > idea. Note that such scanning, if repeated on cgroups under high > memory pressure, is *likely* to accumulate residue escaped pages and > if such a management cgroup is transient, those escaped pages will > accumulate over time outside any limit in a way which is unpredictable > and invisible. > >> So this RFC change will introduce significant change to user space >> machine managers and perturb isolation. Is the resulting system >> better? It's not clear, it's the devil know vs devil unknown. Maybe >> it'd be easier if the memcg's I'm talking about were not allowed to >> share page cache (aka copy-on-read) even for files which are jointly >> visible. That would provide today's interface while avoiding the >> problematic sharing. > > Yeah, compatibility would be the stickiest part. > > Thanks. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-04 23:51 ` Greg Thelen @ 2015-02-05 13:15 ` Tejun Heo 2015-02-05 22:05 ` Greg Thelen 0 siblings, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-05 13:15 UTC (permalink / raw) To: Greg Thelen Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, Greg. On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote: > I think the linux-next low (and the TBD min) limits also have the > problem for more than just the root memcg. I'm thinking of a 2M file > shared between C and D below. The file will be charged to common parent > B. > > A > +-B (usage=2M lim=3M min=2M) > +-C (usage=0 lim=2M min=1M shared_usage=2M) > +-D (usage=0 lim=2M min=1M shared_usage=2M) > \-E (usage=0 lim=2M min=0) > > The problem arises if A/B/E allocates more than 1M of private > reclaimable file data. This pushes A/B into reclaim which will reclaim > both the shared file from A/B and private file from A/B/E. In contrast, > the current per-page memcg would've protected the shared file in either > C or D leaving A/B reclaim to only attack A/B/E. > > Pinning the shared file to either C or D, using TBD policy such as mount > option, would solve this for tightly shared files. But for wide fanout > file (libc) the admin would need to assign a global bucket and this > would be a pain to size due to various job requirements. Shouldn't we be able to handle it the same way as I proposed for handling sharing? 
The above would look like

A
+-B (usage=2M lim=3M min=2M hosted_usage=2M)
  +-C (usage=0 lim=2M min=1M shared_usage=2M)
  +-D (usage=0 lim=2M min=1M shared_usage=2M)
  \-E (usage=0 lim=2M min=0)

Now, we don't wanna use B's min verbatim on the hosted inodes shared by children but we're unconditionally charging the shared amount to all sharing children, which means that we're eating into the min settings of all participating children, so, we should be able to use the sum of all sharing children's min-covered amounts as the inode's min, which of course is to be contained inside the min of the parent. Above, we're charging 2M to C and D, each of which has 1M min which is being consumed by the shared charge (the shared part won't get reclaimed from the internal pressure of children, so we're really taking that part away from it). Summing them up, the shared inode would have 2M protection which is honored as long as B as a whole is under its 3M limit. This is similar to creating a dedicated child for each shared resource for low limits. The downside is that we end up guarding the shared inodes more than non-shared ones, but, after all, we're charging it to everybody who's using it. Would something like this work? Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
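The min-summing rule Tejun proposes here can be written down directly. Editor's sketch; the helper name and its argument shape are hypothetical:

```python
def hosted_inode_min(parent_min, sharers):
    """Protection for an inode hosted in the parent, per the proposal
    above: each sharing child contributes the part of its own 'min'
    consumed by the shared charge, and the total is capped by the
    parent's min. `sharers` is a list of (child_min, shared_charge)
    pairs, sizes in MB."""
    covered = sum(min(child_min, shared) for child_min, shared in sharers)
    return min(parent_min, covered)
```

With the example hierarchy (B min=2M hosting a 2M inode shared by C and D, each with min=1M and a 2M shared charge), the inode gets min(2, 1+1) = 2M of protection, as the message concludes.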
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-05 13:15 ` Tejun Heo @ 2015-02-05 22:05 ` Greg Thelen 2015-02-05 22:25 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Greg Thelen @ 2015-02-05 22:05 UTC (permalink / raw) To: Tejun Heo Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Thu, Feb 05 2015, Tejun Heo wrote: > Hello, Greg. > > On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote: >> I think the linux-next low (and the TBD min) limits also have the >> problem for more than just the root memcg. I'm thinking of a 2M file >> shared between C and D below. The file will be charged to common parent >> B. >> >> A >> +-B (usage=2M lim=3M min=2M) >> +-C (usage=0 lim=2M min=1M shared_usage=2M) >> +-D (usage=0 lim=2M min=1M shared_usage=2M) >> \-E (usage=0 lim=2M min=0) >> >> The problem arises if A/B/E allocates more than 1M of private >> reclaimable file data. This pushes A/B into reclaim which will reclaim >> both the shared file from A/B and private file from A/B/E. In contrast, >> the current per-page memcg would've protected the shared file in either >> C or D leaving A/B reclaim to only attack A/B/E. >> >> Pinning the shared file to either C or D, using TBD policy such as mount >> option, would solve this for tightly shared files. But for wide fanout >> file (libc) the admin would need to assign a global bucket and this >> would be a pain to size due to various job requirements. > > Shouldn't we be able to handle it the same way as I proposed for > handling sharing? 
The above would look like > > A > +-B (usage=2M lim=3M min=2M hosted_usage=2M) > +-C (usage=0 lim=2M min=1M shared_usage=2M) > +-D (usage=0 lim=2M min=1M shared_usage=2M) > \-E (usage=0 lim=2M min=0) > > Now, we don't wanna use B's min verbatim on the hosted inodes shared > by children but we're unconditionally charging the shared amount to > all sharing children, which means that we're eating into the min > settings of all participating children, so, we should be able to use > sum of all sharing children's min-covered amount as the inode's min, > which of course is to be contained inside the min of the parent. > > Above, we're charging 2M to C and D, each of which has 1M min which is > being consumed by the shared charge (the shared part won't get > reclaimed from the internal pressure of children, so we're really > taking that part away from it). Summing them up, the shared inode > would have 2M protection which is honored as long as B as a whole is > under its 3M limit. This is similar to creating a dedicated child for > each shared resource for low limits. The downside is that we end up > guarding the shared inodes more than non-shared ones, but, after all, > we're charging it to everybody who's using it. > > Would something like this work? Maybe, but I want to understand more about how pressure works in the child. As C (or D) allocates non shared memory does it perform reclaim to ensure that its (C.usage + C.shared_usage < C.lim). Given C's shared_usage is linked into B.LRU it wouldn't be naturally reclaimable by C. Are you thinking that charge failures on cgroups with non zero shared_usage would, as needed, induce reclaim of parent's hosted_usage? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-05 22:05 ` Greg Thelen @ 2015-02-05 22:25 ` Tejun Heo 2015-02-06 0:03 ` Greg Thelen 0 siblings, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-05 22:25 UTC (permalink / raw) To: Greg Thelen Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hey, On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote: > > A > > +-B (usage=2M lim=3M min=2M hosted_usage=2M) > > +-C (usage=0 lim=2M min=1M shared_usage=2M) > > +-D (usage=0 lim=2M min=1M shared_usage=2M) > > \-E (usage=0 lim=2M min=0) ... > Maybe, but I want to understand more about how pressure works in the > child. As C (or D) allocates non shared memory does it perform reclaim > to ensure that its (C.usage + C.shared_usage < C.lim). Given C's Yes. > shared_usage is linked into B.LRU it wouldn't be naturally reclaimable > by C. Are you thinking that charge failures on cgroups with non zero > shared_usage would, as needed, induce reclaim of parent's hosted_usage? Hmmm.... I'm not really sure but why not? If we properly account for the low protection when pushing inodes to the parent, I don't think it'd break anything. IOW, allow the amount beyond the sum of low limits to be reclaimed when one of the sharers is under pressure. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-05 22:25 ` Tejun Heo @ 2015-02-06 0:03 ` Greg Thelen 2015-02-06 14:17 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Greg Thelen @ 2015-02-06 0:03 UTC (permalink / raw) To: Tejun Heo Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Thu, Feb 05 2015, Tejun Heo wrote: > Hey, > > On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote: >> > A >> > +-B (usage=2M lim=3M min=2M hosted_usage=2M) >> > +-C (usage=0 lim=2M min=1M shared_usage=2M) >> > +-D (usage=0 lim=2M min=1M shared_usage=2M) >> > \-E (usage=0 lim=2M min=0) > ... >> Maybe, but I want to understand more about how pressure works in the >> child. As C (or D) allocates non shared memory does it perform reclaim >> to ensure that its (C.usage + C.shared_usage < C.lim). Given C's > > Yes. > >> shared_usage is linked into B.LRU it wouldn't be naturally reclaimable >> by C. Are you thinking that charge failures on cgroups with non zero >> shared_usage would, as needed, induce reclaim of parent's hosted_usage? > > Hmmm.... I'm not really sure but why not? If we properly account for > the low protection when pushing inodes to the parent, I don't think > it'd break anything. IOW, allow the amount beyond the sum of low > limits to be reclaimed when one of the sharers is under pressure. > > Thanks. I'm not saying that it'd break anything. I think it's required that children perform reclaim on shared data hosted in the parent. The child is limited by shared_usage, so it needs the ability to reclaim it. So I think we're in agreement. The child will reclaim the parent's hosted_usage when the child is charged for shared_usage. Ideally the only parental memory reclaimed in this situation would be shared. 
But I think (though I can't claim to have followed the new memcg philosophy discussions) that internal nodes in the cgroup tree (i.e. parents) do not have any resources charged directly to them. All resources are charged to leaf cgroups, which linger until the resources are uncharged. Thus the LRUs of a parent will only contain hosted (shared) memory. This thankfully keeps parental reclaim focused on shared pages. Child pressure will, unfortunately, reclaim shared pages used by any container. But if shared pages were charged to all sharing containers, then reclaiming them will at least help relieve pressure in the child triggering the reclaim. So this is a system which charges all cgroups using a shared inode (recharge on read) for all resident pages of that shared inode. There's only one copy of the page in memory on just one LRU, but the page may be charged to multiple containers' (shared_)usage. Perhaps I missed it, but what happens when a child's limit is insufficient to accept all pages shared by its siblings? Example starting with 2M cached of a shared file: A +-B (usage=2M lim=3M hosted_usage=2M) +-C (usage=0 lim=2M shared_usage=2M) +-D (usage=0 lim=2M shared_usage=2M) \-E (usage=0 lim=1M shared_usage=0) If E faults in a new 4K page within the shared file, then E is a sharing participant so it'd be charged the 2M+4K, which pushes E over its limit. ^ permalink raw reply [flat|nested] 31+ messages in thread
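Greg's example — E with a 1M limit joining the sharing of a 2M-resident file — reduces to simple arithmetic under the "charge every sharer for the whole resident set" rule. A minimal sketch, with an invented helper name (`join_sharing`), not kernel code:

```python
# Toy illustration of "recharge on read": when a cgroup first touches any
# page of a shared inode, its shared_usage jumps by the inode's full
# resident size, not just the page it faulted.

def join_sharing(child_usage, child_limit, resident_pages, fault_pages=1):
    """Return (new_usage, over_limit) when a cgroup faults fault_pages of a
    shared inode that already has resident_pages cached elsewhere."""
    new_usage = child_usage + resident_pages + fault_pages
    return new_usage, new_usage > child_limit

# Greg's example in 4K pages: 512 pages (2M) resident, E's limit is 256 (1M).
usage, over = join_sharing(child_usage=0, child_limit=256, resident_pages=512)
assert over  # a single 4K fault charges E the full 2M+4K, blowing its limit
```

This is exactly the situation Tejun answers with "OOM?" in the next message: a cgroup that cannot match the other sharers' footprint arguably shouldn't be participating in the sharing at all.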
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-06 0:03 ` Greg Thelen @ 2015-02-06 14:17 ` Tejun Heo 2015-02-06 23:43 ` Greg Thelen 0 siblings, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-06 14:17 UTC (permalink / raw) To: Greg Thelen Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, Greg. On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote: > So this is a system which charges all cgroups using a shared inode > (recharge on read) for all resident pages of that shared inode. There's > only one copy of the page in memory on just one LRU, but the page may be > charged to multiple container's (shared_)usage. Yeap. > Perhaps I missed it, but what happens when a child's limit is > insufficient to accept all pages shared by its siblings? Example > starting with 2M cached of a shared file: > > A > +-B (usage=2M lim=3M hosted_usage=2M) > +-C (usage=0 lim=2M shared_usage=2M) > +-D (usage=0 lim=2M shared_usage=2M) > \-E (usage=0 lim=1M shared_usage=0) > > If E faults in a new 4K page within the shared file, then E is a sharing > participant so it'd be charged the 2M+4K, which pushes E over it's > limit. OOM? It shouldn't be participating in sharing of an inode if it can't match others' protection on the inode, I think. What we're doing now w/ page based charging is kinda unfair because in the situations like above the one under pressure can end up siphoning off of the larger cgroups' protection if they actually use overlapping areas; however, for disjoint areas, per-page charging would behave correctly. So, this part comes down to the same question - whether multiple cgroups accessing disjoint areas of a single inode is an important enough use case. If we say yes to that, we better make writeback support that too. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-06 14:17 ` Tejun Heo @ 2015-02-06 23:43 ` Greg Thelen 2015-02-07 14:38 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Greg Thelen @ 2015-02-06 23:43 UTC (permalink / raw) To: Tejun Heo Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Fri, Feb 6, 2015 at 6:17 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, Greg. > > On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote: >> So this is a system which charges all cgroups using a shared inode >> (recharge on read) for all resident pages of that shared inode. There's >> only one copy of the page in memory on just one LRU, but the page may be >> charged to multiple container's (shared_)usage. > > Yeap. > >> Perhaps I missed it, but what happens when a child's limit is >> insufficient to accept all pages shared by its siblings? Example >> starting with 2M cached of a shared file: >> >> A >> +-B (usage=2M lim=3M hosted_usage=2M) >> +-C (usage=0 lim=2M shared_usage=2M) >> +-D (usage=0 lim=2M shared_usage=2M) >> \-E (usage=0 lim=1M shared_usage=0) >> >> If E faults in a new 4K page within the shared file, then E is a sharing >> participant so it'd be charged the 2M+4K, which pushes E over it's >> limit. > > OOM? It shouldn't be participating in sharing of an inode if it can't > match others' protection on the inode, I think. What we're doing now > w/ page based charging is kinda unfair because in the situations like > above the one under pressure can end up siphoning off of the larger > cgroups' protection if they actually use overlapping areas; however, > for disjoint areas, per-page charging would behave correctly. > > So, this part comes down to the same question - whether multiple > cgroups accessing disjoint areas of a single inode is an important > enough use case. 
If we say yes to that, we better make writeback > support that too. If cgroups are about isolation then writing to shared files should be rare, so I'm willing to say that we don't need to handle shared writers well. Shared readers seem like a more valuable use case (thin provisioning). I'm getting overwhelmed with the thought exercise of automatically moving inodes to common ancestors and back charging the sharers for shared_usage. I haven't wrapped my head around how these shared data pages will get protected. It seems like they'd no longer be protected by child min watermarks. So I know this thread opened with the claim "both memcg and blkcg must be looking at the same picture. Deviating them is highly likely to lead to long-term issues forcing us to look at this again anyway, only with far more baggage." But I'm still wondering if the following is simpler: (1) leave memcg as a per-page controller. (2) maintain a per-inode i_memcg which is set to the common dirtying ancestor. If not shared then it'll point to the memcg that the page was charged to. (3) when memcg dirtying page pressure is seen, walk up the cgroup tree writing dirty inodes; this will write shared inodes using the blkcg priority of the respective levels. (4) background limit wb_check_background_flush() and time-based wb_check_old_data_flush() can feel free to attack shared inodes to hopefully restore them to non-shared state. For non-shared inodes, this should behave the same. For shared inodes it should only affect those in the hierarchy which is sharing. ^ permalink raw reply [flat|nested] 31+ messages in thread
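Point (2) of Greg's alternative — a per-inode i_memcg set to the common dirtying ancestor — amounts to a lowest-common-ancestor computation over the cgroup tree. A hypothetical sketch (the `Cgroup` class and function names are invented for illustration; the kernel's data structures look nothing like this):

```python
# Sketch of tracking i_memcg as the closest common ancestor of every memcg
# that has dirtied the inode. Unshared inodes keep the sole dirtier.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

    def path(self):
        """Root-first list of ancestors including self."""
        node, out = self, []
        while node:
            out.append(node)
            node = node.parent
        return out[::-1]

def common_dirtying_ancestor(a, b):
    """Deepest cgroup that is an ancestor of (or equal to) both a and b."""
    common = None
    for x, y in zip(a.path(), b.path()):
        if x is y:
            common = x
    return common

root = Cgroup("/")
A = Cgroup("A", root)
B = Cgroup("B", A)
C = Cgroup("C", B)
D = Cgroup("D", B)

i_memcg = C                                     # first dirtier: not shared
i_memcg = common_dirtying_ancestor(i_memcg, D)  # D dirties it too
assert i_memcg is B                             # shared inode floats up to B
```

Point (3) then follows naturally: pressure in C walks up through B, reaching inodes whose i_memcg floated to the common ancestor.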
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-06 23:43 ` Greg Thelen @ 2015-02-07 14:38 ` Tejun Heo 2015-02-11 2:19 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-07 14:38 UTC (permalink / raw) To: Greg Thelen Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, Greg. On Fri, Feb 06, 2015 at 03:43:11PM -0800, Greg Thelen wrote: > If cgroups are about isolation then writing to shared files should be > rare, so I'm willing to say that we don't need to handle shared > writers well. Shared readers seem like a more valuable use cases > (thin provisioning). I'm getting overwhelmed with the thought > exercise of automatically moving inodes to common ancestors and back > charging the sharers for shared_usage. I haven't wrapped my head > around how these shared data pages will get protected. It seems like > they'd no longer be protected by child min watermarks. Yes, this is challenging. My current thought is around taking the maximum of the low settings of the sharing children, but I need to think more about it. One problem is that the shared inodes will preemptively take away the amount shared from the children's low protection. They won't compete fairly with other inodes or anons but they can't really as they don't really belong to any single sharer. > So I know this thread opened with the claim "both memcg and blkcg must > be looking at the same picture. Deviating them is highly likely to > lead to long-term issues forcing us to look at this again anyway, only > with far more baggage." But I'm still wondering if the following is > simpler: > (1) leave memcg as a per page controller. > (2) maintain a per inode i_memcg which is set to the common dirtying > ancestor. If not shared then it'll point to the memcg that the page > was charged to. 
> (3) when memcg dirtying page pressure is seen, walk up the cgroup tree > writing dirty inodes, this will write shared inodes using blkcg > priority of the respective levels. > (4) background limit wb_check_background_flush() and time based > wb_check_old_data_flush() can feel free to attack shared inodes to > hopefully restore them to non-shared state. > For non-shared inodes, this should behave the same. For shared inodes > it should only affect those in the hierarchy which is sharing. The thing which breaks when you de-couple what memcg sees from the rest of the stack is that the amount of memory which may be available to a given cgroup and how much of that is dirty is the main linkage propagating IO pressure to actual dirtying tasks. If you decouple the two worldviews, you lose the ability to propagate IO pressure to dirtiers in a controlled manner and that's why anything inside a memcg currently is always triggering the direct reclaim path instead of being properly dirty throttled. You can argue that an inode being actively dirtied from multiple cgroups is a rare case which we can sweep under the rug and that *might* be the case but I have a nagging feeling that that would be a decision which is made merely out of immediate convenience and would much prefer having a well-defined model of sharing inodes and anons across cgroups so that the behaviors shown in those cases aren't mere accidental consequences without any innate meaning. If we can argue that memcg and blkcg having different views is meaningful and characterize and justify the behaviors stemming from the deviation, sure, that'd be fine, but I don't think we have that as of now. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-07 14:38 ` Tejun Heo @ 2015-02-11 2:19 ` Tejun Heo 2015-02-11 7:32 ` Jan Kara 2015-02-11 18:28 ` Greg Thelen 0 siblings, 2 replies; 31+ messages in thread From: Tejun Heo @ 2015-02-11 2:19 UTC (permalink / raw) To: Greg Thelen Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, again. On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote: > If we can argue that memcg and blkcg having different views is > meaningful and characterize and justify the behaviors stemming from > the deviation, sure, that'd be fine, but I don't think we have that as > of now. If we assume that memcg and blkcg having different views is something which represents an acceptable compromise considering the use cases and implementation convenience - IOW, if we assume that read-sharing is something which can happen regularly while write sharing is a corner case and that while not completely correct the existing self-corrective behavior from tracking ownership per-page at the point of instantiation is good enough (as a memcg under pressure is likely to give up shared pages to be re-instantiated by another sharer w/ more budget), we need to do the impedance matching between memcg and blkcg at the writeback layer. The main issue there is that the last chain of IO pressure propagation is realized by making individual dirtying tasks converge on a common target dirty ratio point, which naturally depends on those tasks seeing the same picture in terms of the current write bandwidth, the available memory, and how much of it is dirty. Tasks dirtying pages belonging to the same memcg while some of them are mostly being written out by a different blkcg would wreck the mechanism. 
It won't be difficult for one subset to make the other consider itself under severe IO pressure when there actually isn't any in that group, possibly stalling and starving those tasks unduly. At a more basic level, it's just wrong for one group to be writing out a significant amount for another. These issues can persist indefinitely if we follow the same instantiator-owns rule for inode writebacks. Even if we reset the ownership when an inode becomes clean, it wouldn't work as it can be dirtied over and over again while under writeback, and when things like this happen, the behavior may become extremely difficult to understand or characterize. We don't have visibility into how individual pages of an inode get distributed across multiple cgroups, who's currently responsible for writing back a specific inode or how the dirty ratio mechanism is behaving in the face of the unexpected combination of parameters. Even if we assume that write sharing is a fringe case, we need something better than a first-whatever rule when choosing which blkcg is responsible for writing a shared inode out. There needs to be a constant corrective pressure so that incidental and temporary sharings don't end up screwing up the mechanism for an extended period of time. Greg mentioned choosing the closest ancestor of the sharers, which basically pushes inode sharing policy implementation down to writeback from memcg. This could work but we end up with the same collusion problem as when this is used for memcg and it's even more difficult to solve this at the writeback layer - we'd have to communicate the shared state all the way down to the block layer and then implement a mechanism there to take corrective measures and even after that we're likely to end up with a prolonged state where dirty ratio propagation is essentially broken as the dirtier and writer would be seeing different pictures. So, based on the assumption that write sharings are mostly incidental and temporary (ie. 
we're basically declaring that we don't support persistent write sharing), how about something like the following? 1. memcg continues per-page tracking. 2. Each inode is associated with a single blkcg at a given time and written out by that blkcg. 3. While writing back, if the number of pages from foreign memcgs is higher than a certain ratio of total written pages, the inode is marked as disowned and the writeback instance is optionally terminated early. e.g. if the ratio of foreign pages is over 50% after writing out the number of pages matching 5s worth of write bandwidth for the bdi, mark the inode as disowned. 4. On the following dirtying of the inode, the inode is associated with the matching blkcg of the dirtied page. Note that this could be the next cycle as the inode could already have been marked dirty by the time the above condition triggered. In that case, the following writeback would be terminated early too. This should provide sufficient corrective pressure so that incidental and temporary sharing of an inode doesn't become a persistent issue while keeping the complexity necessary for implementing such pressure fairly minimal and self-contained. Also, the changes necessary for individual filesystems would be minimal. I think this should work well enough as long as the aforementioned assumptions are true - IOW, if we maintain that write sharing is unsupported. What do you think? Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
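The four-point disowning scheme above can be sketched as a toy model. The 50% foreign ratio and the 5s bandwidth window are the example numbers from the proposal itself; everything else (the inode dict, function names, page representation) is invented for illustration and is not kernel code:

```python
# Sketch of steps 2-4: one owning blkcg per inode; during writeback, once
# a bandwidth-window's worth of pages has been written, disown the inode
# if too many of them belonged to foreign memcgs.

FOREIGN_RATIO = 0.5  # proposal's example threshold

def writeback_inode(inode, pages, bdi_write_bw_pps, window_secs=5):
    """Write pages out; return True if the inode got disowned. pages is a
    list of owning-memcg ids; inode['owner'] is the memcg whose blkcg
    currently writes this inode back."""
    check_after = bdi_write_bw_pps * window_secs  # 5s worth of bandwidth
    written = foreign = 0
    for memcg in pages:
        written += 1
        if memcg != inode["owner"]:
            foreign += 1
        if written >= check_after and foreign / written > FOREIGN_RATIO:
            inode["disowned"] = True  # step 3
            return True               # optionally terminate early
    return False

def redirty(inode, memcg):
    """Step 4: on the next dirtying, a disowned inode follows the memcg of
    the dirtied page."""
    if inode.get("disowned"):
        inode["owner"] = memcg
        inode["disowned"] = False

ino = {"owner": "C", "disowned": False}
# 80 of 100 written pages are charged to D: the inode gets disowned...
assert writeback_inode(ino, ["D"] * 80 + ["C"] * 20, bdi_write_bw_pps=10)
redirty(ino, "D")  # ...and the next dirtier, D, takes ownership
assert ino["owner"] == "D"
```

The corrective pressure is visible in the model: a purely incidental foreign page or two never trips the ratio, while persistent foreign dirtying flips ownership within one bandwidth window.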
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 2:19 ` Tejun Heo @ 2015-02-11 7:32 ` Jan Kara 2015-02-11 18:28 ` Greg Thelen 1 sibling, 0 replies; 31+ messages in thread From: Jan Kara @ 2015-02-11 7:32 UTC (permalink / raw) To: Tejun Heo Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello Tejun, On Tue 10-02-15 21:19:06, Tejun Heo wrote: > On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote: > > If we can argue that memcg and blkcg having different views is > > meaningful and characterize and justify the behaviors stemming from > > the deviation, sure, that'd be fine, but I don't think we have that as > > of now. ... > So, based on the assumption that write sharings are mostly incidental > and temporary (ie. we're basically declaring that we don't support > persistent write sharing), how about something like the following? > > 1. memcg contiues per-page tracking. > > 2. Each inode is associated with a single blkcg at a given time and > written out by that blkcg. > > 3. While writing back, if the number of pages from foreign memcg's is > higher than certain ratio of total written pages, the inode is > marked as disowned and the writeback instance is optionally > terminated early. e.g. if the ratio of foreign pages is over 50% > after writing out the number of pages matching 5s worth of write > bandwidth for the bdi, mark the inode as disowned. > > 4. On the following dirtying of the inode, the inode is associated > with the matching blkcg of the dirtied page. Note that this could > be the next cycle as the inode could already have been marked dirty > by the time the above condition triggered. In that case, the > following writeback would be terminated early too. 
> > This should provide sufficient corrective pressure so that incidental > and temporary sharing of an inode doesn't become a persistent issue > while keeping the complexity necessary for implementing such pressure > fairly minimal and self-contained. Also, the changes necessary for > individual filesystems would be minimal. I like this proposal. It looks simple enough and when inodes aren't permanently write-shared it converges to the blkcg that is currently writing to the inode. So ack from me. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 2:19 ` Tejun Heo 2015-02-11 7:32 ` Jan Kara @ 2015-02-11 18:28 ` Greg Thelen 2015-02-11 20:33 ` Tejun Heo 1 sibling, 1 reply; 31+ messages in thread From: Greg Thelen @ 2015-02-11 18:28 UTC (permalink / raw) To: Tejun Heo Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Tue, Feb 10, 2015 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, again. > > On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote: >> If we can argue that memcg and blkcg having different views is >> meaningful and characterize and justify the behaviors stemming from >> the deviation, sure, that'd be fine, but I don't think we have that as >> of now. > > If we assume that memcg and blkcg having different views is something > which represents an acceptable compromise considering the use cases > and implementation convenience - IOW, if we assume that read-sharing > is something which can happen regularly while write sharing is a > corner case and that while not completely correct the existing > self-corrective behavior from tracking ownership per-page at the point > of instantiation is good enough (as a memcg under pressure is likely > to give up shared pages to be re-instantiated by another sharer w/ > more budget), we need to do the impedance matching between memcg and > blkcg at the writeback layer. > > The main issue there is that the last chain of IO pressure propagation > is realized by making individual dirtying tasks to converge on a > common target dirty ratio point which naturally depending on those > tasks seeing the same picture in terms of the current write bandwidth > and available memory and how much of it is dirty. Tasks dirtying > pages belonging to the same memcg while some of them are mostly being > written out by a different blkcg would wreck the mechanism. 
It won't > be difficult for one subset to make the other to consider themselves > under severe IO pressure when there actually isn't one in that group > possibly stalling and starving those tasks unduly. At more basic > level, it's just wrong for one group to be writing out significant > amount for another. > > These issues can persist indefinitely if we follow the same > instantiator-owns rule for inode writebacks. Even if we reset the > ownership when an inode becomes clea, it wouldn't work as it can be > dirtied over and over again while under writeback, and when things > like this happen, the behavior may become extremely difficult to > understand or characterize. We don't have visibility into how > individual pages of an inode get distributed across multiple cgroups, > who's currently responsible for writing back a specific inode or how > dirty ratio mechanism is behaving in the face of the unexpected > combination of parameters. > > Even if we assume that write sharing is a fringe case, we need > something better than first-whatever rule when choosing which blkcg is > responsible for writing a shared inode out. There needs to be a > constant corrective pressure so that incidental and temporary sharings > don't end up screwing up the mechanism for an extended period of time. > > Greg mentioned chossing the closest ancestor of the sharers, which > basically pushes inode sharing policy implmentation down to writeback > from memcg. This could work but we end up with the same collusion > problem as when this is used for memcg and it's even more difficult to > solve this at writeback layer - we'd have to communicate the shared > state all the way down to block layer and then implement a mechanism > there to take corrective measures and even after that we're likely to > end up with prolonged state where dirty ratio propagation is > essentially broken as the dirtier and writer would be seeing different > pictures. 
> > So, based on the assumption that write sharings are mostly incidental > and temporary (ie. we're basically declaring that we don't support > persistent write sharing), how about something like the following? > > 1. memcg contiues per-page tracking. > > 2. Each inode is associated with a single blkcg at a given time and > written out by that blkcg. > > 3. While writing back, if the number of pages from foreign memcg's is > higher than certain ratio of total written pages, the inode is > marked as disowned and the writeback instance is optionally > terminated early. e.g. if the ratio of foreign pages is over 50% > after writing out the number of pages matching 5s worth of write > bandwidth for the bdi, mark the inode as disowned. > > 4. On the following dirtying of the inode, the inode is associated > with the matching blkcg of the dirtied page. Note that this could > be the next cycle as the inode could already have been marked dirty > by the time the above condition triggered. In that case, the > following writeback would be terminated early too. > > This should provide sufficient corrective pressure so that incidental > and temporary sharing of an inode doesn't become a persistent issue > while keeping the complexity necessary for implementing such pressure > fairly minimal and self-contained. Also, the changes necessary for > individual filesystems would be minimal. > > I think this should work well enough as long as the forementioned > assumptions are true - IOW, if we maintain that write sharing is > unsupported. > > What do you think? > > Thanks. > > -- > tejun This seems good. I assume that blkcg writeback would query corresponding memcg for dirty page count to determine if over background limit. And balance_dirty_pages() would query memcg's dirty page count to throttle based on blkcg's bandwidth. Note: memcg doesn't yet have dirty page counts, but several of us have made attempts at adding the counters. And it shouldn't be hard to get them merged. 
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 18:28 ` Greg Thelen @ 2015-02-11 20:33 ` Tejun Heo 2015-02-11 21:22 ` Konstantin Khlebnikov 2015-02-12 2:10 ` Greg Thelen 0 siblings, 2 replies; 31+ messages in thread From: Tejun Heo @ 2015-02-11 20:33 UTC (permalink / raw) To: Greg Thelen Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, Greg. On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote: > This seems good. I assume that blkcg writeback would query > corresponding memcg for dirty page count to determine if over > background limit. And balance_dirty_pages() would query memcg's dirty Yeah, available memory to the matching memcg and the number of dirty pages in it. It's gonna work the same way as the global case just scoped to the cgroup. > page count to throttle based on blkcg's bandwidth. Note: memcg > doesn't yet have dirty page counts, but several of us have made > attempts at adding the counters. And it shouldn't be hard to get them > merged. Can you please post those? So, cool, we're in agreement. Working on it. It shouldn't take too long, hopefully. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
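The agreement reached above — balance_dirty_pages() working "the same way as the global case just scoped to the cgroup" — can be sketched as simple threshold arithmetic. The 20%/10% ratios below are illustrative stand-ins for the global vm.dirty_ratio/vm.dirty_background_ratio defaults, and the function names are invented; none of this is the kernel's implementation:

```python
# Sketch of scoped dirty throttling: compare the memcg's dirty page count
# against ratios of the memory available to that memcg, mirroring the
# global dirty_ratio / dirty_background_ratio split.

def memcg_dirty_limits(avail_pages, dirty_ratio=0.20, bg_ratio=0.10):
    """Return (throttle_limit, background_limit) in pages for a memcg."""
    return int(avail_pages * dirty_ratio), int(avail_pages * bg_ratio)

def should_throttle(dirty_pages, avail_pages):
    """Dirtier enters balance_dirty_pages-style throttling."""
    limit, _ = memcg_dirty_limits(avail_pages)
    return dirty_pages >= limit

def should_start_background_wb(dirty_pages, avail_pages):
    """The inode's owning blkcg starts background writeback."""
    _, bg = memcg_dirty_limits(avail_pages)
    return dirty_pages >= bg

# A memcg with 1000 pages available: background writeback kicks in at 100
# dirty pages, dirtier throttling at 200.
assert should_start_background_wb(150, 1000)
assert not should_throttle(150, 1000)
assert should_throttle(200, 1000)
```

This also shows why Greg's per-memcg dirty page counters are a prerequisite: without them, neither threshold can be evaluated per cgroup.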
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 20:33 ` Tejun Heo @ 2015-02-11 21:22 ` Konstantin Khlebnikov 2015-02-11 21:46 ` Tejun Heo 2015-02-12 2:10 ` Greg Thelen 1 sibling, 1 reply; 31+ messages in thread From: Konstantin Khlebnikov @ 2015-02-11 21:22 UTC (permalink / raw) To: Tejun Heo Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Wed, Feb 11, 2015 at 11:33 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Greg. > > On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote: >> This seems good. I assume that blkcg writeback would query >> corresponding memcg for dirty page count to determine if over >> background limit. And balance_dirty_pages() would query memcg's dirty > > Yeah, available memory to the matching memcg and the number of dirty > pages in it. It's gonna work the same way as the global case just > scoped to the cgroup. That might be a problem: all dirty pages accounted to a cgroup must be reachable for its own personal writeback or balance_dirty_pages() will be unable to satisfy memcg dirty memory thresholds. I've done accounting for a per-inode owner, but there is another option: shared inodes might be handled differently and be available for all (or related) cgroup writebacks. Another side is that the reclaimer now (mostly?) never triggers pageout. The memcg reclaimer should do something if it finds a shared dirty page: either move it into the right cgroup or make that inode reachable for memcg writeback. I've sent a patch which marks shared dirty inodes with a flag, I_DIRTY_SHARED or so. > >> page count to throttle based on blkcg's bandwidth. Note: memcg >> doesn't yet have dirty page counts, but several of us have made >> attempts at adding the counters. And it shouldn't be hard to get them >> merged. > > Can you please post those? > > So, cool, we're in agreement. Working on it. 
It shouldn't take too > long, hopefully. Good. As far as I can see, this design is almost identical to my proposal, except maybe for that dumb first-owns-all-until-the-end rule. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 21:22 ` Konstantin Khlebnikov @ 2015-02-11 21:46 ` Tejun Heo 2015-02-11 21:57 ` Konstantin Khlebnikov 0 siblings, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-11 21:46 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote: > > Yeah, available memory to the matching memcg and the number of dirty > > pages in it. It's gonna work the same way as the global case just > > scoped to the cgroup. > > That might be a problem: all dirty pages accounted to cgroup must be > reachable for its own personal writeback or balanace-drity-pages will be > unable to satisfy memcg dirty memory thresholds. I've done accounting Yeah, it would. Why wouldn't it? > for per-inode owner, but there is another option: shared inodes might be > handled differently and will be available for all (or related) cgroup > writebacks. I'm not following you at all. The only reason this scheme can work is because we exclude persistent shared write cases. As the whole thing is based on that assumption, special casing shared inodes doesn't make any sense. Doing things like allowing all cgroups to write shared inodes without getting memcg on-board almost immediately breaks pressure propagation while making shared writes a lot more attractive and increasing implementation complexity substantially. Am I missing something? > Another side is that reclaimer now (mosly?) never trigger pageout. > Memcg reclaimer should do something if it finds shared dirty page: > either move it into right cgroup or make that inode reachable for > memcg writeback. I've send patch which marks shared dirty inodes > with flag I_DIRTY_SHARED or so. 
It *might* make sense for memcg to drop pages being dirtied which don't match the currently associated blkcg of the inode; however, again, as we're basically declaring that shared writes aren't supported, I'm skeptical about the usefulness. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 21:46 ` Tejun Heo @ 2015-02-11 21:57 ` Konstantin Khlebnikov 2015-02-11 22:05 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Konstantin Khlebnikov @ 2015-02-11 21:57 UTC (permalink / raw) To: Tejun Heo Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote: >> > Yeah, available memory to the matching memcg and the number of dirty >> > pages in it. It's gonna work the same way as the global case just >> > scoped to the cgroup. >> >> That might be a problem: all dirty pages accounted to a cgroup must be >> reachable for its own personal writeback or balance-dirty-pages will be >> unable to satisfy memcg dirty memory thresholds. I've done accounting > > Yeah, it would. Why wouldn't it? How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages? Or are you thinking only about separating the writeback flow into blkio cgroups without actual inode filtering? I mean delaying inode writeback and keeping dirty pages as long as possible if their cgroups are far from the threshold. > >> for the per-inode owner, but there is another option: shared inodes might be >> handled differently and be available for all (or related) cgroup >> writebacks. > > I'm not following you at all. The only reason this scheme can work is > because we exclude persistent shared write cases. As the whole thing > is based on that assumption, special-casing shared inodes doesn't make > any sense.
Doing things like allowing all cgroups to write shared > inodes without getting memcg on-board almost immediately breaks > pressure propagation while making shared writes a lot more attractive > and increasing implementation complexity substantially. Am I missing > something? > >> Another side is that the reclaimer now (mostly?) never triggers pageout. >> The memcg reclaimer should do something if it finds a shared dirty page: >> either move it into the right cgroup or make that inode reachable for >> memcg writeback. I've sent a patch which marks shared dirty inodes >> with a flag, I_DIRTY_SHARED or so. > > It *might* make sense for memcg to drop pages being dirtied which > don't match the currently associated blkcg of the inode; however, > again, as we're basically declaring that shared writes aren't > supported, I'm skeptical about the usefulness. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 21:57 ` Konstantin Khlebnikov @ 2015-02-11 22:05 ` Tejun Heo 2015-02-11 22:15 ` Konstantin Khlebnikov 0 siblings, 1 reply; 31+ messages in thread From: Tejun Heo @ 2015-02-11 22:05 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote: > On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote: > > Hello, > > > > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote: > >> > Yeah, available memory to the matching memcg and the number of dirty > >> > pages in it. It's gonna work the same way as the global case just > >> > scoped to the cgroup. > >> > >> That might be a problem: all dirty pages accounted to a cgroup must be > >> reachable for its own personal writeback or balance-dirty-pages will be > >> unable to satisfy memcg dirty memory thresholds. I've done accounting > > > > Yeah, it would. Why wouldn't it? > > How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages? > Or are you thinking only about separating the writeback flow into blkio cgroups > without actual inode filtering? I mean delaying inode writeback and keeping > dirty pages as long as possible if their cgroups are far from the threshold. What? The code was already in the previous patchset. I'm just gonna rip out the code to handle an inode being dirtied on multiple wb's. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 22:05 ` Tejun Heo @ 2015-02-11 22:15 ` Konstantin Khlebnikov 2015-02-11 22:30 ` Tejun Heo 0 siblings, 1 reply; 31+ messages in thread From: Konstantin Khlebnikov @ 2015-02-11 22:15 UTC (permalink / raw) To: Tejun Heo Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Thu, Feb 12, 2015 at 1:05 AM, Tejun Heo <tj@kernel.org> wrote: > On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote: >> On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote: >> > Hello, >> > >> > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote: >> >> > Yeah, available memory to the matching memcg and the number of dirty >> >> > pages in it. It's gonna work the same way as the global case just >> >> > scoped to the cgroup. >> >> >> >> That might be a problem: all dirty pages accounted to a cgroup must be >> >> reachable for its own personal writeback or balance-dirty-pages will be >> >> unable to satisfy memcg dirty memory thresholds. I've done accounting >> > >> > Yeah, it would. Why wouldn't it? >> >> How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages? >> Or are you thinking only about separating the writeback flow into blkio cgroups >> without actual inode filtering? I mean delaying inode writeback and keeping >> dirty pages as long as possible if their cgroups are far from the threshold. > > What? The code was already in the previous patchset. I'm just gonna > rip out the code to handle an inode being dirtied on multiple wb's. Well, ok. Even if shared writes are rare, they should be handled somehow without relying on kupdate-like writeback.
If a memcg has a lot of dirty pages but their inodes accidentally belong to the wrong wb queues, then tasks in that memcg shouldn't get stuck in balance-dirty-pages until somebody outside accidentally writes this data. That's all I wanted to say. > > -- > tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 22:15 ` Konstantin Khlebnikov @ 2015-02-11 22:30 ` Tejun Heo 0 siblings, 0 replies; 31+ messages in thread From: Tejun Heo @ 2015-02-11 22:30 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins Hello, On Thu, Feb 12, 2015 at 02:15:29AM +0400, Konstantin Khlebnikov wrote: > Well, ok. Even if shared writes are rare, they should be handled somehow > without relying on kupdate-like writeback. If a memcg has a lot of dirty pages This only works iff we consider those cases to be marginal enough to handle them in a pretty ghetto way. > but their inodes accidentally belong to the wrong wb queues, then tasks in > that memcg shouldn't get stuck in balance-dirty-pages until somebody outside > accidentally writes this data. That's all I wanted to say. But, right, yeah, corner cases around this could be nasty if the writeout interval is set really high. I don't think it matters for the default 5s interval at all. Maybe what we need is queueing a delayed per-wb work w/ the default writeout interval when dirtying a foreign inode. I'll think more about it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Making memcg track ownership per address_space or anon_vma 2015-02-11 20:33 ` Tejun Heo 2015-02-11 21:22 ` Konstantin Khlebnikov @ 2015-02-12 2:10 ` Greg Thelen 1 sibling, 0 replies; 31+ messages in thread From: Greg Thelen @ 2015-02-12 2:10 UTC (permalink / raw) To: Tejun Heo Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins On Wed, Feb 11, 2015 at 12:33 PM, Tejun Heo <tj@kernel.org> wrote: [...] >> page count to throttle based on blkcg's bandwidth. Note: memcg >> doesn't yet have dirty page counts, but several of us have made >> attempts at adding the counters. And it shouldn't be hard to get them >> merged. > > Can you please post those? Will do. Rebasing and testing needed, so it won't be today. ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2015-02-12 2:10 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-01-30 4:43 [RFC] Making memcg track ownership per address_space or anon_vma Tejun Heo 2015-01-30 5:55 ` Greg Thelen 2015-01-30 6:27 ` Tejun Heo 2015-01-30 16:07 ` Tejun Heo 2015-02-02 19:26 ` Konstantin Khlebnikov 2015-02-02 19:46 ` Tejun Heo 2015-02-03 23:30 ` Greg Thelen 2015-02-04 10:49 ` Konstantin Khlebnikov 2015-02-04 17:15 ` Tejun Heo 2015-02-04 17:58 ` Konstantin Khlebnikov 2015-02-04 18:28 ` Tejun Heo 2015-02-04 17:06 ` Tejun Heo 2015-02-04 23:51 ` Greg Thelen 2015-02-05 13:15 ` Tejun Heo 2015-02-05 22:05 ` Greg Thelen 2015-02-05 22:25 ` Tejun Heo 2015-02-06 0:03 ` Greg Thelen 2015-02-06 14:17 ` Tejun Heo 2015-02-06 23:43 ` Greg Thelen 2015-02-07 14:38 ` Tejun Heo 2015-02-11 2:19 ` Tejun Heo 2015-02-11 7:32 ` Jan Kara 2015-02-11 18:28 ` Greg Thelen 2015-02-11 20:33 ` Tejun Heo 2015-02-11 21:22 ` Konstantin Khlebnikov 2015-02-11 21:46 ` Tejun Heo 2015-02-11 21:57 ` Konstantin Khlebnikov 2015-02-11 22:05 ` Tejun Heo 2015-02-11 22:15 ` Konstantin Khlebnikov 2015-02-11 22:30 ` Tejun Heo 2015-02-12 2:10 ` Greg Thelen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).