* Improvement on memory subsystem @ 2006-07-18 10:03 yunfeng zhang 2006-07-18 12:18 ` Valdis.Kletnieks ` (3 more replies) 0 siblings, 4 replies; 14+ messages in thread From: yunfeng zhang @ 2006-07-18 10:03 UTC (permalink / raw) To: linux-kernel

Dear Linux core memory developers:

It's my pleasure to show you some ideas from my OS, named Zero. The features listed below can, I think, be introduced into Linux 2.6.16.1 and later. The following section is divided into two parts by implementation difficulty.

Minor improvements.
1. Apply the dlmalloc algorithm (http://g.oswego.edu/dl/html/malloc.html) to memory page allocation instead of the buddy algorithm. As a result, we can get more consecutive memory pages easily.
2. Read-ahead during page-in/out (page fault or swap out) should be based on the VMA, to enhance IO efficiency, instead of on the neighbouring physical pages in swap space.
3. All slabs are of the off-slab type; store the slab instance in the page structure.
4. Introduce a PrivateVMA class and discard anonymous VMAs to simplify the relationship between a VMA and its pages. When a VMA is split/combined, update the mapping member of all related pages. In fact, those operations should be rare.
5. Add a lock bit in the pte. Note, this feature requires the CPU to provide a programmer-available bit in the pte. With it we can avoid allocating a page before locking the pte during do_anonymous_page; in other words, it relieves memory page allocation pressure.
6. Swap out pages by scanning all VMAs linked to a Zone instead of scanning pages.

Major improvements.
1. No COW on anonymous VMAs.
2. Dynamic page mapping on core space. This is a further development of item 1 of the minor improvement section; other features built on it are applying the DLPTE algorithm to the core PTE array and introducing a RemapDaemon.

You can download http://www.cublog.cn/u/21764/upfile/060718173355.zip to find out more about those features; it illustrates my ideas conveniently, as there are a lot of diagrams in it.
A summary of the archive: MemoryArchitecture.pdf is the documentation of my Zero OS memory subsystem. Code in Implementation/memory/ shows some sample implementation of the memory subsystem. Note, unlike other OS authors, I do like to write documentation, and my OS is far from complete, even its memory subsystem. My blog (Chinese site): http://zeroos.cublog.cn/

Regards,
Yunfeng Zhang

^ permalink raw reply [flat|nested] 14+ messages in thread
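[Editorial note: item 5 of the minor list above - a lock bit in the pte - can be sketched roughly as follows. This is an illustrative sketch, not code from the Zero OS archive: the bit position (bit 9, one of the software-available PTE bits on x86) and the helper names are assumptions, and real kernel code would need an atomic compare-and-swap rather than a plain read-modify-write.]

```c
#include <stdint.h>

/* Hypothetical software lock bit; x86 leaves bits 9-11 of a PTE
 * available to software, so bit 9 is a plausible choice. */
#define PTE_SOFT_LOCK (1u << 9)

/* Try to take the per-PTE lock; returns 1 on success, 0 if held.
 * Real kernel code would use an atomic cmpxchg here. */
int pte_trylock(uint32_t *pte)
{
    uint32_t old = *pte;
    if (old & PTE_SOFT_LOCK)
        return 0;
    *pte = old | PTE_SOFT_LOCK;
    return 1;
}

void pte_unlock(uint32_t *pte)
{
    *pte &= ~PTE_SOFT_LOCK;
}
```

The point of the scheme is that do_anonymous_page can mark the pte busy first and allocate the page afterwards, instead of allocating speculatively before taking the pte lock.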
* Re: Improvement on memory subsystem 2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang @ 2006-07-18 12:18 ` Valdis.Kletnieks 2006-07-19 3:44 ` yunfeng zhang 2006-07-19 9:18 ` Ian Stirling 2006-07-18 16:25 ` Pekka Enberg ` (2 subsequent siblings) 3 siblings, 2 replies; 14+ messages in thread From: Valdis.Kletnieks @ 2006-07-18 12:18 UTC (permalink / raw) To: yunfeng zhang; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 1654 bytes --]

On Tue, 18 Jul 2006 18:03:54 +0800, yunfeng zhang said:
> 2. Read-ahead process during page-in/out (page fault or swap out) should be
> based on its VMA to enhance IO efficiency instead of the relative physical pages
> in swap space.

But wouldn't that end up causing a seek storm, rather than handling the pages in the order that minimizes the total seek distance, no matter where they are in memory? Remember - if you have a 2GHz processor, and a disk that seeks in 1 millisecond, every seek is (*very* roughly) about 2 million instructions. So if we can burn 20 thousand instructions finding a read order that eliminates *one* seek, we're 1.98M instructions ahead.

Now, if you have an improved read-ahead that spews out page requests that are both elevator-friendly and temporal-friendly, *then* you might be onto something. For instance, if you can identify 80 pages that will likely be needed in the next 50 milliseconds, of which 50 pages will likely be needed in the next 30ms, you want to issue those 50 first, in an elevator-friendly manner (uncaffeinated handwave here) - and then issue the other 30 page requests in a second burst the next time the elevator goes by.

Note this requires the read-ahead to get a *lot* more chummy with the elevator than it seems to currently. In particular, readahead would need to be able to hold off submitting the "later" 30 pages until it could be sure that the elevator wouldn't merge them into the queue in a way that would slow down the first 50 requests.
If it does that already, somebody just smack me. And if there's a good reason not to do that, hand me some caffeine and a clue. :) [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-18 12:18 ` Valdis.Kletnieks @ 2006-07-19 3:44 ` yunfeng zhang 2006-07-19 9:18 ` Ian Stirling 1 sibling, 0 replies; 14+ messages in thread From: yunfeng zhang @ 2006-07-19 3:44 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: linux-kernel

> On Tue, 18 Jul 2006 18:03:54 +0800, yunfeng zhang said:
>
> But wouldn't that end up causing a seek storm, rather than handling the pages
> in the order that minimizes the total seek distance, no matter where they are
> in memory? Remember - if you have a 2Ghz processor, and a disk that seeks in 1
> millisecond, every seek is (*very* roughly) about 2 million instructions. So
> if we can burn 20 thousand instructions finding a read order that eliminates
> *one* seek, we're 1.98M instructions ahead.

A further example is shown below.

For page faults (page-in operations): scan the pte that triggered the fault and the ptes that follow it in its VMA; if the followers are of swap_entry_t type and their swap offsets are close enough to the faulting pte's, read them in together.

For the swap daemon (page-out operations): scan every pte of a VMA; when we find an appropriate candidate, lock it and its following ptes, provided they are all appropriate swap-out objects, then allocate consecutive swap pages from swap space. If we succeed, do one efficient asynchronous IO operation; if we fail, shrink the range of ptes.

Isn't that right? By the way, all the improvements I listed are only introduced briefly; most of them are complex, and maybe only my documentation can describe them clearly.

^ permalink raw reply [flat|nested] 14+ messages in thread
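[Editorial note: the VMA-based page-in clustering described above can be sketched as follows. `vma_readahead_count` and the window size are hypothetical names and values chosen for illustration, not part of any real kernel API.]

```c
#include <stdlib.h>

/* How far apart (in swap slots) two ptes' swap offsets may lie and
 * still be merged into one read; an arbitrary illustrative value. */
#define READAHEAD_WINDOW 8

/* offsets[] holds the swap offset recorded in each pte of the VMA,
 * or -1 for a pte whose page is already present. Starting from the
 * faulting pte, count how many following ptes are swapped out to
 * offsets close enough to be read in with the same IO request. */
int vma_readahead_count(const long *offsets, int n, int fault_idx)
{
    long base = offsets[fault_idx];
    int count = 0;
    for (int i = fault_idx + 1; i < n; i++) {
        if (offsets[i] < 0)
            break;                  /* page already in memory */
        if (labs(offsets[i] - base) > READAHEAD_WINDOW)
            break;                  /* would cost an extra seek */
        count++;
        base = offsets[i];          /* allow a slowly drifting run */
    }
    return count;
}
```

For a fault on a pte whose neighbours sit at nearby swap offsets, the count tells the fault handler how many extra pages to pull in with the same request.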
* Re: Improvement on memory subsystem 2006-07-18 12:18 ` Valdis.Kletnieks 2006-07-19 3:44 ` yunfeng zhang @ 2006-07-19 9:18 ` Ian Stirling 2006-07-19 14:56 ` Valdis.Kletnieks 1 sibling, 1 reply; 14+ messages in thread From: Ian Stirling @ 2006-07-19 9:18 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: yunfeng zhang, linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> On Tue, 18 Jul 2006 18:03:54 +0800, yunfeng zhang said:
>
>>2. Read-ahead process during page-in/out (page fault or swap out) should be
>>based on its VMA to enhance IO efficiency instead of the relative physical pages
>>in swap space.
>
> But wouldn't that end up causing a seek storm, rather than handling the pages
> in the order that minimizes the total seek distance, no matter where they are
> in memory? Remember - if you have a 2Ghz processor, and a disk that seeks in 1
> millisecond, every seek is (*very* roughly) about 2 million instructions. So
> if we can burn 20 thousand instructions finding a read order that eliminates
> *one* seek, we're 1.98M instructions ahead.

To paraphrase Shakespeare - all the world is not a P4 - and all swap devices are not hard disks.

For example - I've got a 486/33 laptop with 12M RAM that I sometimes use, with swapping to a 128M PCMCIA RAM card that I got from somewhere. 20K instructions wasted on a device with no seek time is just annoying.

And on my main laptop - I have experimented with swap-over-wifi to a large ramdisk on my server - which works quite well (until the wifi connection falls over).

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-19 9:18 ` Ian Stirling @ 2006-07-19 14:56 ` Valdis.Kletnieks 0 siblings, 0 replies; 14+ messages in thread From: Valdis.Kletnieks @ 2006-07-19 14:56 UTC (permalink / raw) To: Ian Stirling; +Cc: yunfeng zhang, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1271 bytes --] On Wed, 19 Jul 2006 10:18:44 BST, Ian Stirling said: > To paraphrase shakespear - all the world is not a P4 - and all the swap > devices are not hard disks. Been there, done that. I used to admin a net of Sun 3/50s where /dev/swap was a symlink to a file on an NFS server, because the "shoebox" local hard drives for those were so slow that throwing it across the ethernet to a 3/280 with Fujitsu Super-Eagles was faster... > For example - I've got a 486/33 laptop with 12M RAM that I sometimes use > , with swapping to a 128M PCMCIA RAM card that I got from somewhere. If we go to the effort of writing code that tries to be smart about grouping swap reads/writes by cost, it's easy enough to flag any sort of ram-disk device as a 'zero seek time' device. Remember that I suggested making it dependent on "how long until the next pass of the elevator" - for a ramdisk that basically is zero, so the algorithm easily degenerates into "just queue the requests in expected order you'll need the results". > 20K instructions wasted on a device with no seek time is just annoying. On the other hand, how long does it take to move a 4K page across the PCMCIA interface? If you're seeing deep queues on it, you may *still* want to optimize the order of requests... [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang 2006-07-18 12:18 ` Valdis.Kletnieks @ 2006-07-18 16:25 ` Pekka Enberg 2006-07-19 3:21 ` yunfeng zhang 2006-07-24 9:01 ` yunfeng zhang 2006-08-23 10:39 ` yunfeng zhang 3 siblings, 1 reply; 14+ messages in thread From: Pekka Enberg @ 2006-07-18 16:25 UTC (permalink / raw) To: yunfeng zhang; +Cc: linux-kernel On 7/18/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote: > 3. All slabs are all off-slab type. Store slab instance in page structure. Not sure what you mean. We need much more than sizeof(struct page) for slab management. Hmm? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-18 16:25 ` Pekka Enberg @ 2006-07-19 3:21 ` yunfeng zhang 2006-07-19 8:30 ` Pekka Enberg 0 siblings, 1 reply; 14+ messages in thread From: yunfeng zhang @ 2006-07-19 3:21 UTC (permalink / raw) To: Pekka Enberg; +Cc: linux-kernel

2006/7/19, Pekka Enberg <penberg@cs.helsinki.fi>:
> On 7/18/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote:
> > 3. All slabs are all off-slab type. Store slab instance in page structure.
>
> Not sure what you mean. We need much more than sizeof(struct page) for
> slab management. Hmm?
>

The current page struct looks like this:

struct page {
        unsigned long flags;
        atomic_t _count;
        atomic_t _mapcount;
        union {
                struct {
                        unsigned long private;
                        struct address_space *mapping;
                };
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
                spinlock_t ptl;
#endif
        };
        pgoff_t index;
        struct list_head lru;
#if defined(WANT_PAGE_VIRTUAL)
        void *virtual;
#endif /* WANT_PAGE_VIRTUAL */
};

Most fields in the page structure are used only for user pages; for a core slab page they aren't touched at all. So I think we should define a union:

struct page {
        unsigned long flags;
        struct slab {
                struct list_head list;
                unsigned long colouroff;
                void *s_mem;
                unsigned int inuse;
                kmem_bufctl_t free;
                unsigned short nodeid;
        };
};

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-19 3:21 ` yunfeng zhang @ 2006-07-19 8:30 ` Pekka Enberg 2006-07-19 10:13 ` yunfeng zhang 0 siblings, 1 reply; 14+ messages in thread From: Pekka Enberg @ 2006-07-19 8:30 UTC (permalink / raw) To: yunfeng zhang; +Cc: linux-kernel On 7/18/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote: > > > 3. All slabs are all off-slab type. Store slab instance in page structure. 2006/7/19, Pekka Enberg <penberg@cs.helsinki.fi>: > > Not sure what you mean. We need much more than sizeof(struct page) for > > slab management. Hmm? On 7/19/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote: > Current page struct is just like this [snip] Which, like I said, is not enough to hold slab management structures (we need an array of bufctl_t in addition to struct slab). ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-19 8:30 ` Pekka Enberg @ 2006-07-19 10:13 ` yunfeng zhang 2006-07-19 10:35 ` Pekka Enberg 0 siblings, 1 reply; 14+ messages in thread From: yunfeng zhang @ 2006-07-19 10:13 UTC (permalink / raw) To: Pekka Enberg; +Cc: linux-kernel

> Which, like I said, is not enough to hold slab management structures
> (we need an array of bufctl_t in addition to struct slab).
>

Here, off-slab means the same as the off-slab concept in Linux; doesn't Linux store the bufctl_t array in its off-slab object?

I consider that we should try our best to exploit the potential of the page structure. In my OS, the page structure is just like a union and is cast into different types according to its flags field.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-19 10:13 ` yunfeng zhang @ 2006-07-19 10:35 ` Pekka Enberg 0 siblings, 0 replies; 14+ messages in thread From: Pekka Enberg @ 2006-07-19 10:35 UTC (permalink / raw) To: yunfeng zhang; +Cc: linux-kernel On 7/19/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote: > Here, the off-slab is the same as the off-slab concept of Linux, > doesn't Linux stores bufctl_t array in its off-slab object? > > I consider that we should try our best to explore the potential of > page structure. In my OS, page structure is just like a union and is > cast into different types according to its flag field. The slab allocator currently allocates space for struct slab and the bufctls next to each other regardless of whether we are allocating on-slab or off-slab. Your approach of splitting them results in more data cache pressure for the cache_alloc() path, I think. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang 2006-07-18 12:18 ` Valdis.Kletnieks 2006-07-18 16:25 ` Pekka Enberg @ 2006-07-24 9:01 ` yunfeng zhang 2006-07-24 14:55 ` David Lang 2006-08-23 10:39 ` yunfeng zhang 3 siblings, 1 reply; 14+ messages in thread From: yunfeng zhang @ 2006-07-24 9:01 UTC (permalink / raw) To: linux-kernel

How can we let the memory subsystem allocate bigger consecutive runs of memory pages? In current Linux, for a driver programmer a request to alloc_pages with a large enough order parameter nearly always fails; in other words, it's difficult to allocate a big consecutive physical memory block.

The reasons causing the problem, I think, are the items listed below:
1) Core space has a static mapping relationship with physical memory pages. So once a core page is allocated, its core address is also fixed, which prevents the physical pages around it from conglomerating together.
2) The current physical page management algorithm is the buddy algorithm. Its main advantage is that pages managed by it are always aligned to a power of two. But is that necessary - is there any hardware which needs physical memory pages aligned to a power of two or more?

The solution is:
1) Use dynamic page mapping on core space. Then we can move all core pages freely at any time to conglomerate bigger consecutive runs of memory pages; a new background daemon thread - RemapDaemon - can do the conglomeration periodically.
2) Use another page management algorithm instead of buddy; the minimum unit of the new algorithm should be the page. In fact, I think the dlmalloc algorithm is a good candidate; it is also conglomeration-friendly.

^ permalink raw reply [flat|nested] 14+ messages in thread
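[Editorial note: the difference from buddy that the post is driving at - free runs of arbitrary length that coalesce naturally - can be illustrated with a toy first-fit page-run allocator. This is a deliberately simplified sketch (a linear bitmap scan, without the free lists and boundary tags dlmalloc actually uses); all names are hypothetical.]

```c
#include <stdbool.h>

#define NPAGES 16
static bool page_used[NPAGES];

/* First-fit allocation of n consecutive pages; returns the start
 * index or -1. Unlike buddy, a run needs no power-of-two size or
 * alignment, so freed neighbours of any size form usable runs. */
int alloc_run(int n)
{
    for (int start = 0; start + n <= NPAGES; start++) {
        int i;
        for (i = 0; i < n && !page_used[start + i]; i++)
            ;
        if (i == n) {               /* found a free run of length n */
            while (i--)
                page_used[start + i] = true;
            return start;
        }
        start += i;                 /* skip past the used page we hit */
    }
    return -1;
}

void free_run(int start, int n)
{
    /* adjacent free runs merge implicitly in the bitmap */
    for (int i = 0; i < n; i++)
        page_used[start + i] = false;
}
```

Freeing a 3-page run next to a free 2-page run immediately yields an allocatable 5-page run, which a buddy allocator could only offer if the two happened to be aligned buddies of the same order.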
* Re: Improvement on memory subsystem 2006-07-24 9:01 ` yunfeng zhang @ 2006-07-24 14:55 ` David Lang 2006-07-25 8:33 ` yunfeng zhang 0 siblings, 1 reply; 14+ messages in thread From: David Lang @ 2006-07-24 14:55 UTC (permalink / raw) To: yunfeng zhang; +Cc: linux-kernel

On Mon, 24 Jul 2006, yunfeng zhang wrote:

> How can we let the memory subsystem allocate bigger consecutive runs of
> memory pages? In current Linux, for a driver programmer a request to
> alloc_pages with a large enough order parameter nearly always fails; in
> other words, it's difficult to allocate a big consecutive physical memory
> block.
>
> The reasons causing the problem, I think, are the items listed below:
> 1) Core space has a static mapping relationship with physical memory pages.
> So once a core page is allocated, its core address is also fixed, which
> prevents the physical pages around it from conglomerating together.
> 2) The current physical page management algorithm is the buddy algorithm.
> Its main advantage is that pages managed by it are always aligned to a
> power of two. But is that necessary - is there any hardware which needs
> physical memory pages aligned to a power of two or more?
>
> The solution is:
> 1) Use dynamic page mapping on core space. Then we can move all core pages
> freely at any time to conglomerate bigger consecutive runs of memory pages;
> a new background daemon thread - RemapDaemon - can do the conglomeration
> periodically.

this gets discussed periodically, however the performance hit of doing the mapping for all core memory accesses is something that the developers have not been willing to accept.

> 2) Use another page management algorithm instead of buddy; the minimum unit
> of the new algorithm should be the page. In fact, I think the dlmalloc
> algorithm is a good candidate; it is also conglomeration-friendly.

experiment; if it's as good as you think it will be, post the numbers and you will get a lot of attention. quite a few people are working on memory allocation for various things; there are a lot of usage patterns to balance (and pathological problems to avoid).

David Lang

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Improvement on memory subsystem 2006-07-24 14:55 ` David Lang @ 2006-07-25 8:33 ` yunfeng zhang 0 siblings, 0 replies; 14+ messages in thread From: yunfeng zhang @ 2006-07-25 8:33 UTC (permalink / raw) To: David Lang; +Cc: linux-kernel, valdis.kletnieks, penberg

No COW in private VMAs

In current Linux 2.6.16, Linux applies the copy-on-write technique when an application passes CLONE_MM to do_fork; as a result, a private VMA of a process shares its private pages with another process. That increases memory subsystem complexity. In fact, an OS should be application-oriented, not standard-oriented. In most cases, supporting the POSIX thread model, vfork and execve is enough for applications. In other words, we should optimize the system for the frequent cases - that is, do the copy only when someone really calls fork with CLONE_MM.

No COW in private VMAs makes a simple one-to-one relationship among the pte, the private page and the swap_entry of a private VMA; we benefit from the model as follows.

A new PTE type is introduced here before we go on:

struct UnmappedPTE {
        present : 1; // = 0.
        ...;
        pageNum : 20;
};

1) For the swap daemon, we can give a fairer opportunity to every private page. As I've suggested, the swap daemon should swap out pages based on VMAs instead of the memory page array. So we do the steps listed below for every pte of a private VMA:
   a) Convert the pte to the UnmappedPTE type. That doesn't free the private page at all; UnmappedPTE::pageNum keeps a trace of the private page.
   b) Allocate a swap entry for the private page of the pte and page it out; remember to do the job on the current pte and its following ptes together - I've explained the virtue of that above.
   c) If the pte is still untouched, reclaim the private page and convert the pte to the SwappedPTE type.
   This flow is better than the implementation in current Linux, I think.

2) We can save a little memory. The current swap space includes two parts. One is swap_info_struct, whose responsibility is tracing the physical swap area via the array of shorts swap_info_struct::swap_map; now it could be a bit array. The other is an address_space structure, which a process uses to test whether its swap pages have been read into memory; now it can be discarded - once a swap page is read in, it's linked with its pte by an UnmappedPTE.

Note, No COW in PrivateVMA is a bigger improvement.

^ permalink raw reply [flat|nested] 14+ messages in thread
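[Editorial note: the three pte states implied by the scheme above (mapped, unmapped-but-resident, swapped) could be discriminated like this. The bit layout is an assumption for illustration - the post only says that the present bit is 0 for an UnmappedPTE and that pageNum keeps the trace - and `classify_pte` is a hypothetical helper.]

```c
#include <stdint.h>

/* Assumed 32-bit PTE word: bit 0 = hardware present bit; bit 1 =
 * a software "resident but unmapped" bit (the position is a guess);
 * the high bits carry pageNum - a page frame number for an
 * UnmappedPTE, a swap slot for a SwappedPTE. */
enum pte_kind { PTE_MAPPED, PTE_UNMAPPED, PTE_SWAPPED };

enum pte_kind classify_pte(uint32_t pte)
{
    if (pte & 0x1)
        return PTE_MAPPED;    /* CPU can use the mapping directly */
    if (pte & 0x2)
        return PTE_UNMAPPED;  /* page kept; pageNum traces it */
    return PTE_SWAPPED;       /* pageNum now names a swap slot */
}
```

The one-to-one pte/page/swap_entry relationship means this single word is enough to drive steps a) to c) of the swap-out flow.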
* Re: Improvement on memory subsystem 2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang ` (2 preceding siblings ...) 2006-07-24 9:01 ` yunfeng zhang @ 2006-08-23 10:39 ` yunfeng zhang 3 siblings, 0 replies; 14+ messages in thread From: yunfeng zhang @ 2006-08-23 10:39 UTC (permalink / raw) To: linux-kernel

A new design has been appended to my OS; it's mainly focused on how to allocate swap pages for each PrivateVMA efficiently in the swap daemon. Let's first see a PTE array snapshot of a PrivateVMA:

    U-U-P-P-P-S-S

U: UnmappedPTE; P: PTE; S: SwappedPTE.

The snapshot shows that the pages of the PrivateVMA have an access affinity: the 3rd, 4th and 5th have been accessed since the last scan of the PrivateVMA, while the other PTEs are still untouched. A concept is introduced here to describe maximal runs of PTEs in the same state -- series. So we've got three series from the snapshot: (1st and 2nd), (3rd, 4th and 5th) and (6th and 7th). The design is to allocate swap pages by series; it should make it more efficient for future page faults to page them back in.

There is further discussion in my new documentation, http://www.cublog.cn/u/21764/upfile/060823181124.zip, in the sections "Memory Daemons >> SwapDaemon and PrivateVMA" and "Class Repository >> SwapDaemon". There's also sample code in the Implementation subdirectory of the archive.

^ permalink raw reply [flat|nested] 14+ messages in thread
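[Editorial note: the series concept - maximal runs of PTEs in the same state - reduces to a simple run-counting pass. `count_series` below is an illustrative helper with states encoded as characters, not code from the archive.]

```c
/* Count the maximal runs ("series") of identical PTE states in a
 * VMA snapshot; 'U' = UnmappedPTE, 'P' = PTE, 'S' = SwappedPTE.
 * A new series starts wherever the state changes. */
int count_series(const char *states)
{
    int n = 0;
    for (int i = 0; states[i]; i++)
        if (i == 0 || states[i] != states[i - 1])
            n++;
    return n;
}
```

For the snapshot "UUPPPSS" this yields the three series the post describes; the swap daemon would then allocate one consecutive block of swap slots per series.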
end of thread, other threads:[~2006-08-23 10:39 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang 2006-07-18 12:18 ` Valdis.Kletnieks 2006-07-19 3:44 ` yunfeng zhang 2006-07-19 9:18 ` Ian Stirling 2006-07-19 14:56 ` Valdis.Kletnieks 2006-07-18 16:25 ` Pekka Enberg 2006-07-19 3:21 ` yunfeng zhang 2006-07-19 8:30 ` Pekka Enberg 2006-07-19 10:13 ` yunfeng zhang 2006-07-19 10:35 ` Pekka Enberg 2006-07-24 9:01 ` yunfeng zhang 2006-07-24 14:55 ` David Lang 2006-07-25 8:33 ` yunfeng zhang 2006-08-23 10:39 ` yunfeng zhang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).