linux-kernel.vger.kernel.org archive mirror
* Improvement on memory subsystem
@ 2006-07-18 10:03 yunfeng zhang
  2006-07-18 12:18 ` Valdis.Kletnieks
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: yunfeng zhang @ 2006-07-18 10:03 UTC (permalink / raw)
  To: linux-kernel

Dear Linux core memory developers:

It's my pleasure to present some ideas from my OS, named Zero, to you. The
features listed below can, I think, be introduced into Linux 2.6.16.1 and later.
They are divided into two parts by implementation difficulty.

Minor improvements.
1. Apply the dlmalloc algorithm (http://g.oswego.edu/dl/html/malloc.html) to
memory page allocation instead of the buddy algorithm. As a result, we can
obtain larger runs of consecutive memory pages more easily.
2. Read-ahead during page-in/out (page fault or swap-out) should be based on
the VMA, rather than on adjacent physical pages in swap space, to improve I/O
efficiency.
3. Make all slabs off-slab and store the slab instance in the page structure.
4. Introduce a PrivateVMA class and discard the anonymous VMA to simplify the
relationship between a VMA and its pages. When a VMA is split or merged, update
the mapping member of all related pages; in practice, those operations should be
rare.
5. Add a lock bit to the pte. Note that this feature requires the CPU to
reserve a software-available bit in the pte. With it we can avoid allocating a
page before locking the pte in do_anonymous_page, relieving memory page
allocation pressure (see the sketch after this list).
6. Swap out pages by scanning all VMAs linked to a zone, instead of scanning pages.
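
A minimal userspace sketch of item 5, assuming the CPU leaves one
software-available bit per pte; _PTE_SOFT_LOCK, alloc_zeroed_page() and
handle_anonymous_fault() are made-up names for illustration, not existing
kernel symbols:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE	4096UL
#define _PTE_PRESENT	(1UL << 0)
#define _PTE_SOFT_LOCK	(1UL << 9)	/* hypothetical software-available bit */

typedef uint64_t pte_t;

/* Stand-in for the page allocator: returns the address of a zeroed page. */
static uintptr_t alloc_zeroed_page(void)
{
	void *p = NULL;
	if (posix_memalign(&p, PAGE_SIZE, PAGE_SIZE))
		return 0;
	memset(p, 0, PAGE_SIZE);
	return (uintptr_t)p;
}

/* Atomically set the lock bit; nonzero means we now own this pte. */
static int pte_try_lock(pte_t *pte)
{
	return !(__sync_fetch_and_or(pte, _PTE_SOFT_LOCK) & _PTE_SOFT_LOCK);
}

static void pte_unlock(pte_t *pte)
{
	__sync_fetch_and_and(pte, ~(pte_t)_PTE_SOFT_LOCK);
}

/* Simplified anonymous fault: the pte is claimed *before* any page is
 * allocated, so a losing racer never allocates a page it must throw away. */
static int handle_anonymous_fault(pte_t *pte)
{
	uintptr_t page;

	if (!pte_try_lock(pte))
		return 0;		/* another CPU is already installing it */
	page = alloc_zeroed_page();
	if (!page) {
		pte_unlock(pte);
		return -1;
	}
	*pte = page | _PTE_PRESENT;	/* fresh value also drops the lock bit */
	return 1;
}

The point is only the ordering: the pte is claimed first, so the losing racer
never allocates a page it must immediately free.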

Major improvements.
1. No COW on anonymous VMAs.
2. Dynamic page mapping in core (kernel) space. This extends item 1 of the
minor improvements; related features include applying the DLPTE algorithm to
the core PTE array and introducing a RemapDaemon.

You can download http://www.cublog.cn/u/21764/upfile/060718173355.zip to learn
more about these features; it contains many diagrams that illustrate the ideas.
In the archive, MemoryArchitecture.pdf documents my Zero OS memory subsystem,
and the code under Implementation/memory/ shows sample implementations of it.
Note that, unlike some other OS projects, I do like to write documentation; my
OS is far from complete, even its memory subsystem.

My blog (Chinese site): http://zeroos.cublog.cn/

Regards,
                                                            Yunfeng Zhang


* Re: Improvement on memory subsystem
  2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang
@ 2006-07-18 12:18 ` Valdis.Kletnieks
  2006-07-19  3:44   ` yunfeng zhang
  2006-07-19  9:18   ` Ian Stirling
  2006-07-18 16:25 ` Pekka Enberg
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Valdis.Kletnieks @ 2006-07-18 12:18 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel


On Tue, 18 Jul 2006 18:03:54 +0800, yunfeng zhang said:

> 2. Read-ahead during page-in/out (page fault or swap-out) should be based on
> the VMA, rather than on adjacent physical pages in swap space, to improve I/O
> efficiency.

But wouldn't that end up causing a seek storm, rather than handling the pages
in the order that minimizes the total seek distance, no matter where they are
in memory? Remember - if you have a 2GHz processor, and a disk that seeks in 1
millisecond, every seek is (*very* roughly) about 2 million instructions.  So
if we can burn 20 thousand instructions finding a read order that eliminates
*one* seek, we're 1.98M instructions ahead.

Now, if you have an improved read-ahead that spews out page requests that
are both elevator-friendly and temporal-friendly, *then* you might be onto
something.  For instance, if you can identify 80 pages that will likely be
needed in the next 50 milliseconds, of which 50 pages will be likely needed
in the next 30ms, you want to issue those 50 first, in an elevator-friendly
manner (uncaffeinated handwave here) - and then issue the other 30 page
requests in a second burst the next time the elevator goes by.  Note this
requires the read-ahead to get a *lot* more chummy with the elevator than
it seems to currently.  In particular, readahead would need to be
able to hold off submitting the "later" 30 pages until it could be sure
that the elevator wouldn't merge them into the queue in a way that would
slow down the first 50 requests.

If it does that already, somebody just smack me.  And if there's a good
reason not to do that, hand me some caffeine and a clue. :)
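
A rough userspace sketch of the two-burst idea above, assuming each readahead
candidate carries an estimated deadline and a disk sector; ra_req,
submit_readahead() and submit_burst() are invented names, not existing kernel
interfaces:

#include <stdlib.h>

struct ra_req {
	unsigned long sector;		/* where the page lives on disk */
	unsigned int deadline_ms;	/* when we expect to need it */
};

static int by_sector(const void *a, const void *b)
{
	const struct ra_req *x = a, *y = b;
	return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Split candidates into an urgent burst and a deferred burst, and submit
 * each burst in elevator-friendly (ascending sector) order.  submit_burst()
 * stands in for handing the requests to the I/O scheduler. */
static void submit_readahead(struct ra_req *req, int n, unsigned int urgent_ms,
			     void (*submit_burst)(struct ra_req *, int))
{
	struct ra_req urgent[128], later[128];
	int nu = 0, nl = 0, i;

	for (i = 0; i < n && nu < 128 && nl < 128; i++) {
		if (req[i].deadline_ms <= urgent_ms)
			urgent[nu++] = req[i];
		else
			later[nl++] = req[i];
	}
	qsort(urgent, nu, sizeof(*urgent), by_sector);
	qsort(later, nl, sizeof(*later), by_sector);

	submit_burst(urgent, nu);	/* first pass of the elevator */
	submit_burst(later, nl);	/* held back for the next pass */
}

The missing piece, as noted above, is knowing when the elevator's next pass
comes, so the second call really can be held back instead of merged in early.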





* Re: Improvement on memory subsystem
  2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang
  2006-07-18 12:18 ` Valdis.Kletnieks
@ 2006-07-18 16:25 ` Pekka Enberg
  2006-07-19  3:21   ` yunfeng zhang
  2006-07-24  9:01 ` yunfeng zhang
  2006-08-23 10:39 ` yunfeng zhang
  3 siblings, 1 reply; 14+ messages in thread
From: Pekka Enberg @ 2006-07-18 16:25 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel

On 7/18/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote:
> 3. Make all slabs off-slab and store the slab instance in the page structure.

Not sure what you mean. We need much more than sizeof(struct page) for
slab management. Hmm?


* Re: Improvement on memory subsystem
  2006-07-18 16:25 ` Pekka Enberg
@ 2006-07-19  3:21   ` yunfeng zhang
  2006-07-19  8:30     ` Pekka Enberg
  0 siblings, 1 reply; 14+ messages in thread
From: yunfeng zhang @ 2006-07-19  3:21 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel

2006/7/19, Pekka Enberg <penberg@cs.helsinki.fi>:
> On 7/18/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote:
> > 3. Make all slabs off-slab and store the slab instance in the page structure.
>
> Not sure what you mean. We need much more than sizeof(struct page) for
> slab management. Hmm?
>

The current page struct looks like this:
struct page {
	unsigned long flags;
	atomic_t _count;
	atomic_t _mapcount;
	union {
		struct {
			unsigned long private;
			struct address_space *mapping;
		};
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
		spinlock_t ptl;
#endif
	};
	pgoff_t index;
	struct list_head lru;
#if defined(WANT_PAGE_VIRTUAL)
	void *virtual;
#endif /* WANT_PAGE_VIRTUAL */
};
Most fields in the page structure are used for user pages; for a core slab
page they aren't touched at all.
So I think we should define a union:
struct page {
	unsigned long flags;
	struct slab {
		struct list_head list;
		unsigned long colouroff;
		void *s_mem;
		unsigned int inuse;
		kmem_bufctl_t free;
		unsigned short nodeid;
	} slab;
};


* Re: Improvement on memory subsystem
  2006-07-18 12:18 ` Valdis.Kletnieks
@ 2006-07-19  3:44   ` yunfeng zhang
  2006-07-19  9:18   ` Ian Stirling
  1 sibling, 0 replies; 14+ messages in thread
From: yunfeng zhang @ 2006-07-19  3:44 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

> On Tue, 18 Jul 2006 18:03:54 +0800, yunfeng zhang said:
>
> But wouldn't that end up causing a seek storm, rather than handling the pages
> in the order that minimizes the total seek distance, no matter where they are
> in memory? Remember - if you have a 2GHz processor, and a disk that seeks in 1
> millisecond, every seek is (*very* roughly) about 2 million instructions.  So
> if we can burn 20 thousand instructions finding a read order that eliminates
> *one* seek, we're 1.98M instructions ahead.

A further example is shown below.

For a page fault (page-in operation): scan the pte that triggered the fault and
the ptes following it in the same VMA; if the followers are of swap_entry_t
type and their swap offsets are close enough to that of the faulting pte, read
them in together.

For the swap daemon (page-out operation): scan every pte of every VMA in the
OS; when we find a suitable candidate, lock it, and also lock its following
ptes if they are all suitable swap-out candidates, then allocate consecutive
swap pages from swap space. If that succeeds, issue one efficient asynchronous
I/O operation; if it fails, shrink the set of locked ptes.
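
A toy sketch of the page-in half of this idea, with invented types;
vma_readahead() and read_swap_cluster() are placeholders, not Linux functions:

#include <stddef.h>

/* Toy pte model: a swapped-out pte carries its swap offset. */
struct vpte {
	int is_swapped;
	unsigned long swap_offset;
};

/* On a fault at index `fault`, also pull in the following ptes of the same
 * VMA whose swap offsets lie within `window` slots of the faulting one, so a
 * single swap read services several future faults.  read_swap_cluster()
 * stands in for the actual I/O. */
static void vma_readahead(struct vpte *pte, size_t nr_pte, size_t fault,
			  unsigned long window,
			  void (*read_swap_cluster)(unsigned long, unsigned long))
{
	unsigned long base, lo, hi;
	size_t i;

	if (!pte[fault].is_swapped)
		return;
	base = lo = hi = pte[fault].swap_offset;

	for (i = fault + 1; i < nr_pte && pte[i].is_swapped; i++) {
		unsigned long off = pte[i].swap_offset;
		if (off + window < base || off > base + window)
			break;			/* too far away on the swap device */
		if (off < lo) lo = off;
		if (off > hi) hi = off;
	}
	read_swap_cluster(lo, hi - lo + 1);	/* one request instead of one per pte */
}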

Isn't that right?

By the way, all the improvements I listed are only introduced briefly; most of
them are complex, and perhaps only my documentation can describe them clearly.


* Re: Improvement on memory subsystem
  2006-07-19  3:21   ` yunfeng zhang
@ 2006-07-19  8:30     ` Pekka Enberg
  2006-07-19 10:13       ` yunfeng zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Pekka Enberg @ 2006-07-19  8:30 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel

On 7/18/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote:
> > > 3. Make all slabs off-slab and store the slab instance in the page structure.

2006/7/19, Pekka Enberg <penberg@cs.helsinki.fi>:
> > Not sure what you mean. We need much more than sizeof(struct page) for
> > slab management. Hmm?

On 7/19/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote:
> The current page struct looks like this:

[snip]

Which, like I said, is not enough to hold slab management structures
(we need an array of bufctl_t in addition to struct slab).
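
For a rough feel of the sizes involved, a small standalone program (the struct
below is only a local copy loosely mirroring the 2.6-era layout, and the
object count is arbitrary):

#include <stdio.h>

typedef unsigned int kmem_bufctl_t;

/* Rough local copy of the slab management header. */
struct slab {
	struct { void *next, *prev; } list;
	unsigned long colouroff;
	void *s_mem;
	unsigned int inuse;
	kmem_bufctl_t free;
	unsigned short nodeid;
};

int main(void)
{
	unsigned int objs = 128;	/* e.g. a cache of small objects */
	unsigned long mgmt = sizeof(struct slab) + objs * sizeof(kmem_bufctl_t);

	printf("struct slab alone : %zu bytes\n", sizeof(struct slab));
	printf("slab + bufctls    : %lu bytes\n", mgmt);
	return 0;
}

Even for such a modest cache the management data runs to several hundred
bytes, an order of magnitude more than a 32-bit struct page (roughly 32
bytes), which is the point being made here.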


* Re: Improvement on memory subsystem
  2006-07-18 12:18 ` Valdis.Kletnieks
  2006-07-19  3:44   ` yunfeng zhang
@ 2006-07-19  9:18   ` Ian Stirling
  2006-07-19 14:56     ` Valdis.Kletnieks
  1 sibling, 1 reply; 14+ messages in thread
From: Ian Stirling @ 2006-07-19  9:18 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: yunfeng zhang, linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> On Tue, 18 Jul 2006 18:03:54 +0800, yunfeng zhang said:
> 
> 
>>2. Read-ahead during page-in/out (page fault or swap-out) should be based on
>>the VMA, rather than on adjacent physical pages in swap space, to improve I/O
>>efficiency.
> 
> 
> But wouldn't that end up causing a seek storm, rather than handling the pages
> in the order that minimizes the total seek distance, no matter where they are
> in memory? Remember - if you have a 2GHz processor, and a disk that seeks in 1
> millisecond, every seek is (*very* roughly) about 2 million instructions.  So
> if we can burn 20 thousand instructions finding a read order that eliminates
> *one* seek, we're 1.98M instructions ahead.

To paraphrase Shakespeare - all the world is not a P4 - and all swap
devices are not hard disks.

For example - I've got a 486/33 laptop with 12M RAM that I sometimes use,
swapping to a 128M PCMCIA RAM card that I got from somewhere.

20K instructions wasted on a device with no seek time is just annoying.

And on my main laptop - I have experimented with swap-over-wifi to a
large ramdisk on my server - which works quite well (until the wifi
connection falls over).


* Re: Improvement on memory subsystem
  2006-07-19  8:30     ` Pekka Enberg
@ 2006-07-19 10:13       ` yunfeng zhang
  2006-07-19 10:35         ` Pekka Enberg
  0 siblings, 1 reply; 14+ messages in thread
From: yunfeng zhang @ 2006-07-19 10:13 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel

> Which, like I said, is not enough to hold slab management structures
> (we need an array of bufctl_t in addition to struct slab).
>

Here, off-slab means the same as the off-slab concept in Linux; doesn't
Linux store the bufctl_t array in its off-slab object?

I think we should try our best to exploit the potential of the page
structure. In my OS, the page structure is effectively a union and is cast
to different types according to its flags field.


* Re: Improvement on memory subsystem
  2006-07-19 10:13       ` yunfeng zhang
@ 2006-07-19 10:35         ` Pekka Enberg
  0 siblings, 0 replies; 14+ messages in thread
From: Pekka Enberg @ 2006-07-19 10:35 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel

On 7/19/06, yunfeng zhang <zyf.zeroos@gmail.com> wrote:
> Here, off-slab means the same as the off-slab concept in Linux; doesn't
> Linux store the bufctl_t array in its off-slab object?
>
> I think we should try our best to exploit the potential of the page
> structure. In my OS, the page structure is effectively a union and is cast
> to different types according to its flags field.

The slab allocator currently allocates space for struct slab and the
bufctls next to each other regardless of whether we are allocating
on-slab or off-slab. Your approach of splitting them results in more
data cache pressure for the cache_alloc() path, I think.


* Re: Improvement on memory subsystem
  2006-07-19  9:18   ` Ian Stirling
@ 2006-07-19 14:56     ` Valdis.Kletnieks
  0 siblings, 0 replies; 14+ messages in thread
From: Valdis.Kletnieks @ 2006-07-19 14:56 UTC (permalink / raw)
  To: Ian Stirling; +Cc: yunfeng zhang, linux-kernel


On Wed, 19 Jul 2006 10:18:44 BST, Ian Stirling said:

> To paraphrase Shakespeare - all the world is not a P4 - and all swap
> devices are not hard disks.

Been there, done that.  I used to admin a net of Sun 3/50s where /dev/swap
was a symlink to a file on an NFS server, because the "shoebox" local hard
drives for those were so slow that throwing it across the ethernet to a
3/280 with Fujitsu Super-Eagles was faster...

> For example - I've got a 486/33 laptop with 12M RAM that I sometimes use,
> swapping to a 128M PCMCIA RAM card that I got from somewhere.

If we go to the effort of writing code that tries to be smart about grouping
swap reads/writes by cost, it's easy enough to flag any sort of ram-disk device
as a 'zero seek time' device.  Remember that I suggested making it dependent
on "how long until the next pass of the elevator" - for a ramdisk that basically
is zero, so the algorithm easily degenerates into "just queue the requests in
the order you expect to need the results".

> 20K instructions wasted on a device with no seek time is just annoying.

On the other hand, how long does it take to move a 4K page across the
PCMCIA interface?  If you're seeing deep queues on it, you may *still*
want to optimize the order of requests...




* Re: Improvement on memory subsystem
  2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang
  2006-07-18 12:18 ` Valdis.Kletnieks
  2006-07-18 16:25 ` Pekka Enberg
@ 2006-07-24  9:01 ` yunfeng zhang
  2006-07-24 14:55   ` David Lang
  2006-08-23 10:39 ` yunfeng zhang
  3 siblings, 1 reply; 14+ messages in thread
From: yunfeng zhang @ 2006-07-24  9:01 UTC (permalink / raw)
  To: linux-kernel

How can we let the memory subsystem allocate larger consecutive blocks of
memory pages? In current Linux, a driver programmer's request to alloc_pages
with a large enough order parameter almost always fails; in other words, it's
difficult to allocate a large consecutive physical memory block.

The reasons causing the problem, I think, are the items listed below:
1) Core space has a static mapping to physical memory pages. So once a core
page is allocated, its core address is also fixed, which prevents the
physical pages around it from being coalesced together.
2) The current physical page management algorithm is the buddy algorithm.
Its main advantage is that the pages it manages are always aligned to a
power of two. But is that necessary -- is there any hardware which needs
physical memory pages aligned to a power of two or more?

The solution is:
1) Use dynamic page mapping in core space. Then we can move all core pages
freely at any time to build larger consecutive memory blocks; a new
background daemon thread -- RemapDaemon -- can do this coalescing
periodically.
2) Use another page management algorithm instead of buddy, whose minimum unit
is the page. In fact, I think the dlmalloc algorithm is a good candidate; it
also has good page-coalescing affinity (see the toy sketch below).
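
A toy first-fit sketch of point 2 -- not dlmalloc itself, which keeps
size-segregated free lists with boundary tags -- just to show how per-page
bookkeeping lets freed neighbours merge back into long runs without any
power-of-two constraint; all names here are invented:

#include <stddef.h>

#define NR_PAGES 1024

static unsigned char page_used[NR_PAGES];	/* 0 = free, 1 = allocated */

/* First-fit search for `count` consecutive free pages; returns the start
 * index or -1.  A real allocator would keep sorted free-run lists instead of
 * scanning, but the coalescing behaviour is the same. */
static long alloc_page_run(size_t count)
{
	size_t i, run = 0;

	for (i = 0; i < NR_PAGES; i++) {
		run = page_used[i] ? 0 : run + 1;
		if (run == count) {
			size_t start = i + 1 - count, j;
			for (j = start; j <= i; j++)
				page_used[j] = 1;
			return (long)start;
		}
	}
	return -1;
}

static void free_page_run(size_t start, size_t count)
{
	size_t i;

	for (i = start; i < start + count; i++)
		page_used[i] = 0;	/* adjacent free runs merge implicitly */
}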


* Re: Improvement on memory subsystem
  2006-07-24  9:01 ` yunfeng zhang
@ 2006-07-24 14:55   ` David Lang
  2006-07-25  8:33     ` yunfeng zhang
  0 siblings, 1 reply; 14+ messages in thread
From: David Lang @ 2006-07-24 14:55 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel

On Mon, 24 Jul 2006, yunfeng zhang wrote:

> How can we let the memory subsystem allocate larger consecutive blocks of
> memory pages? In current Linux, a driver programmer's request to alloc_pages
> with a large enough order parameter almost always fails; in other words,
> it's difficult to allocate a large consecutive physical memory block.
>
> The reasons causing the problem, I think, are the items listed below:
> 1) Core space has a static mapping to physical memory pages. So once a core
> page is allocated, its core address is also fixed, which prevents the
> physical pages around it from being coalesced together.
> 2) The current physical page management algorithm is the buddy algorithm.
> Its main advantage is that the pages it manages are always aligned to a
> power of two. But is that necessary -- is there any hardware which needs
> physical memory pages aligned to a power of two or more?
>
> The solution is:
> 1) Use dynamic page mapping in core space. Then we can move all core pages
> freely at any time to build larger consecutive memory blocks; a new
> background daemon thread -- RemapDaemon -- can do this coalescing
> periodically.

This gets discussed periodically; however, the performance hit of doing the
mapping for all core memory accesses is something the developers have not been
willing to accept.

> 2) Use another page management algorithm instead of buddy, whose minimum
> unit is the page. In fact, I think the dlmalloc algorithm is a good
> candidate; it also has good page-coalescing affinity.

Experiment; if it's as good as you think it will be, post the numbers and you
will get a lot of attention. Quite a few people are working on memory
allocation for various things; there are a lot of usage patterns to balance
(and pathological problems to avoid).

David Lang


* Re: Improvement on memory subsystem
  2006-07-24 14:55   ` David Lang
@ 2006-07-25  8:33     ` yunfeng zhang
  0 siblings, 0 replies; 14+ messages in thread
From: yunfeng zhang @ 2006-07-25  8:33 UTC (permalink / raw)
  To: David Lang; +Cc: linux-kernel, valdis.kletnieks, penberg

No COW in Private VMA
In current Linux 2.6.16, the kernel applies the copy-on-write technique when
an application passes the CLONE_MM parameter to do_fork; as a result, a
process's private VMA shares its private pages with another process. This
increases memory subsystem complexity.

In fact, an OS should be application-oriented, not standard-oriented. In most
cases, supporting the POSIX thread model, vfork and execve is enough for
applications. In other words, we should optimize the system for the frequent
cases, and only do the copy when someone really calls fork with CLONE_MM.

No COW in a private VMA gives a simple one-to-one relationship among a pte,
its private page, and its swap_entry. We benefit from this model as follows.

A new PTE type is introduced here before we go on:
	struct UnmappedPTE {
		unsigned present : 1;	/* = 0 */
		...;			/* remaining flag bits */
		unsigned pageNum : 20;
	};

1) For the swap daemon, we can give a fairer opportunity to every private
page. As I've suggested, the swap daemon should swap out pages based on the
VMA instead of the memory page array. So we do the steps listed below for
every pte of a private VMA (a sketch follows at the end of this mail):
	a) Convert the pte to the UnmappedPTE type. This doesn't free the private
	page at all; UnmappedPTE::pageNum keeps track of it.
	b) Allocate a swap entry for the pte's private page and page it out;
	remember to do this for the current pte and its following ptes together,
	as I've explained above.
	c) If the page is still untouched, reclaim it and convert the pte to the
	SwappedPTE type.
This flow is better than the current Linux implementation, I think.
2) We can save a little memory. The current swap code has two parts: one is
swap_info_struct, which traces the physical swap area with the unsigned short
swap_info_struct::swap_map array -- this could become a bit array; the other
is an address_space structure, used by a process to test whether its swap
pages have already been read into memory -- this could be discarded, since
once a swap page is read in, it is linked to its pte by an UnmappedPTE.

Note that No COW in PrivateVMA is one of the major improvements.
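
A standalone sketch of the a)/b)/c) flow above for a single pte; the vpte
type and the three callbacks are placeholders invented for illustration, not
kernel interfaces:

#include <stdint.h>

enum vpte_kind { VPTE_PRESENT, VPTE_UNMAPPED, VPTE_SWAPPED };

struct vpte {
	enum vpte_kind kind;
	uint32_t page_num;	/* physical page while present or unmapped */
	uint32_t swap_slot;	/* swap location once swapped out */
	int accessed;		/* set again if a fault touches it meanwhile */
};

/* a) demote PTE -> UnmappedPTE, keeping page_num so the page isn't lost;
 * b) write the page to a freshly allocated swap slot;
 * c) if nothing touched it meanwhile, drop the page and record SwappedPTE. */
static void swap_out_one(struct vpte *p,
			 uint32_t (*alloc_swap_slot)(void),
			 void (*write_page_to_swap)(uint32_t page, uint32_t slot),
			 void (*free_page)(uint32_t page))
{
	if (p->kind != VPTE_PRESENT)
		return;

	p->kind = VPTE_UNMAPPED;		/* step a: unmap, keep page_num */
	p->accessed = 0;

	p->swap_slot = alloc_swap_slot();	/* step b: page it out */
	write_page_to_swap(p->page_num, p->swap_slot);

	if (!p->accessed) {			/* step c: still untouched? */
		free_page(p->page_num);
		p->kind = VPTE_SWAPPED;
	} else {
		p->kind = VPTE_PRESENT;		/* it was refaulted; keep the page */
	}
}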


* Re: Improvement on memory subsystem
  2006-07-18 10:03 Improvement on memory subsystem yunfeng zhang
                   ` (2 preceding siblings ...)
  2006-07-24  9:01 ` yunfeng zhang
@ 2006-08-23 10:39 ` yunfeng zhang
  3 siblings, 0 replies; 14+ messages in thread
From: yunfeng zhang @ 2006-08-23 10:39 UTC (permalink / raw)
  To: linux-kernel

A new design has been added to my OS; it mainly focuses on how the swap
daemon can allocate swap pages for each PrivateVMA efficiently.

Let's first look at a snapshot of a PrivateVMA's PTE array:
           U-U-P-P-P-S-S
U: UnmappedPTE; P: present PTE; S: SwappedPTE.

The snapshot shows that the pages of the PrivateVMA have access affinity:
the 3rd, 4th and 5th have been accessed since the last scan of the
PrivateVMA, while the other PTEs remain untouched.

A concept -- the series -- is introduced here to describe runs of PTEs of the
same kind. So the snapshot gives us three series: (1st and 2nd), (3rd, 4th
and 5th) and (6th and 7th).

The design allocates swap pages by series. That should make it more efficient
for future page faults to page them back in (a rough sketch follows).
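
A small sketch of series-based swap allocation; the vpte kinds and
alloc_swap_run() are invented for illustration:

#include <stddef.h>

enum vpte_kind { VPTE_UNMAPPED, VPTE_PRESENT, VPTE_SWAPPED };

/* Walk a PrivateVMA's pte array and give each maximal run ("series") of
 * present ptes one contiguous block of swap slots, so a later fault on any
 * member can page the whole series back in with a single read.
 * alloc_swap_run() stands in for allocating `len` contiguous swap pages and
 * returning the first slot. */
static void assign_swap_by_series(const enum vpte_kind *pte, size_t n,
				  unsigned long *slot,
				  unsigned long (*alloc_swap_run)(size_t len))
{
	size_t i = 0;

	while (i < n) {
		size_t start, len, j;
		unsigned long base;

		if (pte[i] != VPTE_PRESENT) {	/* skip U and S entries */
			i++;
			continue;
		}
		start = i;
		while (i < n && pte[i] == VPTE_PRESENT)
			i++;
		len = i - start;

		base = alloc_swap_run(len);	/* one contiguous allocation per series */
		for (j = 0; j < len; j++)
			slot[start + j] = base + j;
	}
}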

Further discussion is in my new documentation,
http://www.cublog.cn/u/21764/upfile/060823181124.zip, in the sections
"Memory Daemons >> SwapDaemon and PrivateVMA" and "Class Repository >>
SwapDaemon". There is also sample code in the Implementation subdirectory of
the archive.

