* Buffer management - interesting idea
@ 2001-06-06 11:39 Ivan Schreter
  2001-06-15  9:07 ` Helge Hafting
  ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Ivan Schreter @ 2001-06-06 11:39 UTC (permalink / raw)
To: linux-kernel

Hello,

I'm working on some hi-speed DB projects under Linux and I was researching
various buffer-replacement algorithms. I found the 2Q buffer replacement
policy at

http://citeseer.nj.nec.com/63909.html

Maybe it would be interesting to use it instead of LRU for disk buffer
replacement. Seems relatively easy to implement and costs are about the
same as for LRU.

I'm not subscribed to the list, so please CC: replies to me (is@zapwerk.com).

Ivan Schreter
is@zapwerk.com

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Buffer management - interesting idea
  2001-06-06 11:39 Buffer management - interesting idea Ivan Schreter
@ 2001-06-15  9:07 ` Helge Hafting
  2001-06-15  9:41 ` Ivan Schreter
  2001-06-16 18:35 ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 7+ messages in thread
From: Helge Hafting @ 2001-06-15 9:07 UTC (permalink / raw)
To: Ivan Schreter, linux-kernel

Ivan Schreter wrote:
>
> Hello,
>
> I'm working on some hi-speed DB projects under Linux and I was researching
> various buffer-replacement algorithms. I found 2Q buffer replacement policy at
>
> http://citeseer.nj.nec.com/63909.html
>
> Maybe it would be interesting to use it instead of LRU for disk buffer
> replacement. Seems relatively easy to implement and costs are about the same as
> for LRU.

The "resistance to scanning" seemed interesting, maybe one-time
activities like a "find" run or big cat/dd will have less impact with
this.

Helge Hafting

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Buffer management - interesting idea
  2001-06-06 11:39 Buffer management - interesting idea Ivan Schreter
  2001-06-15  9:07 ` Helge Hafting
@ 2001-06-15  9:41 ` Ivan Schreter
  2001-06-15 17:05   ` Ivan Schreter
  2001-06-16 18:35 ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 7+ messages in thread
From: Ivan Schreter @ 2001-06-15 9:41 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]

Hello,

Please CC: replies to me, since I am not subscribed.

>> http://citeseer.nj.nec.com/63909.html

> The "resistance to scanning" seemed interesting, maybe one-time
> activities like a "find" run or big cat/dd will have less impact with
> this.

Exactly. But not only that. I have made some experimental tests. When you
have totally random access to data, you get degradation to LRU performance
(but no worse). However, when doing normal work, the results are quite
encouraging. The speedup percentage is computed assuming that processing an
in-memory buffer takes time x and loading an on-disk buffer takes time 100x:

#blocks  buffers   P   hitLRU   hit2Q   2Q+   speedup
 262400    16384   8   19.29%  24.89%   29%    7.365%
 262400    16384   4   11.56%  14.23%   23%    3.084%
  1024K    16384  32   16.90%  22.82%   35%    7.573%
  1024K    16384  16    9.00%  11.94%   33%    3.305%
  1024K    16384   8    4.94%   6.38%   29%    1.515%
  4096K    16384  64    8.40%  11.39%   36%    3.339%
  4096K    16384  32    4.32%   5.79%   34%    1.536%

I used reasonable figures for filesystem size (1G, 4G and 16G) and buffer
cache (64MB), with blocksize 4K. All simulations used 10% for the A1in and
25% for the A1out queues.

Almost all simulations show an over 30% better hit rate than LRU,
translating to a ~3% real-time speedup for a normal workload. Speedup for
a scanning type workload (find, cat large file, etc.)
should be even better - write an access pattern generator and try it :-)

Constant P determines the "randomness" of accesses as follows:

    int x = random() % (NR * LEVEL);
    for (int i = LEVEL - 1; i > 0; --i)
        if (x >= NR * i)
            x = (x - NR * i) / (i + 1);

where x is the generated block # to be accessed, LEVEL is P and NR is the
# of disk blocks.

I attach a crude program to simulate LRU, 2Q and FIFO buffer management
policies. If you want, you can play with the parameters a little bit and
simulate other workloads (e.g., record what's going on in the system and
then simulate real system block accesses). I would like to hear from you
if you have some interesting results.

--
Ivan Schreter
is@zapwerk.com

[-- Attachment #2: test_bufrep.cpp --]
[-- Type: text/plain, Size: 7103 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <signal.h>

int ops = 0, hit_lru = 0, hit_2q = 0, hit_fifo = 0;

#define BUFS 16384      // # of buffer entries in memory
#define NR 262400       // # of DB buffers
#define LEVEL 8         // max. test level (1=total random)

//#define USE_FASTAGE

#define BUFSA1IN (BUFS/10)
#define BUFSA1OUT (BUFS/4)

typedef struct {
    int next;
    int prev;
    int id;
    int age;
} lru_entry;

lru_entry lru[BUFS];
int lru_idx[NR];
int lru_free = 0, lru_head = 0, lru_tail = 0;

void lru_page(int x)
{
    if (lru_idx[x] >= 0) {
        hit_lru++;
        // move to beginning of the buffer
        x = lru_idx[x];
        if (x == lru_head) return;
        lru[lru[x].prev].next = lru[x].next;
        if (x == lru_tail) {
            lru_tail = lru[x].prev;
        } else {
            lru[lru[x].next].prev = lru[x].prev;
        }
        lru[x].prev = -1;
        lru[x].next = lru_head;
        lru[lru_head].prev = x;
        lru_head = x;
        return;
    }
    if (lru_free == -1) {
        // remove one from tail
        lru[lru_tail].next = lru_free;
        lru[lru_tail = lru[lru_free = lru_tail].prev].next = -1;
        lru_idx[lru[lru_free].id] = -1;
    }
    // add to head
    int r = lru_free;
    lru_free = lru[r].next;
    lru[r].prev = -1;
    lru[r].next = lru_head;
    lru[r].id = x;
    lru[lru_head].prev = r;
    if (lru_head < 0) lru_tail = r;
    lru_head = r;
    lru_idx[x] = r;
}

lru_entry r2q[BUFS];
int r2q_a1out[BUFSA1OUT];
int r2q_idx[NR];
int r2q_free = 0, r2q_head = -1, r2q_tail = -1;
int r2q_a1i_head = -1, r2q_a1i_tail = -1, r2q_a1i_cnt = 0;
int r2q_a1o_head = 0, r2q_a1o_tail = 0, r2q_age = 0;

#define R2Q_MASK 0xffffff
#define R2Q_FLG_MASK 0xff000000
#define R2Q_A1I_FLG 0x1000000
#define R2Q_A1O_FLG 0x2000000
#define R2Q_AM_FLG 0x4000000

void r2q_reclaim()
{
    // free one page
    if (r2q_a1i_cnt > BUFSA1IN) {
        // remove last page from A1in and put to A1out
        int r = r2q_a1i_tail;
#if 0
        if (r < 0) {
            printf("Reclaim err: %d - %d\n", r2q_a1i_tail, r2q_a1i_head);
            raise(SIGSEGV);
        }
#endif
        r2q_a1i_tail = r2q[r].prev;
        r2q[r2q_a1i_tail].next = -1;
        --r2q_a1i_cnt;
        //printf("Del2 %d, cnt = %d\n", r, r2q_a1i_cnt);
        int x = r2q[r].id;
        r2q[r].next = r2q_free;
        r2q_free = r;
        r2q_idx[x] = R2Q_MASK | R2Q_A1O_FLG;
        //if (x >= NR || x < 0) printf("set at %d = %d (@%d), head %d\n",
        //    r2q_a1o_tail, x, r, r2q_a1o_head);
        r2q_a1out[r2q_a1o_tail++] = x;
        if (r2q_a1o_tail == BUFSA1OUT) r2q_a1o_tail = 0;
        if (r2q_a1o_tail == r2q_a1o_head) {
            // overflow, remove from buf
            int d = r2q_a1out[r2q_a1o_head++];
            if (r2q_a1o_head == BUFSA1OUT) r2q_a1o_head = 0;
            // unmark in index
            r2q_idx[d] &= ~R2Q_A1O_FLG;
        }
        return;
    }
    // remove from tail of Am, do not put in A1out (it was least recently
    // used, so no point in keeping it)
    int x = r2q[r2q_tail].id;
    int r = r2q_tail;
    r2q_idx[x] = R2Q_MASK;
    r2q_tail = r2q[r].prev;
    r2q[r2q_tail].next = -1;
    r2q[r].next = r2q_free;
    r2q_free = r;
}

void r2q_page(int x)
{
    int r = r2q_idx[x] & R2Q_MASK, rm = r2q_idx[x] & R2Q_FLG_MASK;
    if (rm & R2Q_AM_FLG) {
        // already in buffer
        hit_2q++;
        // in LRU buffer, move to first
        if (r == r2q_head) return;
        r2q[r2q[r].prev].next = r2q[r].next;
        if (r == r2q_tail) {
            r2q_tail = r2q[r].prev;
        } else {
            r2q[r2q[r].next].prev = r2q[r].prev;
        }
        r2q[r].prev = -1;
        r2q[r].next = r2q_head;
        r2q[r2q_head].prev = r;
        r2q_head = r;
        return;
    } else if (rm & R2Q_A1O_FLG) {
        // was in A1out, put to head of Am
        if (rm & R2Q_A1I_FLG) {
            // remove from A1in (already in memory)
            hit_2q++;
            --r2q_a1i_cnt;
            if (r == r2q_a1i_head) {
                r2q_a1i_head = r2q[r].next;
                if (r2q_a1i_cnt) r2q[r2q_a1i_head].prev = -1;
            } else if (r == r2q_a1i_tail) {
                r2q_a1i_tail = r2q[r].prev;
                r2q[r2q_a1i_tail].next = -1;
            } else {
                // in the middle
                r2q[r2q[r].next].prev = r2q[r].prev;
                r2q[r2q[r].prev].next = r2q[r].next;
            }
            //printf("Del %d, cnt = %d\n", r, r2q_a1i_cnt);
        } else {
            // not yet in memory
            if (r2q_free < 0) r2q_reclaim();
            r = r2q_free;
            r2q_free = r2q[r].next;
            r2q[r].id = x;
            r2q_idx[x] = r;
        }
        // add to head of Am
        r2q[r].prev = -1;
        r2q[r].next = r2q_head;
        if (r2q_head < 0) r2q_tail = r;
        else r2q[r2q_head].prev = r;
        r2q_head = r;
        // must check out flag, may be changed by reclaim
        r2q_idx[x] = r | R2Q_AM_FLG | (r2q_idx[x] & R2Q_A1O_FLG);
        return;
    } else if (rm & R2Q_A1I_FLG) {
        // do nothing - is in memory in A1in queue
        hit_2q++;
#ifdef USE_FASTAGE
#warning Using FastAge
        // mark into R2Q_A1O when in last 25% or so...
        int age = r2q_age - r2q[r].age;
        // age cannot be significantly more than r2q_cnt, normally is
        // less
        if ((r2q_a1i_cnt - age) < (r2q_a1i_cnt << 2)) {
            // is in last 25% of its life, put to out pages
            r2q_idx[x] |= R2Q_A1O_FLG;
            r2q_a1out[r2q_a1o_tail++] = x;
            if (r2q_a1o_tail == BUFSA1OUT) r2q_a1o_tail = 0;
            if (r2q_a1o_tail == r2q_a1o_head) {
                // overflow, remove from buf
                int d = r2q_a1out[r2q_a1o_head++];
                if (r2q_a1o_head == BUFSA1OUT) r2q_a1o_head = 0;
                // unmark in index
                r2q_idx[d] &= ~R2Q_A1O_FLG;
            }
        }
#endif
        return;
    } else {
        // is in no queue, load & add to head of A1in
        if (r2q_free < 0) r2q_reclaim();
        r = r2q_free;
        r2q_free = r2q[r].next;
        r2q[r].id = x;
        r2q[r].age = r2q_age++;
        r2q[r].next = r2q_a1i_head;
        r2q[r].prev = -1;
        if (!r2q_a1i_cnt) {
            // first item, set also tail
            r2q_a1i_tail = r;
        } else {
            r2q[r2q_a1i_head].prev = r;
        }
        r2q_a1i_cnt++;
        r2q_a1i_head = r;
        //printf("Add %d, cnt = %d\n", r, r2q_a1i_cnt);
        //if (r == 0 && r2q_a1i_cnt > 1) raise(SIGSTOP);
        r2q_idx[x] = r | R2Q_A1I_FLG;
    }
}

int fifo_idx[NR];
int fifo[BUFS];
int fifo_head = 0, fifo_tail = 0;

void fifo_page(int x)
{
    if (fifo_idx[x] >= 0) {
        hit_fifo++;
        return;
    }
    fifo_idx[x] = fifo_head;
    fifo[fifo_head++] = x;
    if (fifo_head == BUFS) fifo_head = 0;
    if (fifo_head == fifo_tail) {
        fifo_idx[fifo[fifo_tail++]] = -1;
        if (fifo_tail == BUFS) fifo_tail = 0;
    }
}

int main()
{
    memset(lru_idx, -1, sizeof(lru_idx));
    memset(fifo_idx, -1, sizeof(fifo_idx));
    for (int i = 0; i < NR; ++i) r2q_idx[i] = R2Q_MASK;
    for (int i = 0; i < BUFS - 1; ++i) {
        lru[i].next = i + 1;
        r2q[i].next = i + 1;
    }
    lru[BUFS-1].next = -1;
    r2q[BUFS-1].next = -1;
    time_t tm = time(NULL);
    for (;;) {
        int x = random() % (NR * LEVEL);
        for (int i = LEVEL - 1; i > 0; --i)
            if (x >= NR * i)
                x = (x - NR * i) / (i + 1);
        lru_page(x);
        r2q_page(x);
        fifo_page(x);
        ops++;
        time_t cur_tm = time(NULL);
        if (cur_tm != tm) {
            int lru_ptime = (ops + 99 * (ops - hit_lru)) / 1000;
            int r2q_ptime = (ops + 99 * (ops - hit_2q)) / 1000;
            printf("Ops: %d, LRU: %d %.2f%%, 2Q: %d %.2f%%, "
                   "FIFO: %d %.2f%%, 2Q/LRU %.2f\n"
                   "LRU ptime: %d, 2Q ptime: %d, "
                   "ratio 2Q/LRU: %.3f%% +%.3f%%\n",
                   ops / 1000, hit_lru / 1000, hit_lru * 100.0 / ops,
                   hit_2q / 1000, hit_2q * 100.0 / ops,
                   hit_fifo / 1000, hit_fifo * 100.0 / ops,
                   hit_2q * 1.0 / hit_lru,
                   lru_ptime, r2q_ptime,
                   r2q_ptime * 100.0 / lru_ptime,
                   lru_ptime * 100.0 / r2q_ptime - 100.0);
            tm = cur_tm;
        }
    }
    return 0;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread
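[Editor's note: the skew generator driving the simulation above can be sanity-checked in isolation. The following Python translation is an editor's sketch, not part of the original posting (the function name skewed_block is invented); it shows that higher LEVEL values concentrate accesses on low-numbered blocks, which is what makes the 2Q/LRU hit-rate gap visible in the tables.]

```python
import random

def skewed_block(nr, level, rng):
    """Python translation of the access generator in test_bufrep.cpp:
    draw x uniformly from [0, nr*level), then fold each upper slice
    [nr*i, nr*(i+1)) down into [0, nr/(i+1)), favoring low block numbers."""
    x = rng.randrange(nr * level)
    for i in range(level - 1, 0, -1):
        if x >= nr * i:
            x = (x - nr * i) // (i + 1)
    return x

rng = random.Random(42)
nr, level = 1000, 8
samples = [skewed_block(nr, level, rng) for _ in range(100_000)]

# Share of accesses landing in the lowest 25% of blocks; a uniform
# generator would give 0.25, the folded one gives much more.
low = sum(1 for x in samples if x < nr // 4)
print(f"{low / len(samples):.2f}")
```

With LEVEL=8, roughly four fifths of all accesses fall on the lowest quarter of the blocks, a plausible stand-in for a "normal work" locality pattern.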
* Re: Buffer management - interesting idea
  2001-06-15  9:41 ` Ivan Schreter
@ 2001-06-15 17:05 ` Ivan Schreter
  2001-06-15 18:50 ` Rik van Riel
  2001-06-15 23:51 ` Ivan Schreter
  0 siblings, 2 replies; 7+ messages in thread
From: Ivan Schreter @ 2001-06-15 17:05 UTC (permalink / raw)
To: John Clemens; +Cc: linux-kernel

Hello,

I'm not subscribed to the list, so please CC: replies to me.

On Fri, 15 Jun 2001 12:20:52 -0400 (EDT) John Clemens <john@deater.net> wrote:
> Is there any way for you to measure the relative computational overhead
> required for the two algorithms? The benefit of the higher hit rate may
> be lost if the computational time increases by too much.

Well, the *computational* overhead is negligible - processing is O(1), just
like LRU. Look at the program attached to my previous message; it contains
an implementation of this algorithm.

In LRU, all you have to do is move a page to the front of a doubly-linked
list when the page is accessed, and swap out the last page when the buffer
is full and a request for a new page comes in.

In 2Q, when a page is present in the LRU queue, you move it to the front of
the LRU queue as usual. Otherwise, if it is in memory, it must be in the A1
queue (the FIFO one), so you DON'T do anything. When the page is NOT in
memory, you load it either into the Am LRU queue (if it is present in the
A1out queue) or into the A1 queue (FIFO), if not.

It gets interesting when you need to swap a page out of memory. When the
size of the A1 FIFO is greater than a limit (e.g., 10% of buffer space), a
page from A1 is swapped out (and put into the A1out list); otherwise a page
from the Am LRU is swapped out (and NOT put into A1out, since it hasn't
been accessed for a long time).

So the general algorithm is not much more complicated than LRU. There is
only one interesting question: how to implement the A1out queue (also a
FIFO) so that we can quickly tell whether a page is in A1out. In the sample
implementation I'm currently using a cyclic buffer for A1out and a flag in
the buffer map marking whether a page is in A1out.
In a real system this flag must be replaced by some kind of hashtable
containing an entry for every page in the A1out queue, so that we can
decide in O(1) whether a page is in A1out when loading it from disk. The
alternative is a bitmask, but then we need, e.g., for 64GB at 4kB page
size, 2MB of RAM (which is too much). Considering we may have up to 32K
pages mapped in memory (128MB buffer), a hashtable with, say, 64K entries
at 50% A1out (16K entries) is much more memory efficient (512kB). If we
limit the A1out queue to something less (maybe 10-20% of the buffer block
count), we need even less space.

Since I use it for a database application and know the size of the
underlying dataspace in advance, I use the bitmask method in my program.
This is probably not good for the kernel.

Somebody should investigate the potential impact of the algorithm on a real
system. If you send me a trace from a live system, I am willing to analyze
it. Format for the trace: a text file with lines in one of two forms:

    time blockid buffersizecur buffersizemax
    time blockid touch

where time is monotonically increasing, blockid is some unique ID of the
block requested (also for blocks already in memory!), buffersizecur is the
current size of the buffer space (since we have variable buffer space under
Linux) and buffersizemax is buffersizecur + size of free memory (or a flag
whether we can put a new block in a free memory block). The second form
(with the word "touch") is for the purpose of counting the extra time
needed for swapping the block back out to disk. Optionally, the time may be
left out.

--
Ivan Schreter
is@zapwerk.com

^ permalink raw reply	[flat|nested] 7+ messages in thread
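[Editor's note: Ivan's prose description translates almost mechanically into code. The sketch below is an editor's illustration in Python - all names are invented, and a plain dict (via OrderedDict) plays the role of the hashtable he proposes for the O(1) A1out membership test. It is a toy model of 2Q, not the kernel-ready design.]

```python
from collections import OrderedDict

class TwoQ:
    """Simplified 2Q: A1in (FIFO of new pages), Am (LRU of proven-hot
    pages), A1out (ghost FIFO remembering ids of pages evicted from A1in)."""
    def __init__(self, size, a1in_frac=0.10, a1out_frac=0.25):
        self.size = size
        self.a1in_max = max(1, int(size * a1in_frac))
        self.a1out_max = max(1, int(size * a1out_frac))
        self.a1in = OrderedDict()   # page id -> None, FIFO order
        self.am = OrderedDict()     # page id -> None, LRU order
        self.a1out = OrderedDict()  # ghost entries: ids only, no data
        self.hits = 0

    def access(self, x):
        if x in self.am:             # Am hit: move to MRU end, as in LRU
            self.am.move_to_end(x)
            self.hits += 1
        elif x in self.a1in:         # A1in hit: 2Q deliberately does nothing
            self.hits += 1
        elif x in self.a1out:        # ghost hit: page proved hot, load into Am
            del self.a1out[x]
            self._reclaim()
            self.am[x] = None
        else:                        # cold miss: page enters via the A1in FIFO
            self._reclaim()
            self.a1in[x] = None

    def _reclaim(self):
        if len(self.a1in) + len(self.am) < self.size:
            return                   # still free space, nothing to evict
        if len(self.a1in) > self.a1in_max:
            victim, _ = self.a1in.popitem(last=False)  # FIFO eviction
            self.a1out[victim] = None                  # remember its id only
            if len(self.a1out) > self.a1out_max:
                self.a1out.popitem(last=False)
        else:
            self.am.popitem(last=False)  # plain LRU eviction, no ghost entry

cache = TwoQ(size=8)
for page in [1, 2, 3, 1, 2, 3]:
    cache.access(page)
print(cache.hits)  # 3 - the second touch of each page hits in A1in
```

The ghost queue is what distinguishes 2Q from a simple FIFO-in-front-of-LRU: only pages that are re-requested *after* falling out of A1in earn a slot in Am.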
* Re: Buffer management - interesting idea
  2001-06-15 17:05 ` Ivan Schreter
@ 2001-06-15 18:50 ` Rik van Riel
  2001-06-15 23:51 ` Ivan Schreter
  1 sibling, 0 replies; 7+ messages in thread
From: Rik van Riel @ 2001-06-15 18:50 UTC (permalink / raw)
To: Ivan Schreter; +Cc: John Clemens, linux-kernel

On Fri, 15 Jun 2001, Ivan Schreter wrote:

> In 2Q, when a page is present in LRU queue, you move it to the front of
> LRU queue as usual. Otherwise, if it is in memory, it must be in A1 queue
> (the FIFO one), so you DON'T do anything. When the page is NOT in memory,
> you load it conditionally to Am LRU queue (if it is present in A1out
> queue) or to A1 queue (FIFO), if not.
>
> It gets interesting when you need to swap out a page from memory. When the
> size of A1 FIFO is greater than limit (e.g., 10% of buffer space), a page
> from A1 is swapped out (and put into A1out list), otherwise a page from Am
> LRU is swapped out (and NOT put into A1out, since it hasn't been accessed
> for long time).

This description has me horribly confused.  Do you have
any pointers to state diagrams of this thing? ;)

regards,

Rik
--
Executive summary of a recent Microsoft press release:

   "we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Buffer management - interesting idea
  2001-06-15 17:05 ` Ivan Schreter
  2001-06-15 18:50 ` Rik van Riel
@ 2001-06-15 23:51 ` Ivan Schreter
  1 sibling, 0 replies; 7+ messages in thread
From: Ivan Schreter @ 2001-06-15 23:51 UTC (permalink / raw)
To: Rik van Riel; +Cc: john, linux-kernel

Hello,

Please CC: replies to me, since I'm not subscribed to the list.

On Fri, 15 Jun 2001 15:50:33 -0300 (BRST) Rik van Riel <riel@conectiva.com.br> wrote:
> On Fri, 15 Jun 2001, Ivan Schreter wrote:
>> In 2Q, when a page is present in LRU queue, you move it to the front of
>> [...]
> This description has me horribly confused. Do you have
> any pointers to state diagrams of this thing? ;)

Well, I posted a URL where the complete paper about 2Q is available; here
it is again:

http://citeseer.nj.nec.com/63909.html

(click on "download PS or PDF" in the top right corner there).

In general, it is like LRU, but pages make it into the LRU only after going
through a FIFO. So a page which is requested only once (or a few times in a
row) will pass through this FIFO and be swapped out. When a page is
actively used, it will pass through the FIFO, and on a new request for this
page it will not be loaded into the FIFO, but into the LRU. Since the FIFO
is small relative to the LRU (10% or so), you don't waste buffer space for
a long time on pages used once (or a few times) which are not needed
anymore (like a big find, cat, dd, etc.).

The trick is how to determine whether a page should be loaded into the FIFO
or into the LRU at swap-in. And here comes another queue - they call it
A1out in the original paper. This queue (or cyclic buffer or whatever) is
another FIFO queue, which stores INDICES of physical pages which were
swapped out of the FIFO queue (and lived only shortly). When a page is
found in this A1out queue, it is put into the LRU (it is a "hot" page) and
will live longer in the LRU list. The A1out queue size is up to
experimentation; I used 50% of the memory buffer count and it performed
well.
And yes, look into the program I posted last time - in particular the
functions r2q_page(), which loads a page into the buffer, and
r2q_reclaim(), which swaps out a page to make space.

I was also doing some experiments with not letting hot pages expire out of
the FIFO queue (the USE_FASTAGE conditional), but they didn't bring any
reasonable improvement (a few tenths of a percent up or down from the
original performance), so it's probably not worth it.

BTW, this 2Q algorithm can be put to good use for a madvise() syscall
implementation, like this:

- NORMAL     - no change
- RANDOM     - no change
- SEQUENTIAL - load pages in FIFO and DON'T put them in A1out after expiry
- WILLNEED   - load pages directly in LRU
- DONTNEED   - move pages to FRONT of FIFO (or TAIL of LRU)

(see the BSD madvise() syscall, but I believe you know what I'm talking
about ;-)

BTW2, I'm quite happy that people care about new ideas :-) I would be happy
to hack the kernel full-time and implement them myself, but unfortunately I
need to make money out of something, so no time :-)

--
Ivan Schreter
is@zapwerk.com

^ permalink raw reply	[flat|nested] 7+ messages in thread
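[Editor's note: the scan resistance described above - one-time pages dying in the FIFO while the hot set survives in the LRU - can be demonstrated with a toy simulation. This is an editor's sketch in Python; make_lru, make_2q and the trace are invented for illustration and are not the attached C++ program.]

```python
from collections import OrderedDict

def make_lru(size):
    """Plain LRU cache; access() returns 1 on hit, 0 on miss."""
    cache = OrderedDict()
    def access(x):
        if x in cache:
            cache.move_to_end(x)
            return 1
        if len(cache) >= size:
            cache.popitem(last=False)
        cache[x] = None
        return 0
    return access

def make_2q(size, a1in_max, a1out_max):
    """Toy 2Q: A1in FIFO + Am LRU + A1out ghost queue of evicted ids."""
    a1in, am, a1out = OrderedDict(), OrderedDict(), OrderedDict()
    def reclaim():
        if len(a1in) + len(am) < size:
            return
        if len(a1in) > a1in_max:
            victim, _ = a1in.popitem(last=False)  # FIFO evict, keep ghost id
            a1out[victim] = None
            if len(a1out) > a1out_max:
                a1out.popitem(last=False)
        else:
            am.popitem(last=False)                # LRU evict, no ghost
    def access(x):
        if x in am:
            am.move_to_end(x)
            return 1
        if x in a1in:
            return 1
        if x in a1out:          # ghost hit: promote straight into Am
            del a1out[x]
            reclaim()
            am[x] = None
        else:                   # cold miss: enter via A1in
            reclaim()
            a1in[x] = None
        return 0
    return access

hot = [0, 1, 2, 3]
warm = []
for i in range(12):             # hot pages interleaved with cold fillers,
    warm += [hot[i % 4], 500 + i]  # so the hot set gets promoted into Am
steady = hot * 10
scan = [600 + i for i in range(100)]  # one-time sequential scan (find/cat/dd)

lru = make_lru(8)
q2 = make_2q(8, a1in_max=2, a1out_max=8)
for x in warm + steady + scan:
    lru(x)
    q2(x)

# First re-touch of each hot page after the scan:
lru_after = sum(lru(x) for x in hot)  # the scan flushed LRU completely
q2_after = sum(q2(x) for x in hot)    # the hot set survived in Am
print(lru_after, q2_after)
```

The scan pages only ever transit the small A1in FIFO, so the Am LRU (holding the hot set) is untouched - exactly the drop-behind effect Jeremy notes in the next message of the thread.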
* Re: Buffer management - interesting idea
  2001-06-06 11:39 Buffer management - interesting idea Ivan Schreter
  2001-06-15  9:07 ` Helge Hafting
  2001-06-15  9:41 ` Ivan Schreter
@ 2001-06-16 18:35 ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 7+ messages in thread
From: Jeremy Fitzhardinge @ 2001-06-16 18:35 UTC (permalink / raw)
To: Helge Hafting; +Cc: Ivan Schreter, linux-kernel

On 15 Jun 2001 11:07:20 +0200, Helge Hafting wrote:
> The "resistance to scanning" seemed interesting, maybe one-time
> activities like a "find" run or big cat/dd will have less impact with
> this.

It should also be good for streaming file use.  It gives a natural way
of detecting when you should be doing drop-behind (things fall out of
the fifo without ever making it into the LRU); doubly nice because it
works for both reads and writes without needing to treat them
differently.  It's also nice that it gives a natural interpretation to
the various madvise() flags.

	J

^ permalink raw reply	[flat|nested] 7+ messages in thread
end of thread, other threads:[~2001-06-16 18:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-06-06 11:39 Buffer management - interesting idea Ivan Schreter
2001-06-15  9:07 ` Helge Hafting
2001-06-15  9:41 ` Ivan Schreter
2001-06-15 17:05   ` Ivan Schreter
2001-06-15 18:50     ` Rik van Riel
2001-06-15 23:51     ` Ivan Schreter
2001-06-16 18:35 ` Jeremy Fitzhardinge