* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Andi Kleen @ 2003-08-29 11:08 UTC
To: Jamie Lokier; +Cc: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent.  The L1

Most x86 and probably most other modern CPUs have virtually
addressed L1 caches. It's just too slow to wait for the MMU for an
L1 access, which is really critical.

So such artifacts are expected.

> data cache is 64k.  (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).

On x86 L2 is usually physically tagged.

Mostly only ARM, MIPS et al. have virtually tagged L2.

-Andi

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Russell King @ 2003-08-29 11:17 UTC
To: Andi Kleen; +Cc: Jamie Lokier, linux-kernel

On Fri, Aug 29, 2003 at 01:08:51PM +0200, Andi Kleen wrote:
> Jamie Lokier <jamie@shareable.org> writes:
> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
>
> On x86 L2 is usually physically tagged.
>
> Mostly only ARM, MIPS et al. have virtually tagged L2.

Correction: ARM L1 is mostly VIVT.  L2 cache isn't mandated by the
architecture, and therefore generally doesn't exist.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-09-01  5:03 UTC
To: Andi Kleen; +Cc: linux-kernel

Andi Kleen wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent.  The L1
>
> Most x86 and probably most other modern CPUs have virtually
> addressed L1 caches. It's just too slow to wait for the MMU for an
> L1 access which is really critical.
>
> So such artifacts are expected

I hadn't thought so at first, because there's no artefact at all (not
even a small one) on my Celeron, but you're right.  They don't appear
on any Intels(*), but they do on all AMDs that I have results for.

(*) With the possible exception of one P4 that reports varying results.

> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
>
> On x86 L2 is usually physically tagged.

I'm speculating that L1 is physically tagged, and when there's a
virtual alias the CPU moves data from one L1 line to another.  L2 only
comes into it because the line transfer is slow enough that a
MESI-style transfer through L2 (as if another CPU or device requested
the line) would account for the slowness.

-- Jamie

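[Editorial sketch] The virtually indexed, physically tagged speculation above can be checked with a little arithmetic. The following is only an illustration, not anything posted in the thread: the 64k cache size comes from the original report, while the 2-way associativity and 64-byte line size used in the usage note are assumptions, and the names `cache_geometry`, `way_size`, `set_index` and `aliases_share_set` are hypothetical. With a virtually indexed cache, two virtual aliases of one physical page select the same set only when their separation is a multiple of the size of one way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a set-associative cache.  The values
   plugged in by a caller are assumptions, not measured facts.  */
struct cache_geometry
{
  size_t size;       /* Total capacity in bytes.  */
  size_t ways;       /* Associativity.  */
  size_t line_size;  /* Bytes per cache line.  */
};

/* Bytes covered by one way.  Addresses this far apart (or any multiple
   of it) select the same set.  */
static size_t
way_size (const struct cache_geometry * g)
{
  assert (g->ways != 0 && g->size % g->ways == 0);
  return g->size / g->ways;
}

/* Set selected by a virtual address in a virtually indexed cache.  */
static size_t
set_index (const struct cache_geometry * g, uintptr_t vaddr)
{
  return (vaddr % way_size (g)) / g->line_size;
}

/* Non-zero if two mappings of the same physical page, `separation'
   bytes apart in virtual address space, index the same set.  */
static int
aliases_share_set (const struct cache_geometry * g, size_t separation)
{
  return separation % way_size (g) == 0;
}
```

Under the assumed geometry `{ 64 * 1024, 2, 64 }`, `way_size` is 32768, so separations of 4096, 8192 and 16384 bytes put the two aliases in different sets while 32768 and above put them in the same set. That would be consistent with the 32k threshold reported for the Athlon, but it remains a model, not a confirmed explanation.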
* x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-08-29  5:35 UTC
To: linux-kernel

Dear All,

I'd appreciate it if folks would run the program below on various
machines, especially those whose caches aren't automatically coherent
at the hardware level.

It searches for the address multiple that an application can use to
get coherent multiple mappings of shared memory, with good performance.

I want this information for three reasons:

  1. To check it correctly detects archs which page fault for
     coherency or aren't coherent.

  2. To check the timing test is robust, both for 1. and for
     detecting archs where the hardware is coherent but slows down
     (see Athlon below).

  3. To check this is reliable enough to use at run time in an app.

I already got a surprise (to me): my Athlon MP is much slower
accessing multiple mappings which are within 32k of each other, than
mappings which are further apart, although it is coherent.  The L1
data cache is 64k.  (The explanation is easy: virtually indexed,
physically tagged cache moves data among cache lines, possibly via L2).

This suggests scope for improving x86 kernel performance in the areas
of kmap() and shared library / executable mappings, by good choice of
_virtual_ addresses.  This doesn't require a cache colouring page
allocator, so maybe it's a new avenue?

Anyway, please lots of people run the program and post the output +
/proc/cpuinfo.  Compile with optimisation, -O or -O2 is fine.  (You
can add -DHAVE_SYSV_SHM too if you like):

	gcc -o test test.c -O2
	time ./test
	cat /proc/cpuinfo

Thanks a lot :)
-- Jamie

==============================================================================

/* This code maps shared memory to multiple addresses and tests it
   for cache coherency and performance.
   Copyright (C) 1999, 2001, 2002, 2003  Jamie Lokier

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA  */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/signal.h>
#include <sys/mman.h>
#include <sys/time.h>

#if HAVE_SYSV_SHM
#include <sys/ipc.h>
#include <sys/shm.h>
#endif

//#include "pagealias.h"

/* Helpers to temporarily block all signals.  These are used for when a
   race condition might leave a temporary file that should have been
   deleted -- we do our best to prevent this possibility.  */

static void
block_signals (sigset_t * save_state)
{
  sigset_t all_signals;
  sigfillset (&all_signals);
  sigprocmask (SIG_BLOCK, &all_signals, save_state);
}

static void
unblock_signals (sigset_t * restore_state)
{
  sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0);
}

/* Open a new shared memory file, either using the POSIX.4 `shm_open'
   function, or using a regular temporary file in /tmp.  Immediately
   after opening the file, it is unlinked from the global namespace
   using `shm_unlink' or `unlink'.

   On success, the value returned is a file descriptor.  Otherwise, -1
   is returned and `errno' is set.

   The descriptor can be closed using simply `close'.  */

/* Note: `shm_open' requires link argument `-lposix4' on Suns.
   On GNU/Linux with Glibc, it requires `-lrt'.  Unfortunately, Glibc's
   -lrt insists on linking to pthreads, which we may not want to use
   because that enables thread locking overhead in other functions.  So
   we implement a direct method of opening shm on Linux.  */

/* If this is changed, change the size of `buffer' below too.  */
#if HAVE_SHM_OPEN
#define SHM_DIR_PREFIX "/"	/* `shm_open' arg needs "/" for portability.  */
#elif defined (__linux__)
#include <sys/statfs.h>
#define SHM_DIR_PREFIX "/dev/shm/"
#else
#undef SHM_DIR_PREFIX
#endif

static int
open_shared_memory_file (int use_tmp_file)
{
  char * ptr, buffer [19];
  int fd, i;
  unsigned long number;
  sigset_t save_signals;
  struct timeval tv;

#if !HAVE_SHM_OPEN && defined (__linux__)
  struct statfs sfs;
  if (!use_tmp_file
      && (statfs (SHM_DIR_PREFIX, &sfs) != 0
	  || sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */))
    {
      errno = ENOSYS;
      return -1;
    }
#endif

 loop:
  /* Print a randomised path name into `buffer'.  The string depends on
     the directory and whether we are using POSIX.4 shared memory or a
     regular temporary file.  RANDOM is a 5-digit, base-62
     representation of a pseudo-random number.  The string is used as a
     candidate in the search for an unused shared segment or file
     name.  */
#ifdef SHM_DIR_PREFIX
  strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-");
#else
  strcpy (buffer, "/tmp/shm-");
#endif
  ptr = buffer + strlen (buffer);
  gettimeofday (&tv, (struct timezone *) 0);
  number = (unsigned long) random ();
  number += (unsigned long) getpid ();
  number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec;
  for (i = 0; i < 5; i++)
    {
      /* Don't use character arithmetic, as not all systems are ASCII.  */
      *ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	[number % 62];
      number /= 62;
    }
  *ptr = '\0';

  /* Block signals between the open and unlink, to really minimise
     the chance of accidentally leaving an unwanted file around.
     */
  block_signals (&save_signals);
#if HAVE_SHM_OPEN
  if (!use_tmp_file)
    {
      fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	shm_unlink (buffer);
    }
  else
#endif /* HAVE_SHM_OPEN */
    {
      fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	unlink (buffer);
    }
  unblock_signals (&save_signals);

  /* If we failed due to a name collision or a signal, try again.  */
  if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR))
    goto loop;

  return fd;
}

/* Allocate a region of address space `size' bytes long, so that the
   region will not be allocated for any other purpose.  It is freed with
   `munmap'.

   Returns the mapped base address on success.  Otherwise, MAP_FAILED is
   returned and `errno' is set.  */

static size_t system_page_size;

#if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif
#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0
#endif
#ifndef MAP_FILE
#define MAP_FILE 0
#endif
#ifndef MAP_VARIABLE
#define MAP_VARIABLE 0
#endif
#ifndef MAP_FAILED
#define MAP_FAILED ((void *) -1)
#endif
#ifndef PROT_NONE
#define PROT_NONE PROT_READ
#endif

static void *
map_address_space (void * optional_address, size_t size, int access)
{
  void * addr;
#ifdef MAP_ANONYMOUS
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_ANONYMOUS
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)),
	       -1, (off_t) 0);
#else /* not defined MAP_ANONYMOUS */
  int save_errno, zero_fd = open ("/dev/zero", O_RDONLY);
  if (zero_fd == -1)
    return MAP_FAILED;
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_FILE
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)),
	       zero_fd, (off_t) 0);
  save_errno = errno;
  close (zero_fd);
  errno = save_errno;
#endif /* not defined MAP_ANONYMOUS */
  return addr;
}

/* Set up a page alias mapping using mmap() on POSIX shared memory or on
   a temporary regular file.
   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set.  */

static void *
page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file)
{
  void * base_addr, * addr;
  int fd, i, save_errno;
  struct stat st;

  fd = open_shared_memory_file (use_tmp_file);
  if (fd == -1)
    goto fail;

  /* First, resize the shared memory file to the desired size.  */
  if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size)
    goto close_fail;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map this
     with the aliased buffer.  */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto close_fail;

  /* Map the same shared memory repeatedly, at different addresses.  */
  for (i = 0; i < 2; i++)
    {
      addr = mmap ((char *) base_addr + (i ? separation : 0), size,
		   PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_FILE | MAP_FIXED,
		   fd, (off_t) 0);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen.  */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }

  if (close (fd) != 0)
    goto unmap_fail;

  /* Success!  */
  return base_addr;

  /* Failure.  */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 close_fail:
  save_errno = errno;
  close (fd);
  errno = save_errno;
 fail:
  return 0;
}

/* Set up a page alias mapping using SYSV IPC shared memory.

   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set.  */

#if HAVE_SYSV_SHM

static void *
page_alias_using_sysv_shm (size_t size, size_t separation)
{
  void * base_addr, * addr;
  sigset_t save_signals;
  int shmid, i, save_errno;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map this
     with the aliased buffer.
     */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Block signals between the shmget() and IPC_RMID, to minimise the
     chance of accidentally leaving an unwanted shared segment around.  */
  block_signals (&save_signals);

  shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
  if (shmid == -1)
    goto unmap_fail;

  /* Map the same shared memory repeatedly, at different addresses.  */
  for (i = 0; i < 2; i++)
    {
      /* `shmat' is tried twice.  The first time it can fail if the
	 local implementation of `shmat' refuses to map over a region
	 mapped with `mmap'.  In that case, we punch a hole using
	 `munmap' and do it again.

	 If the local `shmat' has this property, the `shmat' calls to
	 fixed addresses might collide with a concurrent thread which is
	 also doing mappings, and will fail.  At least it is a safe
	 failure.

	 On the other hand, if the local `shmat' can map over
	 already-mapped regions (in the same way that `mmap' does), it
	 is essential that we do actually use an already-mapped region,
	 so that collisions with a concurrent thread can't possibly
	 result in both of us grabbing the same address range with no
	 indication of error.  */
      addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
      if (addr == (void *) -1 && errno == EINVAL)
	{
	  munmap ((char *) base_addr + (i ? separation : 0), size);
	  addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
	}

      /* Check for errors.  */
      if (addr == (void *) -1)
	{
	  save_errno = errno;
	  if (i == 1)
	    shmdt (base_addr);
	  goto remove_shm_fail_se;
	}
      else if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `shmat' ignored the requested address!  */
	  if (i == 1)
	    shmdt (base_addr);
	  save_errno = EINVAL;
	  goto remove_shm_fail_se;
	}
    }

  if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0)
    goto remove_shm_fail;

  unblock_signals (&save_signals);

  /* Success!  */
  return base_addr;

  /* Failure.  */
 remove_shm_fail:
  save_errno = errno;
 remove_shm_fail_se:
  while (--i >= 0)
    shmdt ((char *) base_addr + (i ?
			     separation : 0));
  shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0);
  errno = save_errno;
 unmap_fail:
  save_errno = errno;
  unblock_signals (&save_signals);
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

#endif /* HAVE_SYSV_SHM */

/* Map a page-aliased ring buffer.  Shared memory of size `size' is
   mapped twice, with the difference between the two addresses being
   `separation', which must be at least `size'.  The total address range
   used is `separation + size' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `__page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set.  */

static void *
__page_alias_map (size_t size, size_t separation, int * method)
{
  void * addr;
  if (((size | separation) & (system_page_size - 1)) != 0 || size > separation)
    {
      /* `errno' values are positive; the kernel-style negative value
	 would not match EINVAL in callers.  */
      errno = EINVAL;
      return 0;
    }

  /* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file.  */
#ifdef SHM_DIR_PREFIX
  *method = 0;
  if ((addr = page_alias_using_mmap (size, separation, 0)) != 0)
    return addr;
#endif
#if HAVE_SYSV_SHM
  *method = 1;
  if ((addr = page_alias_using_sysv_shm (size, separation)) != 0)
    return addr;
#endif
  *method = 2;
  return page_alias_using_mmap (size, separation, 1);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `__page_alias_map'.  `address' is the base address, and `size' and
   `separation' are the arguments previously passed to
   `__page_alias_map'.  `method' is the value previously stored in
   *METHOD.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set.  */

static int
__page_alias_unmap (void * address, size_t size, size_t separation, int method)
{
#if HAVE_SYSV_SHM
  if (method == 1)
    {
      /* Cast to `char *': arithmetic on `void *' is a GCC extension.  */
      shmdt (address);
      shmdt ((char *) address + separation);
      if (separation > size)
	munmap ((char *) address + size, separation - size);
      return 0;
    }
#endif
  return munmap (address, separation + size);
}

/* Map a page-aliased ring buffer.
   `size' is the size of the buffer to create; it will be mapped twice
   to cover a total address range `size * 2' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set.  */

void *
page_alias_map (size_t size, int * method)
{
  return __page_alias_map (size, size, method);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `page_alias_map'.  `address' is the base address, and `size' is the
   size of the buffer (which is half of the total mapped address range).
   `method' is a value previously stored in *METHOD by `page_alias_map'.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set.  */

int
page_alias_unmap (void * address, size_t size, int method)
{
  return __page_alias_unmap (address, size, size, method);
}

/* Map some memory which is not aliased, for timing comparisons against
   aliased pages.  We use a combination of mappings similar to
   page_alias_*(), in case there are resource limitations which would
   prevent malloc() or a single mmap() working for the larger address
   range tests.  */

static void *
page_no_alias (size_t size, size_t separation)
{
  void * base_addr, * addr;
  int i, save_errno;

  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Map anonymous memory at the different addresses.  */
  for (i = 0; i < 2; i++)
    {
      addr = map_address_space ((char *) base_addr + (i ? separation : 0),
				size, 1);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen.  */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }

  /* Success!  */
  return base_addr;

  /* Failure.  */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

/* This should be a word size that the architecture can read and write
   fast in a single instruction.
   In principle, C's `int' is the natural word size, but in practice it
   isn't on 64-bit machines.  */

#define WORD long

/* These GCC-specific asm statements force values into registers, and
   also act as compiler memory barriers.  These are used to force a
   group of write/write/read instructions as close together as possible,
   to maximise the detection of store buffer conditions.  Despite being
   asm statements, these will work with any of GCC's target
   architectures, provided they have >= 4 registers.  */

#if __GNUC__ >= 3
#define __noinline __attribute__ ((__noinline__))
#else
#define __noinline
#endif

#ifdef __GNUC__
#define force_into_register(var) \
  __asm__ ("" : "=r" (var) : "0" (var) : "memory")
#define force_into_registers(var1, var2, var3, var4) \
  __asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \
	      : "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory")
#else
#define force_into_register(var) do {} while (0)
#define force_into_registers(var1, var2, var3, var4) do {} while (0)
#endif

/* This function tries to test whether a CPU snoops its store buffer for
   reads within a few instructions, and ignores virtual to physical
   address translations when doing that.  In principle a CPU might do
   this even if its L1 cache is physically tagged or indexed, although
   I have not seen such a system.  (A CPU which uses store buffer
   snooping and with an off-board MMU, which the CPU is unaware of,
   could have this property).

   It isn't possible to do this test perfectly; we do our best.  The
   `force_into_register' macros ensure that the write/write/read
   sequence is as compact as the compiler can make it.
   */

static WORD __noinline
test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2)
{
  register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2;
  register WORD __reg1 = 1, __reg2 = 0;
  force_into_registers (__reg1, __reg2, __regptr1, __regptr2);
  *__regptr1 = __reg1;
  *__regptr2 = __reg2;
  __reg1 = *__regptr1;
  force_into_register (__reg1);
  return __reg1;
}

/* This function tests whether writes to one page are seen in another
   page at a different virtual address, and whether they are nearly as
   fast as normal writes.

   The accesses are timed by the caller of this function.  Alternate
   writes go to alternate pages, so that if aliasing is implemented
   using page faults, it will clearly show up in the timings.  */

static int __noinline
test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops)
{
  WORD fail = 0;
  while (--timing_loops >= 0)
    fail |= test_store_buffer_snoop (ptr1, ptr2);
  return fail != 0;
}

/* This function tests L1 cache coherency without checking for store
   buffer snoop coherency.  To do this, we add delays after each store
   to allow the store buffer to drain.  The result of this function is
   not important: it is only used in a diagnostic message.  */

static int __noinline
test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2)
{
  static volatile WORD dummy;
  int i, j;
  WORD fail = 0;
  for (i = 0; i < 10; i++)
    {
      *ptr1 = 1;
      for (j = 0; j < 1000; j++)	/* Dummy volatile writes for delay.  */
	dummy = 0;
      *ptr2 = 0;
      for (j = 0; j < 1000; j++)	/* Dummy volatile writes for delay.  */
	dummy = 0;
      fail |= *ptr1;
    }
  return fail != 0;
}

/* Thoroughly test a pair of aliased pages with a fixed address
   separation, to see if they really behave like memory appearing at two
   locations, and efficiently.  We search through different values of
   `separation' searching for a suitable "cache colour" on this machine.
   */

static inline const char *
test_one_separation (size_t separation)
{
  void * buffers [2];
  long timings [3];
  int i, method, timing_loops = 64;

  /* We measure timings of 3 different tests, each 128 times to find
     the minimum.
	0: Writes and reads to aliased pages.
	1: Writes and reads to non-aliased pages, to compare with 0.
	2: Doing nothing, to measure the time for `gettimeofday' itself.

     The measurements are done in a mixed up order.  If we did 128
     measurements of type 0, then 128 of type 1, then 128 of type 2, the
     results could be misleading due to synchronisation with other
     processes occurring on the machine.  */

  /* A previously generated random shuffle of bit-pairs.  Each pair is a
     number from the set {0,1,2}.  Each number occurs exactly 128
     times.  */
  static const unsigned char pattern [96] =
    {
      0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81,
      0x58, 0x91, 0x18, 0x56, 0x12, 0x44, 0x64, 0x89,
      0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
      0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12,
      0xa9, 0xa6, 0x01, 0x99, 0x88, 0x80, 0x94, 0x20,
      0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
      0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44,
      0x01, 0x06, 0xa5, 0x19, 0x4a, 0x56, 0x28, 0x89,
      0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
      0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64,
      0x00, 0x95, 0x9a, 0x89, 0x48, 0x01, 0x58, 0x88,
      0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
    };

  buffers [0] = __page_alias_map (system_page_size, separation, &method);
  if (buffers [0] == 0)
    return "alias map failed";

  buffers [1] = page_no_alias (system_page_size, separation);
  if (buffers [1] == 0)
    {
      __page_alias_unmap (buffers [0], system_page_size, separation, method);
      return "non-alias map failed";
    }

 retry:
  timings [2] = timings [1] = timings [0] = LONG_MAX;

  for (i = 0; i < 384; i++)
    {
      struct timeval time_before, time_after;
      long time_delta;
      int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;

      volatile WORD * ptr1 = (volatile WORD *) buffers [which_test];
      volatile WORD * ptr2 = (volatile WORD
			      *) ((char *) ptr1 + separation);

      /* Test whether writes to one page appear immediately in the other,
	 and time how long the memory accesses take.  */
      gettimeofday (&time_before, (struct timezone *) 0);
      if (which_test < 2)
	fail = test_page_alias (ptr1, ptr2, timing_loops);
      gettimeofday (&time_after, (struct timezone *) 0);

      if (fail && which_test == 0)
	{
	  /* Test whether the failure is due to a store buffer bypass
	     which ignores virtual address translation.  */
	  int l1_fail = test_l1_only (ptr1, ptr2);
	  __page_alias_unmap (buffers [0], system_page_size,
			      separation, method);
	  munmap (buffers [1], separation + system_page_size);
	  return l1_fail ? "cache not coherent" : "store buffer not coherent";
	}

      time_delta = ((time_after.tv_usec - time_before.tv_usec)
		    + 1000000 * (time_after.tv_sec - time_before.tv_sec));

      /* Find the smallest time taken for each test.  Ignore negative
	 glitches due to Linux' tendency to jump the clock backwards.  */
      if (time_delta >= 0 && time_delta < timings [which_test])
	timings [which_test] = time_delta;
    }

  /* Remove the cost of `gettimeofday()' itself from measurements.  */
  timings [0] -= timings [2];
  timings [1] -= timings [2];

  /* Keep looping until at least one measurement becomes significant.  A
     very fast CPU will show measurements of zero microseconds for
     smaller values of `timing_loops'.  Also loop until the cost of
     `gettimeofday()' becomes insignificant.  When the program is run
     under `strace', the latter is large, and this is needed to
     stabilise the results.  */
  if (timings [0] <= 10 * (1 + timings [2])
      && timings [1] <= 10 * (1 + timings [2]))
    {
      timing_loops <<= 1;
      goto retry;
    }

  __page_alias_unmap (buffers [0], system_page_size, separation, method);
  munmap (buffers [1], separation + system_page_size);

  /* Reject page aliasing if it is much slower than accessing a
     single, definitely cached page directly.  */
  if (timings [0] > 2 * timings [1])
    return "too slow";

  /* Success!  Passed all tests for these parameters.
     */
  return 0;
}

size_t page_alias_smallest_size;

void
page_alias_init (void)
{
  size_t size;

#ifdef _SC_PAGESIZE
  system_page_size = sysconf (_SC_PAGESIZE);
#elif defined (_SC_PAGE_SIZE)
  system_page_size = sysconf (_SC_PAGE_SIZE);
#else
  system_page_size = getpagesize ();
#endif

  for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2)
    {
      const char * reason = test_one_separation (size);
      printf ("Test separation: %lu bytes: %s%s\n",
	      (unsigned long) size,
	      reason ? "FAIL - " : "pass", reason ? reason : "");

      /* This logic searches for the smallest _contiguous_ range of
	 page sizes for which `test_one_separation' passes.  */
      if (reason == 0 && page_alias_smallest_size == 0)
	page_alias_smallest_size = size;
      else if (reason != 0 && page_alias_smallest_size != 0)
	{
	  /* Fail, indicating that page-aliasing is not reliable,
	     because there's a maximum size.  We don't support that as
	     it seems quite unlikely given our model of cache
	     colouring.  */
	  page_alias_smallest_size = 0;
	  break;
	}
    }

  printf ("VM page alias coherency test: ");
  if (page_alias_smallest_size == 0)
    printf ("failed; will use copy buffers instead\n");
  else if (page_alias_smallest_size == system_page_size)
    printf ("all sizes passed\n");
  else
    printf ("minimum fast spacing: %lu (%lu page%s)\n",
	    (unsigned long) page_alias_smallest_size,
	    (unsigned long) (page_alias_smallest_size / system_page_size),
	    (page_alias_smallest_size == system_page_size) ? "" : "s");
}

//#ifdef TEST_PAGEALIAS
int
main ()
{
  page_alias_init ();
  return 0;
}
//#endif

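[Editorial sketch] The core trick in the program above, mapping one piece of shared memory at two virtual addresses a fixed distance apart, can be shown in far fewer lines. This is a minimal sketch only, not a replacement for the posted code: it uses just the regular-temporary-file strategy, does no signal blocking, and the helper name `map_twice` and the `/tmp/alias-XXXXXX` template are hypothetical. It assumes a POSIX system with `mkstemp` and `MAP_ANONYMOUS`.

```c
#define _DEFAULT_SOURCE		/* For MAP_ANONYMOUS and mkstemp on glibc.  */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one unlinked temporary file of `size' bytes
   at BASE and again at BASE + `separation'.  Returns BASE, or NULL on
   failure.  `size' and `separation' must be multiples of the page size.  */
static void *
map_twice (size_t size, size_t separation)
{
  char path [] = "/tmp/alias-XXXXXX";
  char * base;
  int fd, i;

  assert (size != 0 && size <= separation);

  fd = mkstemp (path);
  if (fd == -1)
    return NULL;
  unlink (path);		/* The open descriptor keeps the file alive.  */
  if (ftruncate (fd, (off_t) size) != 0)
    {
      close (fd);
      return NULL;
    }

  /* Reserve contiguous address space, then map over it.  */
  base = mmap (NULL, separation + size, PROT_NONE,
	       MAP_PRIVATE | MAP_ANONYMOUS, -1, (off_t) 0);
  if (base == MAP_FAILED)
    {
      close (fd);
      return NULL;
    }

  for (i = 0; i < 2; i++)
    {
      void * want = base + (i ? separation : 0);
      if (mmap (want, size, PROT_READ | PROT_WRITE,
		MAP_SHARED | MAP_FIXED, fd, (off_t) 0) != want)
	{
	  munmap (base, separation + size);
	  close (fd);
	  return NULL;
	}
    }

  close (fd);
  return base;
}
```

On hardware with coherent aliases, a store through the pointer returned by `map_twice` should be readable at that address plus `separation`, and vice versa. The posted program goes further by also timing those accesses and by probing the store buffer separately from the L1 cache.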
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: J.A. Magallon @ 2003-08-29 10:03 UTC
To: Jamie Lokier; +Cc: linux-kernel

On 08.29, Jamie Lokier wrote:
> Dear All,
[...]
>
> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent.  The L1
> data cache is 64k.  (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).
>

Sorry if this is a stupid question, but have you heard about 64K-aliasing?
We have seen it in P3/P4; I do not know if Athlons also suffer from it.
In short, x86 is crap. It slows like a dog when accessing two memory
positions separated by 2^n (the address decoder has two 16-bit adders
instead of one 32-bit adder, the cache is 16-bit tagged, etc.)

--
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Alan Cox @ 2003-08-29 10:36 UTC
To: J.A. Magallon; +Cc: Jamie Lokier, Linux Kernel Mailing List

On Fri, 2003-08-29 at 11:03, J.A. Magallon wrote:
> Sorry if this is a stupid question, but have you heard about 64K-aliasing ?
> We have seen it in P3/P4, do not know if Athlons also suffer it.
> In short, x86 is crap. It slows like a dog when accessing two memory
> positions sparated by 2^n (address decoder has two 16 bits adders, instead
> of 1 32 bits..., cache is 16 bit tagged, etc...)

Pretty much all processors are bad at handling memory accesses on the
same alignment within powers of two. That's one of the reasons for slab
and for things like the old kernel code putting skb structs at the end
of the skbuff data.

Grab a copy of "Unix systems for modern architectures".

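[Editorial sketch] The power-of-two effect described above is easy to see by counting cache sets. The following is an illustration only: the geometry (64-byte lines, 512 sets) is an assumption rather than a figure from the thread, the helper `distinct_sets` is hypothetical, and the per-object shift is merely in the spirit of slab colouring, not the kernel's actual allocator logic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cache geometry, for illustration only.  */
enum { LINE_SIZE = 64, NUM_SETS = 512 };

/* Count how many distinct cache sets are selected by the first byte of
   `count' objects placed `stride' bytes apart, where object I is also
   shifted by I * `colour_step' bytes.  A `colour_step' of zero models
   plain power-of-two placement.  */
static size_t
distinct_sets (size_t stride, size_t count, size_t colour_step)
{
  unsigned char seen [NUM_SETS];
  size_t i, n = 0;

  /* Colour offsets are whole cache lines.  */
  assert (colour_step % LINE_SIZE == 0);

  memset (seen, 0, sizeof seen);
  for (i = 0; i < count; i++)
    {
      size_t addr = i * stride + i * colour_step;
      size_t set = (addr / LINE_SIZE) % NUM_SETS;
      if (!seen [set])
	{
	  seen [set] = 1;
	  n++;
	}
    }
  return n;
}
```

With these assumed numbers, sixteen objects at a 32768-byte stride all select a single set, whereas shifting each successive object by one extra line spreads them over sixteen sets. Sixteen objects at a 4096-byte stride select only eight sets.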
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-09-01  4:49 UTC
To: J.A. Magallon; +Cc: linux-kernel

J.A. Magallon wrote:
> On 08.29, Jamie Lokier wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent.  The L1
> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
> >
>
> Sorry if this is a stupid question, but have you heard about 64K-aliasing ?
> We have seen it in P3/P4, do not know if Athlons also suffer it.
> In short, x86 is crap. It slows like a dog when accessing two memory
> positions sparated by 2^n (address decoder has two 16 bits adders, instead
> of 1 32 bits..., cache is 16 bit tagged, etc...)

I don't know what you mean.  This test doesn't observe any gross
timing effect at 64K.  I have just tried it on a Celeron Coppermine
printing more detailed numbers, and I don't notice anything at all.

So, what exactly do you mean?  What kind of code shows the effect you
are talking about?

Thanks,
-- Jamie

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier 2003-08-29 10:03 ` J.A. Magallon @ 2003-08-29 10:04 ` Sergey S. Kostyliov 2003-08-29 10:15 ` J.A. Magallon ` (20 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Sergey S. Kostyliov @ 2003-08-29 10:04 UTC (permalink / raw) To: Jamie Lokier, linux-kernel Hi Jamie, On Friday 29 August 2003 09:35, Jamie Lokier wrote: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. rathamahata@test rathamahata $ gcc -march=athlon-xp -mcpu=athlon-xp -fomit-frame-pointer -O2 -o test test.c rathamahata@test rathamahata $ time ./test Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) real 0m0.097s user 0m0.091s sys 0m0.006s rathamahata@test rathamahata $ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 8 model name : AMD Athlon(tm) MP 2200+ stepping : 0 cpu MHz : 1800.967 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow bogomips : 3538.94 processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 8 model name : AMD Athlon(tm) Processor stepping : 0 cpu MHz : 1800.967 cache size : 256 KB fdiv_bug : no 
hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow bogomips : 3596.28 -- Best regards, Sergey S. Kostyliov <rathamahata@php4.ru> Public PGP key: http://sysadminday.org.ru/rathamahata.asc ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier 2003-08-29 10:03 ` J.A. Magallon 2003-08-29 10:04 ` Sergey S. Kostyliov @ 2003-08-29 10:15 ` J.A. Magallon 2003-08-29 10:21 ` J.A. Magallon ` (19 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: J.A. Magallon @ 2003-08-29 10:15 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On 08.29, Jamie Lokier wrote: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. > Uh? Are my PIIs that good? werewolf:~> gcc -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c werewolf:~> vm-test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed werewolf:~> gcc -DHAVE_SYSV_SHM -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c werewolf:~> vm-test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed werewolf:~> cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 5 model name : Pentium II 
(Deschutes) stepping : 2 cpu MHz : 400.915 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 799.53 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 5 model name : Pentium II (Deschutes) stepping : 2 cpu MHz : 400.915 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 801.17 -- J.A. Magallon <jamagallon@able.es> \ Software is like sex: werewolf.able.es \ It's better when it's free Mandrake Linux release 9.2 (Cooker) for i586 Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk)) ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (2 preceding siblings ...) 2003-08-29 10:15 ` J.A. Magallon @ 2003-08-29 10:21 ` J.A. Magallon 2003-08-29 10:34 ` CaT ` (18 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: J.A. Magallon @ 2003-08-29 10:21 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On 08.29, Jamie Lokier wrote: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. > Dual P4 Xeon annwn:~> gcc -march=pentium4 -O2 -fomit-frame-pointer -o vm-test vm-test.c annwn:~> vm-test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed annwn:~> gcc -DHAVE_SYSV_SHM -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c annwn:~> vm-test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed annwn:~> cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) XEON(TM) CPU 1.80GHz stepping : 4 cpu MHz : 
1784.328 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 3552.05 processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) XEON(TM) CPU 1.80GHz stepping : 4 cpu MHz : 1784.328 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 3565.15 -- J.A. Magallon <jamagallon@able.es> \ Software is like sex: werewolf.able.es \ It's better when it's free Mandrake Linux release 9.2 (Cooker) for i586 Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk)) ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (3 preceding siblings ...) 2003-08-29 10:21 ` J.A. Magallon @ 2003-08-29 10:34 ` CaT 2003-08-29 10:37 ` CaT ` (17 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: CaT @ 2003-08-29 10:34 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote: > gcc -o test test.c -O2 > time ./test > cat /proc/cpuinfo 16 [20:33:33] hogarth@theirongiant:/home/hogarth>> time ./coherencytest Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.206s user 0m0.135s sys 0m0.027s 16 [20:33:44] hogarth@theirongiant:/home/hogarth>> cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 8 model name : Pentium III (Coppermine) stepping : 3 cpu MHz : 701.641 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1388.54 -- "How can I not love the Americans? They helped me with a flat tire the other day," he said. - http://tinyurl.com/h6fo ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (4 preceding siblings ...) 2003-08-29 10:34 ` CaT @ 2003-08-29 10:37 ` CaT 2003-08-29 10:49 ` Mikael Pettersson ` (16 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: CaT @ 2003-08-29 10:37 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote: > gcc -o test test.c -O2 > time ./test > cat /proc/cpuinfo Forgot about this one. :/ $ time ./coherencytest Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) real 0m0.543s user 0m0.230s sys 0m0.020s $ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 5 model : 8 model name : AMD-K6(tm) 3D processor stepping : 12 cpu MHz : 300.691 cache size : 64 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr mce cx8 pge mmx syscall 3dnow k6_mtrr bogomips : 599.65 -- "How can I not love the Americans? They helped me with a flat tire the other day," he said. - http://tinyurl.com/h6fo ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (5 preceding siblings ...) 2003-08-29 10:37 ` CaT @ 2003-08-29 10:49 ` Mikael Pettersson 2003-08-29 11:41 ` Gianni Tedesco ` (15 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Mikael Pettersson @ 2003-08-29 10:49 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel Jamie Lokier writes: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. From a dual Opteron 244 box: Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) 0.08user 0.01system 0:00.08elapsed 101%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (131major+38minor)pagefaults 0swaps processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 5 model name : AMD Opteron(tm) Processor 244 stepping : 1 cpu MHz : 1791.569 cache size : 1024 KB fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow bogomips : 3565.15 TLB size : 1088 4K pages clflush size : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts ttp processor : 1 vendor_id : AuthenticAMD cpu family : 15 model : 5 model name : AMD Opteron(tm) Processor 244 stepping : 1 cpu MHz : 1791.569 cache size : 1024 KB fpu : yes 
fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow bogomips : 3578.26 TLB size : 1088 4K pages clflush size : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts ttp ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (6 preceding siblings ...) 2003-08-29 10:49 ` Mikael Pettersson @ 2003-08-29 11:41 ` Gianni Tedesco 2003-08-29 11:51 ` James Morris ` (14 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Gianni Tedesco @ 2003-08-29 11:41 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, 2003-08-29 at 06:35, Jamie Lokier wrote: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. PPC (G4). Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed cpu : 7455, altivec supported clock : 667MHz revision : 2.1 (pvr 8001 0201) bogomips : 665.19 machine : PowerBook3,4 motherboard : PowerBook3,4 MacRISC2 MacRISC Power Macintosh board revision : 00000000 detected as : 73 (PowerBook Titanium III) pmac flags : 0000000b L2 cache : 256K unified memory : 512MB pmac-generation : NewWorld -- // Gianni Tedesco (gianni at scaramanga dot co dot uk) lynx --source www.scaramanga.co.uk/gianni-at-ecsc.asc | gpg --import 8646BE7D: 6D9F 2287 870E A2C9 8F60 3A3C 91B5 7669 8646 BE7D ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (7 preceding siblings ...) 2003-08-29 11:41 ` Gianni Tedesco @ 2003-08-29 11:51 ` James Morris 2003-08-29 15:41 ` Larry McVoy ` (13 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: James Morris @ 2003-08-29 11:51 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel Here's the result for sparc64 (Ultrasparc II): $ gcc -o test test.c -O2 $ time ./test Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (2 pages) real 0m0.194s user 0m0.160s sys 0m0.040s $ gcc -o test test.c -O2 -DHAVE_SYSV_SHM $ time ./test Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (2 pages) real 0m0.162s user 0m0.140s sys 0m0.020s $ cat /proc/cpuinfo cpu : TI UltraSparc II (BlackBird) fpu : UltraSparc II integrated FPU promlib : Version 3 Revision 23 prom : 3.23.1 type : sun4u ncpus probed : 2 ncpus active : 2 Cpu0Bogo : 591.46 Cpu0ClkTck : 0000000011a4f2ed Cpu2Bogo : 591.46 Cpu2ClkTck : 0000000011a4f2ed MMU Type : Spitfire State: CPU0: online CPU2: online -- 
James Morris <jmorris@intercode.com.au> ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (8 preceding siblings ...) 2003-08-29 11:51 ` James Morris @ 2003-08-29 15:41 ` Larry McVoy 2003-08-29 23:05 ` Mike Fedyk 2003-09-01 5:44 ` Jamie Lokier 2003-08-29 15:47 ` Herbert Poetzl ` (12 subsequent siblings) 22 siblings, 2 replies; 106+ messages in thread From: Larry McVoy @ 2003-08-29 15:41 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote: > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC If you care, I also have freebsd (v2, v3, v4), netbsd 1.5, openbsd 3.0 (all bsd systems are x86, mostly celerons), hpux 10.20, sco, solaris, solaris/x86, Irix, MacOS X, AIX, Tru64 and probably some others. ====== alpha.bitmover.com ====== Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux alpha.bitmover.com 2.4.21-pre5 #2 Thu Mar 20 07:54:03 PST 2003 alpha unknown cpu : Alpha cpu model : EV56 cpu variation : 7 cpu revision : 0 cpu serial number : system type : EB164 system variation : PC164 system revision : 0 system serial number : cycle frequency [Hz] : 500000000 timer frequency [Hz] : 1024.00 page size [bytes] : 8192 phys. address bits : 40 max. addr. 
space # : 127 BogoMIPS : 992.88 kernel unaligned acc : 0 (pc=0,va=0) user unaligned acc : 0 (pc=0,va=0) platform string : Digital AlphaPC 164 500 MHz cpus detected : 1 ====== ia64.bitmover.com ====== Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux ia64.bitmover.com 2.4.9-18smp #1 SMP Tue Dec 11 12:59:00 EST 2001 ia64 unknown processor : 0 vendor : GenuineIntel arch : IA-64 family : Itanium model : 0 revision : 7 archrev : 0 features : standard cpu number : 0 cpu regs : 4 cpu MHz : 799.486992 itc MHz : 799.486992 BogoMIPS : 796.91 processor : 1 vendor : GenuineIntel arch : IA-64 family : Itanium model : 0 revision : 7 archrev : 0 features : standard cpu number : 0 cpu regs : 4 cpu MHz : 799.486992 itc MHz : 799.486992 BogoMIPS : 796.91 ====== mips.bitmover.com ====== Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) Linux mips 2.4.18-r4k-ip22 #1 Sun Jun 23 15:30:50 CEST 2002 mips unknown system type : SGI Indy processor : 0 cpu model : R4000SC V6.0 FPU V0.0 BogoMIPS : 86.83 byteorder : big endian wait instruction : no microsecond timers : yes tlb_entries : 48 extra 
interrupt vector : no hardware watchpoint : yes VCED exceptions : 2955114 VCEI exceptions : 0 ====== netwinder.bitmover.com ====== Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: FAIL - cache not coherent Test separation: 32768 bytes: FAIL - cache not coherent Test separation: 65536 bytes: FAIL - cache not coherent Test separation: 131072 bytes: FAIL - cache not coherent Test separation: 262144 bytes: FAIL - cache not coherent Test separation: 524288 bytes: FAIL - cache not coherent Test separation: 1048576 bytes: FAIL - cache not coherent Test separation: 2097152 bytes: FAIL - cache not coherent Test separation: 4194304 bytes: FAIL - cache not coherent Test separation: 8388608 bytes: FAIL - cache not coherent Test separation: 16777216 bytes: FAIL - cache not coherent VM page alias coherency test: failed; will use copy buffers instead Linux netwinder 2.2.12-19991020 #1 Wed Oct 20 13:09:07 EDT 1999 armv4l unknown Processor : Intel sa110 rev 3 BogoMips : 262.14 Hardware : Rebel-NetWinder Serial # : 3464 Revision : 52ff ====== parisc.bitmover.com ====== Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: FAIL - cache not coherent Test separation: 32768 bytes: FAIL - cache not coherent Test separation: 65536 bytes: FAIL - cache not coherent Test separation: 131072 bytes: FAIL - cache not coherent Test separation: 262144 bytes: FAIL - store buffer not coherent Test separation: 524288 bytes: FAIL - store buffer not coherent Test separation: 1048576 bytes: FAIL - store buffer not coherent Test separation: 2097152 bytes: FAIL - store buffer not coherent Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 4194304 (1024 pages) Linux parisc 2.4.17-64 #1 Sat Mar 16 17:31:44 MST 2002 
parisc64 unknown processor : 0 cpu family : PA-RISC 2.0 cpu : PA8600 (PCX-W+) cpu MHz : 550.000000 model : 9000/800/A500-5X model name : Crescendo 550 hversion : 0x00005d50 sversion : 0x00000491 I-cache : 512 KB D-cache : 1024 KB (WB) ITLB entries : 160 DTLB entries : 160 - shared with ITLB bogomips : 1097.72 software id : 580790518 ====== ppc.bitmover.com ====== Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux ppc.bitmover.com 2.4.6-pre2 #2 Sun Jun 10 20:21:17 PDT 2001 ppc unknown processor : 0 cpu : 750 temperature : 0 C clock : 333MHz revision : 2.2 bogomips : 665.69 zero pages : total: 0 (0Kb) current: 0 (0Kb) hits: 0/0 (0%) machine : iMac,1 motherboard : iMac MacRISC Power Macintosh L2 cache : 512K unified memory : 160MB pmac-generation : NewWorld ====== qube.bitmover.com ====== Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) 0.31user 0.10system 0:00.40elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (116major+34minor)pagefaults 0swaps Linux qube.bitmover.com 
2.0.34 #1 Thu Jan 28 03:03:03 PST 1999 mips unknown cpu : MIPS cpu model : Nevada V10.0 system type : Cobalt Microserver 27 BogoMIPS : 249.86 byteorder : little endian unaligned accesses : 16 wait instruction : yes microsecond timers : yes extra interrupt vector : yes hardware watchpoint : no ====== redhat71.bitmover.com ====== Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux redhat71.bitmover.com 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : Celeron (Mendocino) stepping : 5 cpu MHz : 467.739 cache size : 128 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 933.88 ====== sparc.bitmover.com ====== Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (2 pages) 0.29user 0.02system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (107major+36minor)pagefaults 0swaps Linux sparc.bitmover.com 2.2.18 
#2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown cpu : TI UltraSparc IIi fpu : UltraSparc IIi integrated FPU promlib : Version 3 Revision 11 prom : 3.11.12 type : sun4u ncpus probed : 1 ncpus active : 1 BogoMips : 539.03 MMU Type : Spitfire -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 15:41 ` Larry McVoy @ 2003-08-29 23:05 ` Mike Fedyk 2003-08-31 5:10 ` David S. Miller 2003-09-01 5:44 ` Jamie Lokier 1 sibling, 1 reply; 106+ messages in thread From: Mike Fedyk @ 2003-08-29 23:05 UTC (permalink / raw) To: Larry McVoy; +Cc: Jamie Lokier, linux-kernel On Fri, Aug 29, 2003 at 08:41:01AM -0700, Larry McVoy wrote: > ====== sparc.bitmover.com ====== > Test separation: 8192 bytes: FAIL - cache not coherent > VM page alias coherency test: minimum fast spacing: 16384 (2 pages) > 0.29user 0.02system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (107major+36minor)pagefaults 0swaps > Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown > cpu : TI UltraSparc IIi > fpu : UltraSparc IIi integrated FPU > promlib : Version 3 Revision 11 > prom : 3.11.12 > type : sun4u > ncpus probed : 1 > ncpus active : 1 > BogoMips : 539.03 > MMU Type : Spitfire Does this mean that userspace has to take into consideration that the cache isn't coherent for adjacent small memory accesses on sparc? What could happen if it doesn't, or does it need to at all? ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 23:05 ` Mike Fedyk @ 2003-08-31 5:10 ` David S. Miller 2003-08-31 22:49 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: David S. Miller @ 2003-08-31 5:10 UTC (permalink / raw) To: Mike Fedyk; +Cc: lm, jamie, linux-kernel On Fri, 29 Aug 2003 16:05:21 -0700 Mike Fedyk <mfedyk@matchmail.com> wrote: > Does this mean that userspace has to take into consideration that the cache isn't > coherent for adjacent small memory accesses on sparc? What could happen if > it doesn't, or does it need to at all? For shared memory, we enforce the correct mapping alignment so that coherency issues don't crop up. How does this program work? I haven't taken a close look at it. Does it use MAP_SHARED or IPC shm? ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-31 5:10 ` David S. Miller @ 2003-08-31 22:49 ` Jamie Lokier 2003-09-01 5:31 ` David S. Miller 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-08-31 22:49 UTC (permalink / raw) To: David S. Miller; +Cc: Mike Fedyk, lm, linux-kernel David S. Miller wrote: > On Fri, 29 Aug 2003 16:05:21 -0700 > Mike Fedyk <mfedyk@matchmail.com> wrote: > > > Does this mean that userspace has to take into consideration that the cache isn't > > coherent for adjacent small memory accesses on sparc? What could happen if > > it doesn't, or does it need to at all? > > For shared memory, we enforce the correct mapping alignment > so that coherency issues don't crop up. > > How does this program work? I haven't taken a close look > at it. Does it use MAP_SHARED or IPC shm? It uses POSIX shared memory and (necessarily) MAP_SHARED, which doesn't constrain the mapping alignment. I had wondered if some kernels used page faults to maintain coherence between multiple shared mappings of the same file. It's one of the things the program checks, and I have seen it mentioned on l-k, which made me think it might be implemented. None of the results for any architecture show it, though. If userspace does create multiple shared mappings at non-coherent offsets, what is the recommended method for switching between accessing one page (or page cluster?) and accessing the other? Is it msync(), a special system call to flush parts of the data cache, a machine instruction, or something else? Thanks, -- Jamie PS. The program has code to try IPC shm instead. Change "#ifdef SHM_DIR_PREFIX" in __page_alias_map to "#if 0", and add -DHAVE_SYSV_SHM to the GCC command line. It should fail the same test sizes with a different message. ^ permalink raw reply [flat|nested] 106+ messages in thread
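A rough sketch of the kind of check described in the message above, assuming nothing about how the actual test program is written: map the same page twice with MAP_SHARED at a chosen separation, store through one mapping and load through the other. It swaps in an unlinked temporary file for POSIX shared memory, and the function name alias_store_visible and the /tmp path are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* MAP_ANONYMOUS, mkstemp under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one file page at two virtual addresses sep bytes apart and check
 * that a store through the first mapping is seen by a load through the
 * second.  Returns 1 if every store was visible, 0 if one was not, and
 * -1 if the mappings could not be set up (for example because the kernel
 * refuses a misaligned MAP_FIXED|MAP_SHARED mapping).
 */
static int alias_store_visible(size_t sep)
{
    long page = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/alias-sketch-XXXXXX";   /* hypothetical scratch file */
    unsigned char *region;
    volatile unsigned char *a, *b;
    int fd, i, ok = 1;

    assert(page > 0 && sep >= (size_t)page && sep % (size_t)page == 0);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file disappears when fd is closed */
    if (ftruncate(fd, page) < 0) {
        close(fd);
        return -1;
    }

    /* Reserve address space so the MAP_FIXED mappings cannot clobber anything. */
    region = mmap(NULL, sep + (size_t)page, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        close(fd);
        return -1;
    }

    a = mmap(region, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
    b = mmap(region + sep, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        munmap(region, sep + (size_t)page);
        return -1;
    }

    /* Store through one alias, load through the other. */
    for (i = 1; i <= 100 && ok; i++) {
        a[0] = (unsigned char)i;
        if (b[0] != (unsigned char)i)
            ok = 0;
    }

    munmap(region, sep + (size_t)page);
    return ok;
}
```

Calling alias_store_visible(4096), alias_store_visible(8192) and so on walks the same separations as the results posted in this thread. The sketch only checks visibility; reproducing the "too slow" results would additionally need a timing loop around the store/load pair.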
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: David S. Miller @ 2003-09-01  5:31 UTC
To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Sun, 31 Aug 2003 23:49:37 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> doesn't constrain the mapping alignment.

That's wrong.  If a platform needs to, it should properly
align the mapping when MAP_SHARED is used on a file.

If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
that when we're mmap()'ing a file and MAP_SHARED is specified,
we align things to SHMLBA.

If userspace purposefully violates this alignment attempt,
then it's at its own peril to keep the mappings coherent,
there is simply nothing the kernel should be doing to help
out that case.

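The alignment rule described in this message can be illustrated from the userspace side. The following is a minimal sketch, not the kernel's check: it assumes a virtually indexed cache in which two mappings of the same file offset avoid aliasing when their virtual addresses differ by a multiple of SHMLBA. The helper name `same_alias` and the `shmlba` parameter are hypothetical; the value is passed in because, as noted later in the thread, it is not a compile-time constant on every Sparc.

```c
#include <stdint.h>

/* Hypothetical helper (not from the thread or the kernel).
 * Returns 1 when two virtual addresses that map the same file offset
 * agree in all bits below `shmlba`, i.e. they differ by a multiple of
 * `shmlba`.  `shmlba` is assumed to be a power of two. */
static int same_alias(uintptr_t a, uintptr_t b, uintptr_t shmlba)
{
    return ((a ^ b) & (shmlba - 1)) == 0;
}
```

With a 16K alignment, for example, addresses 16K apart share an alias while addresses 8K apart do not.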
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-09-01  6:42 UTC
To: David S. Miller; +Cc: mfedyk, lm, linux-kernel

David S. Miller wrote:
> On Sun, 31 Aug 2003 23:49:37 +0100
> Jamie Lokier <jamie@shareable.org> wrote:
>
> > It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> > doesn't constrain the mapping alignment.
>
> That's wrong.  If a platform needs to, it should properly
> align the mapping when MAP_SHARED is used on a file.
>
> If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
> that when we're mmap()'ing a file and MAP_SHARED is specified,
> we align things to SHMLBA.

Then you have a bug in the Sparc code.  It looks like it should return
-EINVAL when a misaligned mapping is used with MAP_FIXED|MAP_SHARED,
but the test program is clearly getting mappings that aren't aligned
to SHMLBA.

> If userspace purposefully violates this alignment attempt,
> then it's at its own peril to keep the mappings coherent,
> there is simply nothing the kernel should be doing to help
> out that case.

I understand that userspace needs to keep it coherent, or map to a
multiple of SHMLBA.  I don't mind whether the kernel constrains the
mapping address or not, with a slight preference for userspace
flexibility.

Thus I have three Sparc-specific questions:

   1. How does userspace find out the value of SHMLBA?
      On Sparc, it is not a compile-time constant.

   2. Is flushing part of the data cache something I can do from
      userspace?  (I'll figure out the exact machine instructions
      myself if I need to do this, but it'd be nice to know if
      it's possible before I have a go).

   3. Is there a kernel bug on Sparc, because the test program
      is either getting mappings that aren't aligned to run time
      SHMLBA, or the kernel's run time SHMLBA value is not correct.

Thanks,
-- Jamie

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: David S. Miller @ 2003-09-01  7:06 UTC
To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Mon, 1 Sep 2003 07:42:31 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> David S. Miller wrote:
> > On Sun, 31 Aug 2003 23:49:37 +0100
> > Jamie Lokier <jamie@shareable.org> wrote:
> >
> > > It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> > > doesn't constrain the mapping alignment.
> >
> > That's wrong.  If a platform needs to, it should properly
> > align the mapping when MAP_SHARED is used on a file.
> >
> > If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
> > that when we're mmap()'ing a file and MAP_SHARED is specified,
> > we align things to SHMLBA.
>
> Then you have a bug in the Sparc code.  It looks like it should return
> -EINVAL when a misaligned mapping is used with MAP_FIXED|MAP_SHARED,
> but the test program is clearly getting mappings that aren't aligned
> to SHMLBA.

I disagree, MAP_FIXED means "I know what I am doing, don't override
this unless the mapping area is not available in my address space."
You should never specify MAP_FIXED unless you _REALLY_ know what you
are doing.

> Thus I have three Sparc-specific questions:
>
>    1. How does userspace find out the value of SHMLBA?
>       On Sparc, it is not a compile-time constant.

Don't specify MAP_FIXED for MAP_SHARED mapping if you want
proper coherency, that's my answer for this one.

>    2. Is flushing part of the data cache something I can do from
>       userspace?  (I'll figure out the exact machine instructions
>       myself if I need to do this, but it'd be nice to know if
>       it's possible before I have a go).

There is no efficient way to do this from userspace, only the
kernel has access to the more efficient cache flushing instructions.
You'd need to flush via loads to displace the aliasing cache lines.

>    3. Is there a kernel bug on Sparc, because the test program
>       is either getting mappings that aren't aligned to run time
>       SHMLBA, or the kernel's run time SHMLBA value is not correct.

No, the user is allowed to hang himself with MAP_FIXED.
The bug is in your code :)

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-09-01  8:29 UTC
To: David S. Miller; +Cc: mfedyk, lm, linux-kernel

David S. Miller wrote:
> I disagree, MAP_FIXED means "I know what I am doing, don't override
> this unless the mapping area is not available in my address space."
> You should never specify MAP_FIXED unless you _REALLY_ know what you
> are doing.

So explain this from the Sparc architecture code:

	if (flags & MAP_FIXED) {
		/* We do not accept a shared mapping if it would violate
		 * cache aliasing constraints.
		 */
		if ((flags & MAP_SHARED) && (addr & (SHMLBA - 1)))
			return -EINVAL;
		return addr;
	}

Ok, I'll explain it :)  At one time, the code did what the comment says,
but nowadays linux/mm/mmap.c doesn't call arch_get_unmapped_area() for
MAP_FIXED, so the above code is redundant and misleading.  It already
misled me, so please remove it.  sparc and sparc64 both have it.

> > Thus I have three Sparc-specific questions:
> >
> >    1. How does userspace find out the value of SHMLBA?
> >       On Sparc, it is not a compile-time constant.
>
> Don't specify MAP_FIXED for MAP_SHARED mapping if you want
> proper coherency, that's my answer for this one.

I can't safely set up this kind of mapping without MAP_FIXED, unless I
know SHMLBA.

This is my strategy:

	mmap MAP_ANON without MAP_FIXED to find a free area
	mmap MAP_FIXED over the anon area at same address
	mmap MAP_FIXED over the anon area at larger address

I don't see any strategy that lets me establish this kind of circular
mapping on Sparc without either (a) knowing the value of SHMLBA, or
(b) risking clobbering another thread's mmap.

> >    3. Is there a kernel bug on Sparc, because the test program
> >       is either getting mappings that aren't aligned to run time
> >       SHMLBA, or the kernel's run time SHMLBA value is not correct.
>
> No, the user is allowed to hang himself with MAP_FIXED.
> The bug is in your code :)

Well, my code has no bug because I do run-time tests to see what
rubbish the architecture gave me.  As we see, they work :)

I don't see any real alternative to doing that.
But that's ok, it seems robust and portable.

It's a shame about the slow cache flush, because I can sometimes use
fast cache flushing to improve my DSP buffering algorithms.

> >    2. Is flushing part of the data cache something I can do from
> >       userspace?  (I'll figure out the exact machine instructions
> >       myself if I need to do this, but it'd be nice to know if
> >       it's possible before I have a go).
>
> There is no efficient way to do this from userspace, only the
> kernel has access to the more efficient cache flushing instructions.
> You'd need to flush via loads to displace the aliasing cache lines.

Will msync() do it?

Thanks,
-- Jamie

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: David S. Miller @ 2003-09-01  9:02 UTC
To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Mon, 1 Sep 2003 09:29:11 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> David S. Miller wrote:
> > I disagree, MAP_FIXED means "I know what I am doing, don't override
> > this unless the mapping area is not available in my address space."
> > You should never specify MAP_FIXED unless you _REALLY_ know what you
> > are doing.
>
> So explain this from the Sparc architecture code:
>
> 	if (flags & MAP_FIXED) {
> 		/* We do not accept a shared mapping if it would violate
> 		 * cache aliasing constraints.
> 		 */
> 		if ((flags & MAP_SHARED) && (addr & (SHMLBA - 1)))
> 			return -EINVAL;
> 		return addr;
> 	}
>
> Ok, I'll explain it :)  At one time, the code did what the comment says,
> but nowadays linux/mm/mmap.c doesn't call arch_get_unmapped_area() for
> MAP_FIXED, so the above code is redundant and misleading.  It already
> misled me, so please remove it.  sparc and sparc64 both have it.

I take back what I said, I think the -EINVAL behavior is better
and mmap.c should call into this code to verify the requested
mmap() parameters.

> This is my strategy:
>
> 	mmap MAP_ANON without MAP_FIXED to find a free area
> 	mmap MAP_FIXED over the anon area at same address
> 	mmap MAP_FIXED over the anon area at larger address
>
> I don't see any strategy that lets me establish this kind of circular
> mapping on Sparc without either (a) knowing the value of SHMLBA, or
> (b) risking clobbering another thread's mmap.

Why do you need the same piece of data mapped to multiple places
in the first place, and why at specific addresses?  It's purely an
optimization of some sort, right?

> Well, my code has no bug because I do run-time tests to see what
> rubbish the architecture gave me.  As we see, they work :)

It doesn't work in just the right set of circumstances, if interrupts
arrive at just the right moment it might flush the bad aliases out
of the cache via displacement during your 'check' phase.

Then during your actual computation you can hit the aliasing problem
silently.

That's just a bad way to do this.

> I don't see any real alternative to doing that.

I'd suggest instead to hardcode the SHMLBA stuff into your sources.

> But that's ok, it seems robust and portable.

Unfortunately, it is anything but robust.

> > There is no efficient way to do this from userspace, only the
> > kernel has access to the more efficient cache flushing instructions.
> > You'd need to flush via loads to displace the aliasing cache lines.
>
> Will msync() do it?

No.

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-09-01 10:04 UTC
To: David S. Miller; +Cc: mfedyk, lm, linux-kernel

David S. Miller wrote:
> Why do you need the same piece of data mapped to multiple places
> in the first place, and why at specific addresses?  It's purely an
> optimization of some sort, right?

Right.  It's a circular buffer for signal processing: DSP code sees
contiguous ascending addresses.  The multiple maps mean we don't have
to copy the contents of the buffer back to the start periodically, nor
mask the offset into the array on each memory access, nor write
extra-complicated DSP code which can handle split regions.

It's an optimisation, it works well on some architectures and on
others it's not worth it.  On those, I just copy - it keeps the DSP
code fast and simple.

> > Well, my code has no bug because I do run-time tests to see what
> > rubbish the architecture gave me.  As we see, they work :)
>
> It doesn't work in just the right set of circumstances, if interrupts
> arrive at just the right moment it might flush the bad aliases out
> of the cache via displacement during your 'check' phase.
>
> Then during your actual computation you can hit the aliasing problem
> silently.

To fool the coherence test, interrupts would need to arrive in a 2
instruction window, at least 8192 times.  It is possible, but unlikely
except in pathological situations.

Of course if you make mmap() return EINVAL then it cannot possibly fail :)

> I'd suggest instead to hardcode the SHMLBA stuff into your sources.

How?  SHMLBA is a run time value on the Sparc; I have no idea how
to work it out.

-- Jamie

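The circular mapping discussed above (reserve an area with an anonymous mmap, then place two MAP_FIXED shared mappings of the same object back to back) might be sketched as follows. This is an illustration under stated assumptions, not the test program's code: `ring_map` is a hypothetical name, and an unlinked temporary file from mkstemp() stands in for the POSIX shared memory object the thread mentions. It makes no attempt to respect SHMLBA, so on the aliasing architectures discussed in this thread the two views may not be coherent.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `size` bytes (a multiple of the page size)
 * twice, back to back.  Returns the base of the 2*size window, or NULL
 * on failure.  Byte i and byte i + size refer to the same storage. */
static void *ring_map(size_t size)
{
    char path[] = "/tmp/ring-XXXXXX";   /* assumption: /tmp is writable */
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                       /* backing file vanishes on close */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }

    /* Step 1: anonymous mapping without MAP_FIXED to find a free area. */
    char *base = mmap(NULL, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Steps 2 and 3: MAP_FIXED over the reserved area, first at the
     * same address, then at the larger address. */
    void *lo = mmap(base, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *hi = mmap(base + size, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * size);
        return NULL;
    }
    return base;
}
```

Because the reservation is made first, the two MAP_FIXED calls only ever replace memory this code already owns, which is the point of the strategy: it avoids clobbering another thread's mapping.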
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: David S. Miller @ 2003-09-01 10:02 UTC
To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Mon, 1 Sep 2003 11:04:58 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> Of course if you make mmap() return EINVAL then it cannot possibly fail :)

Right :-)

> > I'd suggest instead to hardcode the SHMLBA stuff into your sources.
>
> How?  SHMLBA is a run time value on the Sparc; I have no idea how
> to work it out.

You're talking about 32-bit sparc, on sparc64 it's a constant 16K.

For sparc 32-bit, just use 4MB, that's the largest possible value.

And you have to check this with uname() results, not with ifdefs
as 32-bit Sparc binaries run on sparc64 systems just fine.

I also would not object at all to a kernel patch that exported the
SHMLBA value via some sysctl value.

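The advice in this message (16K on sparc64, a conservative 4MB on 32-bit sparc, decided from uname() at run time) could look roughly like the sketch below. The function names are hypothetical, and the machine strings "sparc64" and "sparc" are an assumption about what uname() reports on those kernels; the two sizes are the ones quoted in the message.

```c
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical helper: conservative alias alignment for a given
 * uname() machine string.  Returns 0 when the machine is not one of
 * the Sparc variants discussed, leaving the decision to the caller. */
static unsigned long alias_align_for(const char *machine)
{
    if (strcmp(machine, "sparc64") == 0)
        return 16UL * 1024;           /* constant 16K, per the thread */
    if (strncmp(machine, "sparc", 5) == 0)
        return 4UL * 1024 * 1024;     /* largest possible value, per the thread */
    return 0;
}

/* Run-time lookup, so a 32-bit Sparc binary running on a sparc64
 * kernel picks the value for the kernel it is actually running on. */
static unsigned long alias_align(void)
{
    struct utsname u;
    if (uname(&u) != 0)
        return 0;
    return alias_align_for(u.machine);
}
```

Splitting the string comparison from the uname() call keeps the table easy to check on a machine that is not a Sparc.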
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: bill davidsen @ 2003-09-03 17:36 UTC
To: linux-kernel

In article <20030901020203.1779efe8.davem@redhat.com>,
David S. Miller <davem@redhat.com> wrote:

| > This is my strategy:
| >
| > 	mmap MAP_ANON without MAP_FIXED to find a free area
| > 	mmap MAP_FIXED over the anon area at same address
| > 	mmap MAP_FIXED over the anon area at larger address
| >
| > I don't see any strategy that lets me establish this kind of circular
| > mapping on Sparc without either (a) knowing the value of SHMLBA, or
| > (b) risking clobbering another thread's mmap.
|
| Why do you need the same piece of data mapped to multiple places
| in the first place, and why at specific addresses?  It's purely an
| optimization of some sort, right?

I think he said he was doing DSP... there's a trick of double mapping
the same memory to save one subscript calculation in FFT (or maybe DFT)
inner loop.

The only reason I know this is that a friend did a master's thesis on
DSP about 20 years ago, and I absorbed some info I hope to never need.
He also coded an FFT instruction in the LCS (programmable firmware) of a
VAX.

I am only speculating, of course.
--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-09-04 22:50 UTC
To: bill davidsen; +Cc: linux-kernel

bill davidsen wrote:
> | Why do you need the same piece of data mapped to multiple places
> | in the first place, and why at specific addresses?  It's purely an
> | optimization of some sort, right?
>
> I think he said he was doing DSP... there's a trick of double mapping
> the same memory to save one subscript calculation in FFT (or maybe DFT)
> inner loop.

It is for DSP, but nothing to do with FFT.  I hadn't ever thought of
using this technique for FFT, and it would probably make little
difference on a modern CPU given the form of FFT algorithms.

No, I use it to make a circular buffer, in which the data always
appears as a contiguous block - no split.  This is useful for
operations on streams of data, such as FIR & IIR filters, equalisers,
upconverters, downconverters, etc.  Many DSP algorithms fall into this
category.

A characteristic of these algorithms is that they consist of a long,
tight sequence of streaming memory accesses with calculations at each
step.

DSP chips often implement circular buffers by masking the offset into
the buffer's memory.  On a CPU, I prefer to avoid the masking
operation which happens for each address calculation.  This saves a
couple of registers, as I can just use an incrementing pointer into
the buffer, rather than a base address, offset and mask value.
Especially on x86, a couple of registers saved is good.

It's possible to write DSP algorithms which avoid address masking,
after all a circular buffer in an ordinary array is just two separate
regions.  But that complicates the algorithms especially with corner
cases, and some of them are complicated enough already.

Using the duplicate mappings, I can use the most straightforward
streaming DSP code, and it runs as fast as possible if the mappings
don't incur a penalty.

When mappings aren't available or are too slow, then I just copy the
contents of the buffer backwards whenever the write pointer will cross
the end of the array.  That costs some, but keeps the DSP code simple.

Fwiw, the test program assesses whether there's a cost to using
duplicate mappings and whether they work.  However, for the above kind
of DSP buffer, the measurement isn't the best it could be (although
it's what I'm using).

There's a balance of factors.  For a large buffer, it's ok even if
page faults were to be needed as we switch between alias pages,
because the access pattern doesn't do that very often.  Then the
occasional page faults are just a potentially faster version of the
copy backwards.

On the other hand, if aliased pages are made coherent by making them
uncacheable (such as the ARM port), even though that's much faster
than faulting, it isn't good for the DSP algorithms.

Fwiw#2, in the DSP I'm working on it's better to use the copying
method for most of my buffers even on x86, because they aren't that
large and fit better into L1 cache without the mappings.  Maybe for a
different project, it will get used for more of the buffers.

Mainly, having developed the testing code, I wanted to know if it
worked properly on the different architectures.  It's nice to see some
spin offs, such as finding the ARM write buffer bug.

So thanks to everyone who responded.  I'll post a table of the results soon.

Thanks,
-- Jamie

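The copy-back fallback mentioned in this message (copy the buffer contents backwards whenever the write pointer would cross the end of the array) can be sketched as below. This is only an illustration of the idea, not the author's DSP code: the `copy_ring` type, the function names and the `HIST` and `CAP` sizes are all hypothetical. The filter always sees its most recent `HIST` samples as one contiguous block, with no masking of the read offset.

```c
#include <string.h>

#define HIST 4    /* samples of history the filter needs (illustrative) */
#define CAP  16   /* array length, must be larger than HIST (illustrative) */

struct copy_ring {
    float  data[CAP];
    size_t wr;    /* index of the next sample to write */
};

/* Start with HIST - 1 zero samples of history already in place. */
static void copy_ring_init(struct copy_ring *r)
{
    memset(r->data, 0, sizeof r->data);
    r->wr = HIST - 1;
}

/* Append one sample and return a pointer to the newest HIST samples,
 * oldest first, as one contiguous run. */
static const float *copy_ring_push(struct copy_ring *r, float x)
{
    if (r->wr == CAP) {
        /* Write pointer would cross the end: copy the newest
         * HIST - 1 samples back to the start of the array. */
        memmove(r->data, r->data + CAP - (HIST - 1),
                (HIST - 1) * sizeof r->data[0]);
        r->wr = HIST - 1;
    }
    r->data[r->wr++] = x;
    return r->data + r->wr - HIST;
}
```

The copy happens once every `CAP - HIST + 1` pushes, so a larger array trades memory (and L1 cache footprint) for fewer copies.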
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Jamie Lokier @ 2003-09-01  5:44 UTC
To: Larry McVoy, linux-kernel

Larry McVoy wrote:
> On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> > I'd appreciate if folks would run the program below on various
> > machines, especially those whose caches aren't automatically coherent
> > at the hardware level.
>
> Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC

Thanks Larry.  That's a great range you have!
Collected and will be posted shortly in a table with the others.

> If you care, I also have freebsd (v2, v3, v4), netbsd 1.5, openbsd 3.0 (all
> bsd systems are x86, mostly celerons), hpux 10.20, sco, solaris, solaris/x86,
> Irix, MacOS X, AIX, Tru64 and probably some others.

AIX would be interesting; I don't have an RS6000.  The rest of the
CPUs I have results for, and it sounds like a lot of effort for what's
basically a compile/compatibility test.  However, if it's very little
effort for you to run the test on them please do!

Thanks,
-- Jamie

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
From: Larry McVoy @ 2003-09-01 14:43 UTC
To: Jamie Lokier; +Cc: Larry McVoy, linux-kernel

Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC, s390
on Linux and hpux/parisc, {freebsd, netbsd, openbsd}/x86, sco/x86,
solaris/sparc, solaris/x86, irix/mips, osx/ppc, aix/ppc, tru64/alpha.

This is most of our test machines, it doesn't include all the Windows boxes
but I figured you didn't care.

The version of test.c is the one you posted later.  If I got it wrong send
me the latest.

work ~/jamie wc test.c
    773    3726   25064 test.c
work ~/jamie md5sum test.c
1e7b9e6fa525c21211abbb8986d7b2e7  test.c

I'm a little concerned I have the wrong test, why would a 2.1Ghz Athlon
say it is too slow?

Format:

==== host name ====
Notes (may be blank)
Results
uname -a output
/proc/cpuinfo (if there)

==== aix ====
332Mhz 604e 7043-150
Test separation: 4096 bytes: FAIL - alias map failed
Test separation: 8192 bytes: FAIL - alias map failed
Test separation: 16384 bytes: FAIL - alias map failed
Test separation: 32768 bytes: FAIL - alias map failed
Test separation: 65536 bytes: FAIL - alias map failed
Test separation: 131072 bytes: FAIL - alias map failed
Test separation: 262144 bytes: FAIL - alias map failed
Test separation: 524288 bytes: FAIL - alias map failed
Test separation: 1048576 bytes: FAIL - alias map failed
Test separation: 2097152 bytes: FAIL - alias map failed
Test separation: 4194304 bytes: FAIL - alias map failed
Test separation: 8388608 bytes: FAIL - alias map failed
Test separation: 16777216 bytes: FAIL - alias map failed
VM page alias coherency test: failed; will use copy buffers instead
AIX aix 1 4 004376804C00

==== alpha ====
PC something-164, that really common cheapo motherboard/test kit.
(512) [14,14,0] Test separation: 8192 bytes: pass (512) [14,14,0] Test separation: 16384 bytes: pass (512) [14,14,0] Test separation: 32768 bytes: pass (512) [14,14,0] Test separation: 65536 bytes: pass (512) [14,14,0] Test separation: 131072 bytes: pass (512) [14,14,0] Test separation: 262144 bytes: pass (512) [14,14,0] Test separation: 524288 bytes: pass (512) [14,14,0] Test separation: 1048576 bytes: pass (512) [14,14,0] Test separation: 2097152 bytes: pass (512) [14,14,0] Test separation: 4194304 bytes: pass (512) [14,14,0] Test separation: 8388608 bytes: pass (512) [14,14,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux alpha.bitmover.com 2.4.21-pre5 #2 Thu Mar 20 07:54:03 PST 2003 alpha unknown cpu : Alpha cpu model : EV56 cpu variation : 7 cpu revision : 0 cpu serial number : system type : EB164 system variation : PC164 system revision : 0 system serial number : cycle frequency [Hz] : 500000000 timer frequency [Hz] : 1024.00 page size [bytes] : 8192 phys. address bits : 40 max. addr. 
space # : 127 BogoMIPS : 992.88 kernel unaligned acc : 0 (pc=0,va=0) user unaligned acc : 0 (pc=0,va=0) platform string : Digital AlphaPC 164 500 MHz cpus detected : 1 ==== disks ==== (128) [17,1,0] Test separation: 4096 bytes: FAIL - too slow (128) [17,1,0] Test separation: 8192 bytes: FAIL - too slow (128) [17,1,0] Test separation: 16384 bytes: FAIL - too slow (1024) [10,13,0] Test separation: 32768 bytes: pass (1024) [10,13,0] Test separation: 65536 bytes: pass (1024) [10,13,0] Test separation: 131072 bytes: pass (1024) [10,13,0] Test separation: 262144 bytes: pass (1024) [10,13,0] Test separation: 524288 bytes: pass (1024) [10,13,0] Test separation: 1048576 bytes: pass (1024) [10,13,0] Test separation: 2097152 bytes: pass (1024) [10,13,0] Test separation: 4194304 bytes: pass (1024) [10,13,0] Test separation: 8388608 bytes: pass (1024) [10,13,0] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) Linux disks.bitmover.com 2.4.18-14 #1 Wed Sep 4 12:13:11 EDT 2002 i686 athlon i386 GNU/Linux processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) XP 1900+ stepping : 2 cpu MHz : 1593.143 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 3172.64 ==== freebsd ==== (512) [32,32,1] Test separation: 4096 bytes: pass (512) [32,32,1] Test separation: 8192 bytes: pass (512) [32,32,1] Test separation: 16384 bytes: pass (512) [32,32,1] Test separation: 32768 bytes: pass (512) [32,32,1] Test separation: 65536 bytes: pass (512) [32,32,1] Test separation: 131072 bytes: pass (512) [32,32,1] Test separation: 262144 bytes: pass (512) [32,32,1] Test separation: 524288 bytes: pass (512) [32,32,1] Test separation: 1048576 bytes: pass (512) [32,32,1] Test separation: 2097152 bytes: pass (512) 
[32,32,1] Test separation: 4194304 bytes: pass (512) [32,32,1] Test separation: 8388608 bytes: pass (512) [32,32,1] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed FreeBSD freebsd.bitmover.com 2.2.8-RELEASE FreeBSD 2.2.8-RELEASE #0: Mon Nov 30 06:34:08 GMT 1998 jkh@time.cdrom.com:/usr/src/sys/compile/GENERIC i386 ==== freebsd3 ==== (64) [33,3,1] Test separation: 4096 bytes: FAIL - too slow (64) [33,3,1] Test separation: 8192 bytes: FAIL - too slow (512) [19,26,1] Test separation: 16384 bytes: pass (512) [19,26,1] Test separation: 32768 bytes: pass (512) [19,26,1] Test separation: 65536 bytes: pass (512) [19,26,1] Test separation: 131072 bytes: pass (512) [19,26,1] Test separation: 262144 bytes: pass (512) [19,26,1] Test separation: 524288 bytes: pass (512) [19,26,1] Test separation: 1048576 bytes: pass (512) [19,26,1] Test separation: 2097152 bytes: pass (512) [19,26,1] Test separation: 4194304 bytes: pass (512) [19,26,1] Test separation: 8388608 bytes: pass (512) [19,26,1] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) FreeBSD freebsd3.bitmover.com 3.2-RELEASE FreeBSD 3.2-RELEASE #0: Fri Jun 2 11:34:52 PDT 2000 root@freebsd3.bitmover.com:/usr/src/sys/compile/DAVICOM i386 ==== freebsd4 ==== (256) [92,26,5] Test separation: 4096 bytes: FAIL - too slow (256) [92,26,5] Test separation: 8192 bytes: FAIL - too slow (1024) [75,101,5] Test separation: 16384 bytes: pass (1024) [75,101,5] Test separation: 32768 bytes: pass (1024) [75,101,5] Test separation: 65536 bytes: pass (1024) [75,101,5] Test separation: 131072 bytes: pass (1024) [75,101,5] Test separation: 262144 bytes: pass (1024) [75,101,5] Test separation: 524288 bytes: pass (1024) [75,101,5] Test separation: 1048576 bytes: pass (1024) [75,101,5] Test separation: 2097152 bytes: pass (1024) [75,101,5] Test separation: 4194304 bytes: pass (1024) [75,101,5] Test separation: 8388608 bytes: pass (1024) [75,101,5] Test 
separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) FreeBSD freebsd4.bitmover.com 4.1-RELEASE FreeBSD 4.1-RELEASE #0: Fri Jul 28 14:30:31 GMT 2000 jkh@ref4.freebsd.org:/usr/src/sys/compile/GENERIC i386 ==== hp ==== C360, HPUX 10.20 Test separation: 4096 bytes: FAIL - alias map failed Test separation: 8192 bytes: FAIL - alias map failed Test separation: 16384 bytes: FAIL - alias map failed Test separation: 32768 bytes: FAIL - alias map failed Test separation: 65536 bytes: FAIL - alias map failed Test separation: 131072 bytes: FAIL - alias map failed Test separation: 262144 bytes: FAIL - alias map failed Test separation: 524288 bytes: FAIL - alias map failed Test separation: 1048576 bytes: FAIL - alias map failed Test separation: 2097152 bytes: FAIL - alias map failed Test separation: 4194304 bytes: FAIL - alias map failed Test separation: 8388608 bytes: FAIL - alias map failed Test separation: 16777216 bytes: FAIL - alias map failed VM page alias coherency test: failed; will use copy buffers instead HP-UX hp B.10.20 A 9000/785 2004452144 two-user license ==== ia64 ==== (512) [17,17,0] Test separation: 16384 bytes: pass (512) [17,17,0] Test separation: 32768 bytes: pass (512) [17,17,0] Test separation: 65536 bytes: pass (512) [17,17,0] Test separation: 131072 bytes: pass (512) [17,17,0] Test separation: 262144 bytes: pass (512) [17,17,0] Test separation: 524288 bytes: pass (512) [17,17,0] Test separation: 1048576 bytes: pass (512) [17,17,0] Test separation: 2097152 bytes: pass (512) [17,17,0] Test separation: 4194304 bytes: pass (512) [17,17,0] Test separation: 8388608 bytes: pass (512) [17,17,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux ia64.bitmover.com 2.4.9-18smp #1 SMP Tue Dec 11 12:59:00 EST 2001 ia64 unknown processor : 0 vendor : GenuineIntel arch : IA-64 family : Itanium model : 0 revision : 7 archrev : 0 features : standard cpu number : 0 cpu regs : 4 cpu MHz : 
799.486992 itc MHz : 799.486992 BogoMIPS : 796.91 processor : 1 vendor : GenuineIntel arch : IA-64 family : Itanium model : 0 revision : 7 archrev : 0 features : standard cpu number : 0 cpu regs : 4 cpu MHz : 799.486992 itc MHz : 799.486992 BogoMIPS : 796.91 ==== macos ==== Imac, OS X 10.2 (2048) [67,67,3] Test separation: 4096 bytes: pass (2048) [67,67,3] Test separation: 8192 bytes: pass (2048) [67,67,3] Test separation: 16384 bytes: pass (2048) [67,67,3] Test separation: 32768 bytes: pass (2048) [67,67,3] Test separation: 65536 bytes: pass (2048) [67,67,3] Test separation: 131072 bytes: pass (2048) [67,67,3] Test separation: 262144 bytes: pass (2048) [67,67,3] Test separation: 524288 bytes: pass (2048) [67,67,3] Test separation: 1048576 bytes: pass (2048) [67,67,3] Test separation: 2097152 bytes: pass (2048) [67,67,3] Test separation: 4194304 bytes: pass (2048) [67,67,3] Test separation: 8388608 bytes: pass (2048) [67,67,3] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Darwin macos.bitmover.com 6.6 Darwin Kernel Version 6.6: Thu May 1 21:48:54 PDT 2003; root:xnu/xnu-344.34.obj~1/RELEASE_PPC Power Macintosh powerpc ==== mips ==== (64) [276,11,2] Test separation: 4096 bytes: FAIL - too slow (64) [276,11,2] Test separation: 8192 bytes: FAIL - too slow (128) [26,43,2] Test separation: 16384 bytes: pass (128) [26,43,2] Test separation: 32768 bytes: pass (128) [26,43,2] Test separation: 65536 bytes: pass (128) [26,43,2] Test separation: 131072 bytes: pass (128) [26,43,2] Test separation: 262144 bytes: pass (128) [26,43,2] Test separation: 524288 bytes: pass (128) [26,43,2] Test separation: 1048576 bytes: pass (128) [26,43,2] Test separation: 2097152 bytes: pass (128) [26,43,2] Test separation: 4194304 bytes: pass (128) [26,43,2] Test separation: 8388608 bytes: pass (128) [26,43,2] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) Linux mips 2.4.18-r4k-ip22 #1 Sun Jun 23 
15:30:50 CEST 2002 mips unknown system type : SGI Indy processor : 0 cpu model : R4000SC V6.0 FPU V0.0 BogoMIPS : 86.83 byteorder : big endian wait instruction : no microsecond timers : yes tlb_entries : 48 extra interrupt vector : no hardware watchpoint : yes VCED exceptions : 8055726 VCEI exceptions : 0 ==== netbsd ==== (1024) [53,53,4] Test separation: 4096 bytes: pass (2048) [106,106,4] Test separation: 8192 bytes: pass (2048) [104,105,5] Test separation: 16384 bytes: pass (2048) [105,104,5] Test separation: 32768 bytes: pass (2048) [105,104,5] Test separation: 65536 bytes: pass (2048) [104,104,5] Test separation: 131072 bytes: pass (2048) [105,105,5] Test separation: 262144 bytes: pass (2048) [105,105,5] Test separation: 524288 bytes: pass (1024) [53,53,4] Test separation: 1048576 bytes: pass (2048) [104,104,5] Test separation: 2097152 bytes: pass (2048) [106,106,4] Test separation: 4194304 bytes: pass (2048) [105,106,4] Test separation: 8388608 bytes: pass (2048) [104,105,5] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed NetBSD netbsd.bitmover.com 1.5 NetBSD 1.5 (GENERIC) #1: Sun Nov 19 21:42:11 MET 2000 fvdl@sushi:/work/trees/netbsd-1-5/sys/arch/i386/compile/GENERIC i386 ==== netwinder ==== Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: FAIL - cache not coherent Test separation: 32768 bytes: FAIL - cache not coherent Test separation: 65536 bytes: FAIL - cache not coherent Test separation: 131072 bytes: FAIL - cache not coherent Test separation: 262144 bytes: FAIL - cache not coherent Test separation: 524288 bytes: FAIL - cache not coherent Test separation: 1048576 bytes: FAIL - cache not coherent Test separation: 2097152 bytes: FAIL - cache not coherent Test separation: 4194304 bytes: FAIL - cache not coherent Test separation: 8388608 bytes: FAIL - cache not coherent Test separation: 16777216 bytes: FAIL - cache not coherent VM 
page alias coherency test: failed; will use copy buffers instead Linux netwinder 2.2.12-19991020 #1 Wed Oct 20 13:09:07 EDT 1999 armv4l unknown Processor : Intel sa110 rev 3 BogoMips : 262.14 Hardware : Rebel-NetWinder Serial # : 3464 Revision : 52ff ==== openbsd ==== (512) [27,27,1] Test separation: 4096 bytes: pass (512) [27,27,1] Test separation: 8192 bytes: pass (512) [27,27,1] Test separation: 16384 bytes: pass (512) [27,27,1] Test separation: 32768 bytes: pass (512) [27,27,1] Test separation: 65536 bytes: pass (512) [27,27,1] Test separation: 131072 bytes: pass (512) [27,27,1] Test separation: 262144 bytes: pass (512) [27,27,1] Test separation: 524288 bytes: pass (512) [27,27,1] Test separation: 1048576 bytes: pass (512) [27,27,1] Test separation: 2097152 bytes: pass (512) [27,27,1] Test separation: 4194304 bytes: pass (512) [27,27,1] Test separation: 8388608 bytes: pass (512) [27,27,1] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed OpenBSD openbsd 3.0 GENERIC#94 i386 ==== parisc ==== A500 Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: FAIL - cache not coherent Test separation: 32768 bytes: FAIL - cache not coherent Test separation: 65536 bytes: FAIL - cache not coherent Test separation: 131072 bytes: FAIL - cache not coherent Test separation: 262144 bytes: FAIL - store buffer not coherent Test separation: 524288 bytes: FAIL - store buffer not coherent Test separation: 1048576 bytes: FAIL - store buffer not coherent Test separation: 2097152 bytes: FAIL - store buffer not coherent (2048) [41,41,2] Test separation: 4194304 bytes: pass (2048) [41,41,2] Test separation: 8388608 bytes: pass (2048) [41,41,2] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 4194304 (1024 pages) Linux parisc 2.4.17-64 #1 Sat Mar 16 17:31:44 MST 2002 parisc64 unknown processor : 0 cpu family : PA-RISC 2.0 cpu : 
PA8600 (PCX-W+) cpu MHz : 550.000000 model : 9000/800/A500-5X model name : Crescendo 550 hversion : 0x00005d50 sversion : 0x00000491 I-cache : 512 KB D-cache : 1024 KB (WB) ITLB entries : 160 DTLB entries : 160 - shared with ITLB bogomips : 1097.72 software id : 580790518 ==== ppc ==== (1024) [40,40,1] Test separation: 4096 bytes: pass (1024) [40,40,1] Test separation: 8192 bytes: pass (1024) [40,40,1] Test separation: 16384 bytes: pass (1024) [40,40,1] Test separation: 32768 bytes: pass (1024) [40,40,1] Test separation: 65536 bytes: pass (1024) [40,40,1] Test separation: 131072 bytes: pass (1024) [40,40,1] Test separation: 262144 bytes: pass (1024) [40,40,1] Test separation: 524288 bytes: pass (1024) [40,40,1] Test separation: 1048576 bytes: pass (1024) [40,40,1] Test separation: 2097152 bytes: pass (1024) [40,40,1] Test separation: 4194304 bytes: pass (1024) [40,40,1] Test separation: 8388608 bytes: pass (1024) [40,40,1] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux ppc.bitmover.com 2.4.6-pre2 #2 Sun Jun 10 20:21:17 PDT 2001 ppc unknown processor : 0 cpu : 750 temperature : 0 C clock : 333MHz revision : 2.2 bogomips : 665.69 zero pages : total: 0 (0Kb) current: 0 (0Kb) hits: 0/0 (0%) machine : iMac,1 motherboard : iMac MacRISC Power Macintosh L2 cache : 512K unified memory : 160MB pmac-generation : NewWorld ==== qube ==== Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent (512) [47,47,2] Test separation: 16384 bytes: pass (512) [47,47,2] Test separation: 32768 bytes: pass (512) [47,47,2] Test separation: 65536 bytes: pass (512) [47,47,2] Test separation: 131072 bytes: pass (512) [47,47,2] Test separation: 262144 bytes: pass (512) [47,47,2] Test separation: 524288 bytes: pass (512) [47,47,2] Test separation: 1048576 bytes: pass (512) [47,47,2] Test separation: 2097152 bytes: pass (512) [47,47,2] Test separation: 4194304 bytes: pass (512) [47,47,2] Test 
separation: 8388608 bytes: pass (512) [47,47,2] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) Linux qube.bitmover.com 2.0.34 #1 Thu Jan 28 03:03:03 PST 1999 mips unknown cpu : MIPS cpu model : Nevada V10.0 system type : Cobalt Microserver 27 BogoMIPS : 249.86 byteorder : little endian unaligned accesses : 16 wait instruction : yes microsecond timers : yes extra interrupt vector : yes hardware watchpoint : no ==== redhat52 ==== (256) [12,12,0] Test separation: 4096 bytes: pass (256) [12,12,0] Test separation: 8192 bytes: pass (256) [12,12,0] Test separation: 16384 bytes: pass (256) [12,12,0] Test separation: 32768 bytes: pass (256) [12,12,0] Test separation: 65536 bytes: pass (256) [12,12,0] Test separation: 131072 bytes: pass (256) [12,12,0] Test separation: 262144 bytes: pass (256) [12,12,0] Test separation: 524288 bytes: pass (256) [12,12,0] Test separation: 1048576 bytes: pass (256) [12,12,0] Test separation: 2097152 bytes: pass (256) [12,12,0] Test separation: 4194304 bytes: pass (256) [12,12,0] Test separation: 8388608 bytes: pass (256) [12,12,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux redhat52.bitmover.com 2.2.15pre9 #10 Sat Apr 8 17:59:35 PDT 2000 i686 unknown processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : Celeron (Mendocino) stepping : 5 cpu MHz : 534.561273 cache size : 128 KB fdiv_bug : no hlt_bug : no sep_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 532.48 ==== redhat62 ==== (256) [12,12,0] Test separation: 4096 bytes: pass (256) [12,12,0] Test separation: 8192 bytes: pass (256) [12,12,0] Test separation: 16384 bytes: pass (256) [12,12,0] Test separation: 32768 bytes: pass (256) [12,12,0] Test separation: 65536 bytes: pass (256) [12,12,0] Test separation: 131072 bytes: pass 
(256) [12,12,0] Test separation: 262144 bytes: pass (256) [12,12,0] Test separation: 524288 bytes: pass (256) [12,12,0] Test separation: 1048576 bytes: pass (256) [12,12,0] Test separation: 2097152 bytes: pass (256) [12,12,0] Test separation: 4194304 bytes: pass (256) [12,12,0] Test separation: 8388608 bytes: pass (256) [12,12,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux redhat62.bitmover.com 2.2.14-5.0 #1 Tue Mar 7 21:07:39 EST 2000 i686 unknown processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : Celeron (Mendocino) stepping : 5 cpu MHz : 534.552424 cache size : 128 KB fdiv_bug : no hlt_bug : no sep_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 532.48 ==== redhat71 ==== (256) [14,14,0] Test separation: 4096 bytes: pass (256) [14,14,0] Test separation: 8192 bytes: pass (256) [14,14,0] Test separation: 16384 bytes: pass (256) [14,14,0] Test separation: 32768 bytes: pass (256) [14,14,0] Test separation: 65536 bytes: pass (256) [14,14,0] Test separation: 131072 bytes: pass (256) [14,14,0] Test separation: 262144 bytes: pass (256) [14,14,0] Test separation: 524288 bytes: pass (256) [14,14,0] Test separation: 1048576 bytes: pass (256) [14,14,0] Test separation: 2097152 bytes: pass (256) [14,14,0] Test separation: 4194304 bytes: pass (256) [14,14,0] Test separation: 8388608 bytes: pass (256) [14,14,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux redhat71.bitmover.com 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : Celeron (Mendocino) stepping : 5 cpu MHz : 467.739 cache size : 128 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 
sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 933.88 ==== sco ==== (1024) [48,48,2] Test separation: 4096 bytes: pass (1024) [48,48,2] Test separation: 8192 bytes: pass (1024) [48,48,2] Test separation: 16384 bytes: pass (1024) [48,48,2] Test separation: 32768 bytes: pass (1024) [48,48,2] Test separation: 65536 bytes: pass (1024) [48,48,2] Test separation: 131072 bytes: pass (1024) [48,48,1] Test separation: 262144 bytes: pass (1024) [49,49,1] Test separation: 524288 bytes: pass (1024) [48,48,2] Test separation: 1048576 bytes: pass (1024) [48,48,2] Test separation: 2097152 bytes: pass (1024) [48,48,2] Test separation: 4194304 bytes: pass (1024) [48,48,2] Test separation: 8388608 bytes: pass (1024) [48,48,2] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed SCO_SV sco 3.2 5.0.7 i386 ==== sgi ==== FPU: MIPS R10010 Floating Point Chip Revision: 0.0 CPU: MIPS R10000 Processor Chip Revision: 2.6 1 195 MHZ IP28 Processor Main memory size: 192 Mbytes Secondary unified instruction/data cache size: 1 Mbyte Instruction cache size: 32 Kbytes Data cache size: 32 Kbytes (1024) [103,103,5] Test separation: 16384 bytes: pass (1024) [103,103,5] Test separation: 32768 bytes: pass (1024) [103,103,5] Test separation: 65536 bytes: pass (1024) [103,103,5] Test separation: 131072 bytes: pass (1024) [103,103,5] Test separation: 262144 bytes: pass (1024) [103,103,5] Test separation: 524288 bytes: pass (1024) [103,103,5] Test separation: 1048576 bytes: pass (1024) [103,103,5] Test separation: 2097152 bytes: pass (1024) [103,103,5] Test separation: 4194304 bytes: pass (1024) [103,103,5] Test separation: 8388608 bytes: pass (1024) [103,103,5] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed IRIX64 sgi 6.5 10120105 IP28 ==== slovax ==== (128) [12,1,0] Test separation: 4096 bytes: FAIL - too slow (128) [12,1,0] Test separation: 8192 bytes: FAIL - too slow (128) [12,1,0] Test separation: 16384 bytes: FAIL - too slow 
(2048) [15,16,0] Test separation: 32768 bytes: pass (2048) [13,16,0] Test separation: 65536 bytes: pass (2048) [13,16,0] Test separation: 131072 bytes: pass (2048) [15,16,0] Test separation: 262144 bytes: pass (2048) [15,16,0] Test separation: 524288 bytes: pass (2048) [15,16,0] Test separation: 1048576 bytes: pass (2048) [15,16,0] Test separation: 2097152 bytes: pass (2048) [15,16,0] Test separation: 4194304 bytes: pass (2048) [15,16,0] Test separation: 8388608 bytes: pass (2048) [13,16,0] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) Linux slovax.bitmover.com 2.4.18-14 #1 Wed Sep 4 12:13:11 EDT 2002 i686 athlon i386 GNU/Linux processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 8 model name : AMD Athlon(tm) XP 2700+ stepping : 1 cpu MHz : 2162.685 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 4297.33 ==== sparc ==== Test separation: 8192 bytes: FAIL - cache not coherent (1024) [65,71,2] Test separation: 16384 bytes: pass (1024) [65,68,2] Test separation: 32768 bytes: pass (512) [2,50,2] Test separation: 65536 bytes: pass (512) [33,19,2] Test separation: 131072 bytes: pass (512) [33,20,2] Test separation: 262144 bytes: pass (512) [33,50,2] Test separation: 524288 bytes: pass (512) [33,19,2] Test separation: 1048576 bytes: pass (1024) [35,68,2] Test separation: 2097152 bytes: pass (512) [33,42,2] Test separation: 4194304 bytes: pass (512) [2,50,2] Test separation: 8388608 bytes: pass (512) [5,50,2] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (2 pages) Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown cpu : TI UltraSparc IIi fpu : UltraSparc IIi integrated FPU promlib : Version 3 Revision 11 prom : 
3.11.12 type : sun4u ncpus probed : 1 ncpus active : 1 BogoMips : 539.03 MMU Type : Spitfire ==== sun ==== cpu0: SUNW,UltraSPARC-II (upaid 0 impl 0x11 ver 0x20 clock 296 MHz) cpu1: SUNW,UltraSPARC-II (upaid 1 impl 0x11 ver 0x20 clock 296 MHz) SunOS Release 5.6 Version Generic_105181-05 [UNIX(R) System V Release 4.0] (128) [11,7,0] Test separation: 8192 bytes: pass (256) [15,21,0] Test separation: 16384 bytes: pass (256) [15,21,0] Test separation: 32768 bytes: pass (256) [15,21,0] Test separation: 65536 bytes: pass (256) [15,21,0] Test separation: 131072 bytes: pass (256) [15,21,0] Test separation: 262144 bytes: pass (256) [15,21,0] Test separation: 524288 bytes: pass (256) [15,21,0] Test separation: 1048576 bytes: pass (256) [15,21,0] Test separation: 2097152 bytes: pass (256) [15,21,0] Test separation: 4194304 bytes: pass (256) [15,21,0] Test separation: 8388608 bytes: pass (256) [15,21,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed SunOS sun 5.6 Generic_105181-05 sun4u sparc SUNW,Ultra-2 ==== sunx86 ==== 2x 450Mhz Xeons (512) [29,29,1] Test separation: 4096 bytes: pass (512) [29,29,1] Test separation: 8192 bytes: pass (512) [29,29,1] Test separation: 16384 bytes: pass (512) [29,29,1] Test separation: 32768 bytes: pass (512) [29,29,1] Test separation: 65536 bytes: pass (512) [29,29,1] Test separation: 131072 bytes: pass (512) [29,29,1] Test separation: 262144 bytes: pass (512) [29,29,1] Test separation: 524288 bytes: pass (512) [29,29,1] Test separation: 1048576 bytes: pass (512) [29,29,1] Test separation: 2097152 bytes: pass (512) [29,29,1] Test separation: 4194304 bytes: pass (512) [29,29,1] Test separation: 8388608 bytes: pass (512) [29,29,1] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed SunOS sunx86.bitmover.com 5.7 Generic_106542-18 i86pc i386 i86pc ==== tru64 ==== 600AU (nicely made machine) (65536) [976,976,0] Test separation: 8192 bytes: pass (65536) [976,976,0] Test 
separation: 16384 bytes: pass (65536) [976,976,0] Test separation: 32768 bytes: pass (65536) [976,976,0] Test separation: 65536 bytes: pass (65536) [976,976,0] Test separation: 131072 bytes: pass (65536) [976,976,0] Test separation: 262144 bytes: pass (65536) [976,976,0] Test separation: 524288 bytes: pass (65536) [976,976,0] Test separation: 1048576 bytes: pass (65536) [976,976,0] Test separation: 2097152 bytes: pass (65536) [976,976,0] Test separation: 4194304 bytes: pass (65536) [976,976,0] Test separation: 8388608 bytes: pass (65536) [976,976,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed OSF1 tru64.bitmover.com V5.1 2650 alpha ==== winxp ==== I just did a gcc on this system, I have no idea what that did but it didn't complain so it did something. win32-xp /build/jamie ./a.exe Test separation: 4096 bytes: FAIL - alias map failed Test separation: 8192 bytes: FAIL - alias map failed Test separation: 16384 bytes: FAIL - alias map failed Test separation: 32768 bytes: FAIL - alias map failed Test separation: 65536 bytes: FAIL - alias map failed Test separation: 131072 bytes: FAIL - alias map failed Test separation: 262144 bytes: FAIL - alias map failed Test separation: 524288 bytes: FAIL - alias map failed Test separation: 1048576 bytes: FAIL - alias map failed Test separation: 2097152 bytes: FAIL - alias map failed Test separation: 4194304 bytes: FAIL - alias map failed Test separation: 8388608 bytes: FAIL - alias map failed Test separation: 16777216 bytes: FAIL - alias map failed VM page alias coherency test: failed; will use copy buffers instead === zseries/RedHat === (256) [11,11,0] Test separation: 4096 bytes: pass (256) [11,11,0] Test separation: 8192 bytes: pass (256) [11,11,0] Test separation: 16384 bytes: pass (256) [11,11,0] Test separation: 32768 bytes: pass (256) [11,11,0] Test separation: 65536 bytes: pass (256) [11,11,0] Test separation: 131072 bytes: pass (256) [11,11,0] Test separation: 262144 bytes: pass 
(256) [11,11,0] Test separation: 524288 bytes: pass (256) [11,13,0] Test separation: 1048576 bytes: pass (256) [11,13,0] Test separation: 2097152 bytes: pass (256) [11,13,0] Test separation: 4194304 bytes: pass (256) [11,13,0] Test separation: 8388608 bytes: pass (256) [11,13,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux l006034.zseriespenguins.ihost.com 2.4.9-38 #1 SMP Tue Sep 10 00:16:26 CEST 2002 s390 unknown vendor_id : IBM/S390 # processors : 1 bogomips per cpu: 612.76 processor 0: version = FF, identification = 049321, machine = 9672 === zseries/SuSE === (512) [21,21,1] Test separation: 4096 bytes: pass (256) [11,11,0] Test separation: 8192 bytes: pass (512) [21,21,1] Test separation: 16384 bytes: pass (512) [21,21,1] Test separation: 32768 bytes: pass (512) [21,21,1] Test separation: 65536 bytes: pass (512) [22,22,0] Test separation: 131072 bytes: pass (512) [22,22,0] Test separation: 262144 bytes: pass (512) [21,21,1] Test separation: 524288 bytes: pass (512) [21,25,1] Test separation: 1048576 bytes: pass (512) [22,26,0] Test separation: 2097152 bytes: pass (256) [11,13,0] Test separation: 4194304 bytes: pass (512) [22,26,0] Test separation: 8388608 bytes: pass (512) [21,25,1] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed Linux lh003022 2.2.16 #6 SMP Wed May 23 16:39:31 EDT 2001 s390 unknown vendor_id : IBM/S390 # processors : 1 bogomips per cpu: 581.63 processor 0: version = FF, identification = 049321, machine = 9672 ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 14:43 ` Larry McVoy @ 2003-09-01 16:33 ` Jamie Lokier 2003-09-01 16:58 ` Larry McVoy 2003-09-02 20:29 ` Jamie Lokier 1 sibling, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 16:33 UTC (permalink / raw) To: Larry McVoy, Larry McVoy, linux-kernel Larry McVoy wrote: > I'm a little concerned I have the wrong test, why would a 2.1Ghz Athlon > say it is too slow? It's the right test. "too slow" means that where shared memory is mapped at a certain separation, alternating accesses between the different virtual addresses are much slower (10-20 times) than if the underlying mapped memory is not shared. All Athlons show this slowdown for any virtual address separation which is not a multiple of 32k. No Intels do, with the possible exception of a P4 which showed inconsistent results and needs further investigation. Your freebsds don't say what CPU they are, but let me guess.. freebsd isn't an AMD freebsd3 and freebsd4 are both AMD K6, and freebsd3 is the faster -- Jamie > ==== freebsd ==== > (512) [32,32,1] Test separation: 4096 bytes: pass ... 
> FreeBSD freebsd.bitmover.com 2.2.8-RELEASE FreeBSD 2.2.8-RELEASE #0: Mon Nov 30 06:34:08 GMT 1998 jkh@time.cdrom.com:/usr/src/sys/compile/GENERIC i386 > ==== freebsd3 ==== > (64) [33,3,1] Test separation: 4096 bytes: FAIL - too slow > (64) [33,3,1] Test separation: 8192 bytes: FAIL - too slow > (512) [19,26,1] Test separation: 16384 bytes: pass > VM page alias coherency test: minimum fast spacing: 16384 (4 pages) > > FreeBSD freebsd3.bitmover.com 3.2-RELEASE FreeBSD 3.2-RELEASE #0: Fri Jun 2 11:34:52 PDT 2000 root@freebsd3.bitmover.com:/usr/src/sys/compile/DAVICOM i386 > > ==== freebsd4 ==== > (256) [92,26,5] Test separation: 4096 bytes: FAIL - too slow > (256) [92,26,5] Test separation: 8192 bytes: FAIL - too slow > (1024) [75,101,5] Test separation: 16384 bytes: pass > VM page alias coherency test: minimum fast spacing: 16384 (4 pages) > > FreeBSD freebsd4.bitmover.com 4.1-RELEASE FreeBSD 4.1-RELEASE #0: Fri Jul 28 14:30:31 GMT 2000 jkh@ref4.freebsd.org:/usr/src/sys/compile/GENERIC i386 ^ permalink raw reply [flat|nested] 106+ messages in thread
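For readers who want to see what "too slow" is measuring, here is a minimal sketch in C. It is not the test program from the original posting; the helper names map_pair and time_alternating are made up for illustration, and it assumes a POSIX system with mmap, MAP_FIXED and tmpfile. It maps one page at two virtual addresses a given separation apart (or, for the baseline, two unrelated private pages at the same spacing) and times alternating accesses between them.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: place two pages `separation` bytes apart.
 * aliased != 0: both addresses map the SAME page of a temporary file.
 * aliased == 0: two independent anonymous pages (the baseline case).
 * Returns 0 on success, -1 on failure. */
static int map_pair(size_t separation, int aliased,
                    volatile int **a, volatile int **b)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && separation > 0 && separation % (size_t)page == 0);

    /* Reserve address space first so the MAP_FIXED mappings below
     * cannot clobber anything else in the process. */
    char *base = mmap(NULL, separation + (size_t)page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    FILE *f = aliased ? tmpfile() : NULL;
    if (aliased && (!f || ftruncate(fileno(f), page) != 0))
        return -1;

    int fd = aliased ? fileno(f) : -1;
    int flags = MAP_FIXED |
                (aliased ? MAP_SHARED : (MAP_PRIVATE | MAP_ANONYMOUS));
    void *p = mmap(base, (size_t)page, PROT_READ | PROT_WRITE, flags, fd, 0);
    void *q = mmap(base + separation, (size_t)page,
                   PROT_READ | PROT_WRITE, flags, fd, 0);
    if (f)
        fclose(f);              /* the mappings stay valid after close */
    if (p == MAP_FAILED || q == MAP_FAILED)
        return -1;

    *a = p;
    *b = q;
    return 0;
}

/* Alternate read-modify-write accesses between the two addresses and
 * return the elapsed wall-clock time in seconds. */
static double time_alternating(volatile int *a, volatile int *b, long rounds)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < rounds; i++) {
        (*a)++;
        (*b)++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

To use it, call map_pair twice with the same separation, once aliased and once not, run time_alternating on each pair, and compare the two times. Going by the description above, an Athlon would be expected to show a large ratio whenever the separation is not a multiple of 32k; what threshold the real test uses to print "too slow" is not something this sketch tries to reproduce.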
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 16:33 ` Jamie Lokier @ 2003-09-01 16:58 ` Larry McVoy 0 siblings, 0 replies; 106+ messages in thread From: Larry McVoy @ 2003-09-01 16:58 UTC (permalink / raw) To: Jamie Lokier; +Cc: Larry McVoy, linux-kernel On Mon, Sep 01, 2003 at 05:33:54PM +0100, Jamie Lokier wrote: > Your freebsds don't say what CPU they are, but let me guess.. > > freebsd isn't an AMD > freebsd3 and freebsd4 are both AMD K6, and freebsd3 is the faster Right you are on all points. freebsd: CPU: Unknown 80686 (400.91-MHz 686-class CPU) Origin = "GenuineIntel" Id = 0x660 Stepping=0 Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,<b16>,<b17>,MMX,<b24>> freebsd3 CPU: AMD-K6(tm) 3D processor (451.03-MHz 586-class CPU) Origin = "AuthenticAMD" Id = 0x58c Stepping=12 Features=0x8021bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,PGE,MMX> freebsd4 CPU: AMD-K6tm w/ multimedia extensions (233.87-MHz 586-class CPU) Origin = "AuthenticAMD" Id = 0x562 Stepping = 2 Features=0x8001bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,MMX> AMD Features=0x400<<b10>> -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 14:43 ` Larry McVoy 2003-09-01 16:33 ` Jamie Lokier @ 2003-09-02 20:29 ` Jamie Lokier 1 sibling, 0 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-02 20:29 UTC (permalink / raw) To: Larry McVoy, Larry McVoy, linux-kernel Larry McVoy wrote: > Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC, s390 > on Linux and hpux/parisc, {freebsd, netbsd, openbsd}/x86, sco/x86, > solaris/sparc, solaris/x86, irix/mips, osx/ppc, aix/ppc, tru64/alpha. It's interesting to see that all the free unixes, Solaris and SCO have no trouble mapping files. But AIX, HPUX and whatever environment you have on Windows XP couldn't even do the mmaps. Would you be able to try the aix/ppc, hpux/parisc and Windows XP (or any Windows) tests again, but this time try each of these: 1. Compile with -DHAVE_SHM_OPEN 2. Compile with -DHAVE_SYSV_SHM Thanks again, -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (9 preceding siblings ...) 2003-08-29 15:41 ` Larry McVoy @ 2003-08-29 15:47 ` Herbert Poetzl 2003-08-30 1:48 ` Stuart Longland 2003-08-29 16:27 ` Geert Uytterhoeven ` (11 subsequent siblings) 22 siblings, 1 reply; 106+ messages in thread From: Herbert Poetzl @ 2003-08-29 15:47 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel # gcc -o test test.c -O2 # ./test Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) # cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) MP 1800+ stepping : 2 cpu MHz : 1533.425 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 3060.53 processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1533.425 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 3060.53 ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 15:47 ` Herbert Poetzl @ 2003-08-30 1:48 ` Stuart Longland 0 siblings, 0 replies; 106+ messages in thread From: Stuart Longland @ 2003-08-30 1:48 UTC (permalink / raw) To: jamie; +Cc: linux-kernel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, I've thrown this at a Gateway Microserver (aka. Sun Cobalt Qube) which runs an r5k little endian MIPS. I'd also throw this at a Silicon Graphics Indy, but I don't feel energetic enough right now to go and drag the beast out. Also attached, is the results from my laptop (Toshiba Protege 7010CT) and web server (Generic Dual P-Pro). - -- +-------------------------------------------------------------+ | Stuart Longland stuartl at longlandclan.hopto.org | | Brisbane Mesh Node: 719 http://stuartl.cjb.net/ | | I haven't lost my mind - it's backed up on a tape somewhere | | Griffith Student No: Course: Bachelor/IT (Nathan) | +-------------------------------------------------------------+ - -------------------< From the qube >----------------------- Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) real 0m0.276s user 0m0.140s sys 0m0.120s system type : MIPS Cobalt processor : 0 cpu model : Nevada V10.0 FPU V10.0 BogoMIPS : 249.85 wait instruction : yes microsecond timers : yes tlb_entries : 48 extra interrupt vector : yes hardware watchpoint : no VCED exceptions : not available VCEI exceptions : not available - -------------------< From 
the qube >----------------------- - ------------------< From the laptop >---------------------- Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.195s user 0m0.142s sys 0m0.052s processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 5 model name : Pentium II (Deschutes) stepping : 2 cpu MHz : 300.026 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 591.87 - ------------------< From the laptop >---------------------- - ----------------< From the web server >-------------------- Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.279s user 0m0.210s sys 0m0.060s processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 1 model name : Pentium Pro stepping : 9 cpu MHz : 199.434 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr 
pge mca cmov bogomips : 398.13 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 1 model name : Pentium Pro stepping : 9 cpu MHz : 199.434 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov bogomips : 398.13 - ----------------< From the web server >-------------------- -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQE/UAKFIGJk7gLSDPcRAif8AJ9WKjTGIGYJdHgME/Fkac4cNZKUkACdHwA5 yHQlu/O96H4IUHKGflJncmI= =yAoq -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (10 preceding siblings ...) 2003-08-29 15:47 ` Herbert Poetzl @ 2003-08-29 16:27 ` Geert Uytterhoeven 2003-09-01 5:58 ` Jamie Lokier 2003-08-29 16:31 ` Brian Jackson ` (10 subsequent siblings) 22 siblings, 1 reply; 106+ messages in thread From: Geert Uytterhoeven @ 2003-08-29 16:27 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linux Kernel Development On Fri, 29 Aug 2003, Jamie Lokier wrote: > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. Are you also interested in m68k? ;-) cassandra:/tmp# time ./test Test separation: 4096 bytes: FAIL - store buffer not coherent Test separation: 8192 bytes: FAIL - store buffer not coherent Test separation: 16384 bytes: FAIL - store buffer not coherent Test separation: 32768 bytes: FAIL - store buffer not coherent Test separation: 65536 bytes: FAIL - store buffer not coherent Test separation: 131072 bytes: FAIL - store buffer not coherent Test separation: 262144 bytes: FAIL - store buffer not coherent Test separation: 524288 bytes: FAIL - store buffer not coherent Test separation: 1048576 bytes: FAIL - store buffer not coherent Test separation: 2097152 bytes: FAIL - store buffer not coherent Test separation: 4194304 bytes: FAIL - store buffer not coherent Test separation: 8388608 bytes: FAIL - store buffer not coherent Test separation: 16777216 bytes: FAIL - store buffer not coherent VM page alias coherency test: failed; will use copy buffers instead real 0m0.478s user 0m0.110s sys 0m0.190s cassandra:/tmp# cat /proc/cpuinfo CPU: 68040 MMU: 68040 FPU: 68040 Clocking: 24.8MHz BogoMips: 16.53 Calibration: 82688 loops cassandra:/tmp# callisto$ time ./test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test 
separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.329s user 0m0.270s sys 0m0.050s callisto$ cat /proc/cpuinfo cpu : 604r clock : 200MHz revision : 18.3 (pvr 0009 1203) bogomips : 398.13 machine : CHRP IBM,LongTrail-2 memory bank 0 : 32 MB SDRAM memory bank 1 : 32 MB SDRAM memory bank 2 : 32 MB SDRAM memory bank 3 : 32 MB SDRAM board l2 : 512 KB Pipelined Synchronous (Write-Through) callisto$ Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 16:27 ` Geert Uytterhoeven @ 2003-09-01 5:58 ` Jamie Lokier 2003-09-01 8:34 ` Geert Uytterhoeven 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 5:58 UTC (permalink / raw) To: Geert Uytterhoeven; +Cc: Linux Kernel Development Geert Uytterhoeven wrote: > Are you also interested in m68k? ;-) > > cassandra:/tmp# time ./test > Test separation: 4096 bytes: FAIL - store buffer not coherent Especially! I hadn't expected to see any machine that would print "store buffer not coherent". It means that if there's an L1 cache, it is coherent, but any store-then-load bypass in the CPU pipeline is using the virtual address with no rollback after MMU translation. I had thought it would only be the case with chips using an external MMU, but now that I think about it, the older simpler chips aren't going to bother with things like pipeline rollback wherever they can get away without it! (The other CPU that is reporting "store buffer not coherent" is PA-RISC, which is even more of an eye opener. That has a big 1MiB coherent L1 cache, and the pipeline bypass is coherent for very large separations but not others!) Thanks, -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
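The store-then-load case described above can be illustrated with a rough sketch in C. This is not the original test program and does not try to reproduce how it tells "cache not coherent" apart from "store buffer not coherent"; the function name count_stale_loads is invented here, and the code assumes a POSIX system with mmap, MAP_FIXED and tmpfile. It stores through one alias of a page and immediately loads through the other alias, counting how often the load misses the value just stored.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns how many of `rounds` store-then-load
 * pairs saw a stale value when the store and the load go through two
 * aliases of the same page, `separation` bytes apart. Returns -1 if
 * the aliases could not be set up. */
static long count_stale_loads(size_t separation, long rounds)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || separation == 0 || separation % (size_t)page != 0)
        return -1;

    FILE *f = tmpfile();
    if (!f)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, page) != 0) {
        fclose(f);
        return -1;
    }

    /* Reserve the whole range, then overlay the two shared mappings. */
    size_t span = separation + (size_t)page;
    char *base = mmap(NULL, span, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    void *p = mmap(base, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, 0);
    void *q = mmap(base + separation, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, 0);
    fclose(f);                  /* the mappings stay valid after close */
    if (p == MAP_FAILED || q == MAP_FAILED) {
        munmap(base, span);
        return -1;
    }

    volatile long *store_side = p;
    volatile long *load_side = q;
    long stale = 0;

    for (long i = 1; i <= rounds; i++) {
        *store_side = i;        /* store through the first alias */
        if (*load_side != i)    /* load at once through the second alias */
            stale++;
    }

    munmap(base, span);
    return stale;
}
```

On hardware where aliases are fully coherent the count should be zero for every separation. A non-zero count only says that some load returned old data; whether that comes from the cache or from a store bypass keyed on the virtual address is a finer distinction that this sketch leaves to the real test.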
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 5:58 ` Jamie Lokier @ 2003-09-01 8:34 ` Geert Uytterhoeven 2003-09-01 9:09 ` Kars de Jong ` (2 more replies) 0 siblings, 3 replies; 106+ messages in thread From: Geert Uytterhoeven @ 2003-09-01 8:34 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linux/m68k, Linux Kernel Development On Mon, 1 Sep 2003, Jamie Lokier wrote: > Geert Uytterhoeven wrote: > > Are you also interested in m68k? ;-) > > > > cassandra:/tmp# time ./test > > Test separation: 4096 bytes: FAIL - store buffer not coherent > > Especially! I hadn't expected to see any machine that would print > "store buffer not coherent". It means that if there's an L1 cache, it > is coherent, but any store-then-load bypass in the CPU pipeline is > using the virtual address with no rollback after MMU translation. > > I had thought it would only be the case with chips using an external > MMU, but now that I think about it, the older simpler chips aren't > going to bother with things like pipeline rollback wherever they can > get away without it! As you probably know the 68020 had an external MMU (68851, or Sun-3 or Apollo MMU). Probably Motorola didn't bother to change the behavior when the MMU got integrated in later generations (68030 and up). BTW, probably you want us to run your test program on other m68k boxes? Mine got a 68040, that leaves us with: - 68020+68851 - 68020+Sun-3 MMU - 68030 - 68060 For linux-m68k: You can find the test program source in Jamie's original posting on lkml. For your convenience, I put a binary for m68k at http://home.tvd.be/cr26864/Linux/m68k/jamie_test.gz. Just tell us the program's output and give us a copy of your /proc/cpuinfo. Thanks! Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 8:34 ` Geert Uytterhoeven @ 2003-09-01 9:09 ` Kars de Jong 2003-09-01 10:08 ` Jamie Lokier 2003-09-01 10:35 ` Sam Creasey 2003-09-03 8:00 ` Kars de Jong 2 siblings, 1 reply; 106+ messages in thread From: Kars de Jong @ 2003-09-01 9:09 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote: > BTW, probably you want us to run your test program on other m68k boxes? Mine > got a 68040, that leaves us with: > - 68020+68551 > - 68060 I can run it on these boxes if no-one else has done it yet before I come home tonight. I'm sure there are more people with a 68060 out there, not too sure about the 68020+68851. Regards, Kars. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 9:09 ` Kars de Jong @ 2003-09-01 10:08 ` Jamie Lokier 2003-09-01 11:13 ` Roman Zippel 2003-09-02 20:42 ` Kars de Jong 0 siblings, 2 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 10:08 UTC (permalink / raw) To: Kars de Jong Cc: Geert Uytterhoeven, Linux/m68k kernel mailing list, Linux Kernel Development Kars de Jong wrote: > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote: > > BTW, probably you want us to run your test program on other m68k boxes? Mine > > got a 68040, that leaves us with: > > - 68020+68551 > > - 68060 > > I can run it on these boxes if no-one else has done it yet before I come > home tonight. I'm sure there are more people with a 68060 out there, not > too sure about the 68020+68851. I would prefer that you run the attached program. It fixes a bug in the function which tests whether the problem is in the L1 cache or store buffer. The bug probably didn't affect the test, but it might have. Ideally you could run the program Geert linked to as well? Please remember to compile both with optimisation. Thanks, -- Jamie /* This code maps shared memory to multiple addresses and tests it for cache coherency and performance. Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #include <assert.h> #include <stdlib.h> #include <string.h> #include <limits.h> #include <errno.h> #include <fcntl.h> #include <unistd.h> #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/signal.h> #include <sys/mman.h> #include <sys/time.h> #if HAVE_SYSV_SHM #include <sys/ipc.h> #include <sys/shm.h> #endif //#include "pagealias.h" /* Helpers to temporarily block all signals. These are used for when a race condition might leave a temporary file that should have been deleted -- we do our best to prevent this possibility. */ static void block_signals (sigset_t * save_state) { sigset_t all_signals; sigfillset (&all_signals); sigprocmask (SIG_BLOCK, &all_signals, save_state); } static void unblock_signals (sigset_t * restore_state) { sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0); } /* Open a new shared memory file, either using the POSIX.4 `shm_open' function, or using a regular temporary file in /tmp. Immediately after opening the file, it is unlinked from the global namespace using `shm_unlink' or `unlink'. On success, the value returned is a file descriptor. Otherwise, -1 is returned and `errno' is set. The descriptor can be closed using simply `close'. */ /* Note: `shm_open' requires link argument `-lposix4' on Suns. On GNU/Linux with Glibc, it requires `-lrt'. Unfortunately, Glibc's -lrt insists on linking to pthreads, which we may not want to use because that enables thread locking overhead in other functions. So we implement a direct method of opening shm on Linux. */ /* If this is changed, change the size of `buffer' below too. */ #if HAVE_SHM_OPEN #define SHM_DIR_PREFIX "/" /* `shm_open' arg needs "/" for portability. 
*/ #elif defined (__linux__) #include <sys/statfs.h> #define SHM_DIR_PREFIX "/dev/shm/" #else #undef SHM_DIR_PREFIX #endif static int open_shared_memory_file (int use_tmp_file) { char * ptr, buffer [19]; int fd, i; unsigned long number; sigset_t save_signals; struct timeval tv; #if !HAVE_SHM_OPEN && defined (__linux__) struct statfs sfs; if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0 || sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */)) { errno = ENOSYS; return -1; } #endif loop: /* Print a randomised path name into `buffer'. The string depends on the directory and whether we are using POSIX.4 shared memory or a regular temporary file. RANDOM is a 5-digit, base-62 representation of a pseudo-random number. The string is used as a candidate in the search for an unused shared segment or file name. */ #ifdef SHM_DIR_PREFIX strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-"); #else strcpy (buffer, "/tmp/shm-"); #endif ptr = buffer + strlen (buffer); gettimeofday (&tv, (struct timezone *) 0); number = (unsigned long) random (); number += (unsigned long) getpid (); number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec; for (i = 0; i < 5; i++) { /* Don't use character arithmetic, as not all systems are ASCII. */ *ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62]; number /= 62; } *ptr = '\0'; /* Block signals between the open and unlink, to really minimise the chance of accidentally leaving an unwanted file around. */ block_signals (&save_signals); #if HAVE_SHM_OPEN if (!use_tmp_file) { fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600); if (fd != -1) shm_unlink (buffer); } else #endif /* HAVE_SHM_OPEN */ { fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600); if (fd != -1) unlink (buffer); } unblock_signals (&save_signals); /* If we failed due to a name collision or a signal, try again. 
*/ if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR)) goto loop; return fd; } /* Allocate a region of address space `size' bytes long, so that the region will not be allocated for any other purpose. It is freed with `munmap'. Returns the mapped base address on success. Otherwise, MAP_FAILED is returned and `errno' is set. */ static size_t system_page_size; #if !defined (MAP_ANONYMOUS) && defined (MAP_ANON) #define MAP_ANONYMOUS MAP_ANON #endif #ifndef MAP_NORESERVE #define MAP_NORESERVE 0 #endif #ifndef MAP_FILE #define MAP_FILE 0 #endif #ifndef MAP_VARIABLE #define MAP_VARIABLE 0 #endif #ifndef MAP_FAILED #define MAP_FAILED ((void *) -1) #endif #ifndef PROT_NONE #define PROT_NONE PROT_READ #endif static void * map_address_space (void * optional_address, size_t size, int access) { void * addr; #ifdef MAP_ANONYMOUS addr = mmap (optional_address, size, access ? (PROT_READ | PROT_WRITE) : PROT_NONE, (MAP_PRIVATE | MAP_ANONYMOUS | (optional_address ? MAP_FIXED : MAP_VARIABLE) | (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0); #else /* not defined MAP_ANONYMOUS */ int save_errno, zero_fd = open ("/dev/zero", O_RDONLY); if (zero_fd == -1) return MAP_FAILED; addr = mmap (optional_address, size, access ? (PROT_READ | PROT_WRITE) : PROT_NONE, (MAP_PRIVATE | MAP_FILE | (optional_address ? MAP_FIXED : MAP_VARIABLE) | (access ? 0 : MAP_NORESERVE)), zero_fd, (off_t) 0); save_errno = errno; close (zero_fd); errno = save_errno; #endif /* not defined MAP_ANONYMOUS */ return addr; } /* Set up a page alias mapping using mmap() on POSIX shared memory or on a temporary regular file. Returns the mapped base address on success. Otherwise, 0 is returned and `errno' is set. */ static void * page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file) { void * base_addr, * addr; int fd, i, save_errno; struct stat st; fd = open_shared_memory_file (use_tmp_file); if (fd == -1) goto fail; /* First, resize the shared memory file to the desired size.
*/ if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size) goto close_fail; /* Map an anonymous region `separation + size' bytes long. This is how we allocate sufficient contiguous address space. We over-map this with the aliased buffer. */ if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED) goto close_fail; /* Map the same shared memory repeatedly, at different addresses. */ for (i = 0; i < 2; i++) { addr = mmap ((char *) base_addr + (i ? separation : 0), size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED, fd, (off_t) 0); if (addr == MAP_FAILED) goto unmap_fail; if (addr != (char *) base_addr + (i ? separation : 0)) { /* `mmap' ignored MAP_FIXED! Should never happen. */ munmap (addr, size); save_errno = EINVAL; goto unmap_fail_se; } } if (close (fd) != 0) goto unmap_fail; /* Success! */ return base_addr; /* Failure. */ unmap_fail: save_errno = errno; unmap_fail_se: munmap (base_addr, separation + size); errno = save_errno; close_fail: save_errno = errno; close (fd); errno = save_errno; fail: return 0; } /* Set up a page alias mapping using SYSV IPC shared memory. Returns the mapped base address on success. Otherwise, 0 is returned and `errno' is set. */ #if HAVE_SYSV_SHM static void * page_alias_using_sysv_shm (size_t size, size_t separation) { void * base_addr, * addr; sigset_t save_signals; int shmid, i, save_errno; /* Map an anonymous region `separation + size' bytes long. This is how we allocate sufficient contiguous address space. We over-map this with the aliased buffer. */ if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED) goto fail; /* Block signals between the shmget() and IPC_RMID, to minimise the chance of accidentally leaving an unwanted shared segment around. */ block_signals (&save_signals); shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600); if (shmid == -1) goto unmap_fail; /* Map the same shared memory repeatedly, at different addresses. 
*/ for (i = 0; i < 2; i++) { /* `shmat' is tried twice. The first time it can fail if the local implementation of `shmat' refuses to map over a region mapped with `mmap'. In that case, we punch a hole using `munmap' and do it again. If the local `shmat' has this property, the `shmat' calls to fixed addresses might collide with a concurrent thread which is also doing mappings, and will fail. At least it is a safe failure. On the other hand, if the local `shmat' can map over already-mapped regions (in the same way that `mmap' does), it is essential that we do actually use an already-mapped region, so that collisions with a concurrent thread can't possibly result in both of us grabbing the same address range with no indication of error. */ addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0); if (addr == (void *) -1 && errno == EINVAL) { munmap ((char *) base_addr + (i ? separation : 0), size); addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0); } /* Check for errors. */ if (addr == (void *) -1) { save_errno = errno; if (i == 1) shmdt (base_addr); goto remove_shm_fail_se; } else if (addr != (char *) base_addr + (i ? separation : 0)) { /* `shmat' ignored the requested address! */ if (i == 1) shmdt (base_addr); save_errno = EINVAL; goto remove_shm_fail_se; } } if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0) goto remove_shm_fail; unblock_signals (&save_signals); /* Success! */ return base_addr; /* Failure. */ remove_shm_fail: save_errno = errno; remove_shm_fail_se: while (--i >= 0) shmdt ((char *) base_addr + (i ? separation : 0)); shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0); errno = save_errno; unmap_fail: save_errno = errno; unblock_signals (&save_signals); munmap (base_addr, separation + size); errno = save_errno; fail: return 0; } #endif /* HAVE_SYSV_SHM */ /* Map a page-aliased ring buffer.
Shared memory of size `size' is mapped twice, with the difference between the two addresses being `separation', which must be at least `size'. The total address range used is `separation + size' bytes long. On success, *METHOD is filled with a number which must be passed to `page_alias_unmap', and the mapped base address is returned. Otherwise, 0 is returned and `errno' is set. */ static void * __page_alias_map (size_t size, size_t separation, int * method) { void * addr; if (((size | separation) & (system_page_size - 1)) != 0 || size > separation) { errno = EINVAL; return 0; } /* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */ #ifdef SHM_DIR_PREFIX *method = 0; if ((addr = page_alias_using_mmap (size, separation, 0)) != 0) return addr; #endif #if HAVE_SYSV_SHM *method = 1; if ((addr = page_alias_using_sysv_shm (size, separation)) != 0) return addr; #endif *method = 2; return page_alias_using_mmap (size, separation, 1); } /* Unmap a page-aliased ring buffer previously allocated by `page_alias_map'. `address' is the base address, and `size' and `separation' are the arguments previously passed to `__page_alias_map'. `method' is the value previously stored in *METHOD. Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */ static int __page_alias_unmap (void * address, size_t size, size_t separation, int method) { #if HAVE_SYSV_SHM if (method == 1) { shmdt (address); shmdt ((char *) address + separation); if (separation > size) munmap ((char *) address + size, separation - size); return 0; } #endif return munmap (address, separation + size); } /* Map a page-aliased ring buffer. `size' is the size of the buffer to create; it will be mapped twice to cover a total address range `size * 2' bytes long. On success, *METHOD is filled with a number which must be passed to `page_alias_unmap', and the mapped base address is returned. Otherwise, 0 is returned and `errno' is set.
*/ void * page_alias_map (size_t size, int * method) { return __page_alias_map (size, size, method); } /* Unmap a page-aliased ring buffer previously allocated by `page_alias_map'. `address' is the base address, and `size' is the size of the buffer (which is half of the total mapped address range). `method' is a value previously stored in *METHOD by `page_alias_map'. Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */ int page_alias_unmap (void * address, size_t size, int method) { return __page_alias_unmap (address, size, size, method); } /* Map some memory which is not aliased, for timing comparisons against aliased pages. We use a combination of mappings similar to page_alias_*(), in case there are resource limitations which would prevent malloc() or a single mmap() working for the larger address range tests. */ static void * page_no_alias (size_t size, size_t separation) { void * base_addr, * addr; int i, save_errno; if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED) goto fail; /* Map anonymous memory at the different addresses. */ for (i = 0; i < 2; i++) { addr = map_address_space ((char *) base_addr + (i ? separation : 0), size, 1); if (addr == MAP_FAILED) goto unmap_fail; if (addr != (char *) base_addr + (i ? separation : 0)) { /* `mmap' ignored MAP_FIXED! Should never happen. */ munmap (addr, size); save_errno = EINVAL; goto unmap_fail_se; } } /* Success! */ return base_addr; /* Failure. */ unmap_fail: save_errno = errno; unmap_fail_se: munmap (base_addr, separation + size); errno = save_errno; fail: return 0; } /* This should be a word size that the architecture can read and write fast in a single instruction. In principle, C's `int' is the natural word size, but in practice it isn't on 64-bit machines. */ #define WORD long /* These GCC-specific asm statements force values into registers, and also act as compiler memory barriers. 
These are used to force a group of write/write/read instructions as close together as possible, to maximise the detection of store buffer conditions. Despite being asm statements, these will work with any of GCC's target architectures, provided they have >= 4 registers. */ #if __GNUC__ >= 3 #define __noinline __attribute__ ((__noinline__)) #else #define __noinline #endif #ifdef __GNUC__ #define force_into_register(var) \ __asm__ ("" : "=r" (var) : "0" (var) : "memory") #define force_into_registers(var1, var2, var3, var4) \ __asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \ : "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory") #else #define force_into_register(var) do {} while (0) #define force_into_registers(var1, var2, var3, var4) do {} while (0) #endif /* This function tries to test whether a CPU snoops its store buffer for reads within a few instructions, and ignores virtual to physical address translations when doing that. In principle a CPU might do this even if its L1 cache is physically tagged or indexed, although I have not seen such a system. (A CPU which uses store buffer snooping and with an off-board MMU, which the CPU is unaware of, could have this property). It isn't possible to do this test perfectly; we do our best. The `force_into_register' macros ensure that the write/write/read sequence is as compact as the compiler can make it. */ static WORD __noinline test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2) { register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2; register WORD __reg1 = 1, __reg2 = 0; force_into_registers (__reg1, __reg2, __regptr1, __regptr2); *__regptr1 = __reg1; *__regptr2 = __reg2; __reg1 = *__regptr1; force_into_register (__reg1); return __reg1; } /* This function tests whether writes to one page are seen in another page at a different virtual address, and whether they are nearly as fast as normal writes. The accesses are timed by the caller of this function.
Alternate writes go to alternate pages, so that if aliasing is implemented using page faults, it will clearly show up in the timings. */ static int __noinline test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops) { WORD fail = 0; while (--timing_loops >= 0) fail |= test_store_buffer_snoop (ptr1, ptr2); return fail != 0; } /* This function tests L1 cache coherency without checking for store buffer snoop coherency. To do this, we add enough stores that the writes to *PTR1 are flushed (or drain due to the time delay) from the store buffer before we read from *PTR1. The result of this function is not important: it is only used in a diagnostic message. */ static int __noinline test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2) { int i, j; WORD fail = 0; for (i = 0; i < 10; i++) { *ptr1 = 1; /* This loop of volatile writes creates a short time delay. The delay gives the store to *PTR1 time to flush from the store buffer and/or the many writes flush the store buffer. The loop writes to *PTR2 because if we pick another fixed address and write to it, that would be testing 3 cache lines (PTR1, PTR2 and the fixed address) and the fixed address _might_ happen to collide with PTR1 or PTR2 in the L1 cache. If the L1 cache is 2-way set-associative, that would flush it every time, possibly making it appear coherent when it isn't. */ for (j = 0; j < 1000; j++) *ptr2 = 0; fail |= *ptr1; } return fail != 0; } /* Thoroughly test a pair of aliased pages with a fixed address separation, to see if they really behave like memory appearing at two locations, and efficiently. We search through different values of `separation' searching for a suitable "cache colour" on this machine. */ static inline const char * test_one_separation (size_t separation) { void * buffers [2]; long timings [3]; int i, method, timing_loops = 64; /* We measure timings of 3 different tests, each 128 times to find the minimum. 0: Writes and reads to aliased pages. 
1: Writes and reads to non-aliased pages, to compare with 0. 2: Doing nothing, to measure the time for `gettimeofday' itself. The measurements are done in a mixed up order. If we did 128 measurements of type 0, then 128 of type 1, then 128 of type 2, the results could be misleading due to synchronisation with other processes running on the machine. */ /* A previously generated random shuffle of bit-pairs. Each pair is a number from the set {0,1,2}. Each number occurs exactly 128 times. */ static const unsigned char pattern [96] = { 0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56, 0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49, 0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99, 0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25, 0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19, 0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15, 0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89, 0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85, }; buffers [0] = __page_alias_map (system_page_size, separation, &method); if (buffers [0] == 0) return "alias map failed"; buffers [1] = page_no_alias (system_page_size, separation); if (buffers [1] == 0) { __page_alias_unmap (buffers [0], system_page_size, separation, method); return "non-alias map failed"; } retry: timings [2] = timings [1] = timings [0] = LONG_MAX; for (i = 0; i < 384; i++) { struct timeval time_before, time_after; long time_delta; int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3; volatile WORD * ptr1 = (volatile WORD *) buffers [which_test]; volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation); /* Test whether writes to one page appear immediately in the other, and time how long the memory accesses take.
*/ gettimeofday (&time_before, (struct timezone *) 0); if (which_test < 2) fail = test_page_alias (ptr1, ptr2, timing_loops); gettimeofday (&time_after, (struct timezone *) 0); if (fail && which_test == 0) { /* Test whether the failure is due to a store buffer bypass which ignores virtual address translation. */ int l1_fail = test_l1_only (ptr1, ptr2); __page_alias_unmap (buffers [0], system_page_size, separation, method); munmap (buffers [1], separation + system_page_size); return l1_fail ? "cache not coherent" : "store buffer not coherent"; } time_delta = ((time_after.tv_usec - time_before.tv_usec) + 1000000 * (time_after.tv_sec - time_before.tv_sec)); /* Find the smallest time taken for each test. Ignore negative glitches due to Linux' tendency to jump the clock backwards. */ if (time_delta >= 0 && time_delta < timings [which_test]) timings [which_test] = time_delta; } /* Remove the cost of `gettimeofday()' itself from measurements. */ timings [0] -= timings [2]; timings [1] -= timings [2]; /* Keep looping until at least one measurement becomes significant. A very fast CPU will show measurements of zero microseconds for smaller values of `timing_loops'. Also loop until the cost of `gettimeofday()' becomes insignificant. When the program is run under `strace', the latter is large and this is needed to stabilise the results. */ if (timings [0] <= 10 * (1 + timings [2]) && timings [1] <= 10 * (1 + timings [2])) { timing_loops <<= 1; goto retry; } __page_alias_unmap (buffers [0], system_page_size, separation, method); munmap (buffers [1], separation + system_page_size); printf ("(%d) [%ld,%ld,%ld] ", timing_loops, timings [0], timings [1], timings [2]); /* Reject page aliasing if it is much slower than accessing a single, definitely cached page directly. */ if (timings [0] > 2 * timings [1]) return "too slow"; /* Success! Passed all tests for these parameters.
*/ return 0; } size_t page_alias_smallest_size; void page_alias_init (void) { size_t size; #ifdef _SC_PAGESIZE system_page_size = sysconf (_SC_PAGESIZE); #elif defined (_SC_PAGE_SIZE) system_page_size = sysconf (_SC_PAGE_SIZE); #else system_page_size = getpagesize (); #endif for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2) { const char * reason = test_one_separation (size); printf ("Test separation: %lu bytes: %s%s\n", (unsigned long) size, reason ? "FAIL - " : "pass", reason ? reason : ""); /* This logic searches for the smallest _contiguous_ range of page sizes for which `page_alias_test' passes. */ if (reason == 0 && page_alias_smallest_size == 0) page_alias_smallest_size = size; else if (reason != 0 && page_alias_smallest_size != 0) { /* Fail, indicating that page-aliasing is not reliable, because there's a maximum size. We don't support that as it seems quite unlikely given our model of cache colouring. */ page_alias_smallest_size = 0; break; } } printf ("VM page alias coherency test: "); if (page_alias_smallest_size == 0) printf ("failed; will use copy buffers instead\n"); else if (page_alias_smallest_size == system_page_size) printf ("all sizes passed\n"); else printf ("minimum fast spacing: %lu (%lu page%s)\n", (unsigned long) page_alias_smallest_size, (unsigned long) (page_alias_smallest_size / system_page_size), (page_alias_smallest_size == system_page_size) ? "" : "s"); } //#ifdef TEST_PAGEALIAS int main () { page_alias_init (); return 0; } //#endif ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 10:08 ` Jamie Lokier @ 2003-09-01 11:13 ` Roman Zippel 2003-09-02 20:42 ` Kars de Jong 1 sibling, 0 replies; 106+ messages in thread From: Roman Zippel @ 2003-09-01 11:13 UTC (permalink / raw) To: Jamie Lokier Cc: Kars de Jong, Geert Uytterhoeven, Linux/m68k kernel mailing list, Linux Kernel Development Hi, On Mon, 1 Sep 2003, Jamie Lokier wrote: > I would prefer that you run the attached program. It fixes a bug in > the function which tests whether the problem is in the L1 cache or > store buffer. The bug probably didn't affect the test, but it might > have. This is the result for a 060: $ ./a.out (256) [175,175,11] Test separation: 4096 bytes: pass (256) [173,175,11] Test separation: 8192 bytes: pass (256) [176,175,10] Test separation: 16384 bytes: pass (256) [174,173,11] Test separation: 32768 bytes: pass (256) [174,175,11] Test separation: 65536 bytes: pass (256) [175,175,10] Test separation: 131072 bytes: pass (256) [176,176,10] Test separation: 262144 bytes: pass (256) [175,175,11] Test separation: 524288 bytes: pass (256) [173,175,11] Test separation: 1048576 bytes: pass (256) [174,174,11] Test separation: 2097152 bytes: pass (256) [176,176,10] Test separation: 4194304 bytes: pass (256) [177,177,9] Test separation: 8388608 bytes: pass (256) [175,176,10] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed $ cat /proc/cpuinfo CPU: 68060 MMU: 68060 FPU: 68060 Clocking: 49.7MHz BogoMips: 99.53 Calibration: 497664 loops bye, Roman ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 10:08 ` Jamie Lokier 2003-09-01 11:13 ` Roman Zippel @ 2003-09-02 20:42 ` Kars de Jong 2003-09-02 21:39 ` Jamie Lokier 2003-09-03 7:59 ` Geert Uytterhoeven 1 sibling, 2 replies; 106+ messages in thread From: Kars de Jong @ 2003-09-02 20:42 UTC (permalink / raw) To: Jamie Lokier Cc: Geert Uytterhoeven, Linux/m68k kernel mailing list, Linux Kernel Development On Mon, 2003-09-01 at 12:08, Jamie Lokier wrote: > Kars de Jong wrote: > > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote: > > > BTW, probably you want us to run your test program on other m68k boxes? Mine > > > got a 68040, that leaves us with: > > > - 68020+68551 > > > - 68060 > > > > I can run it on these boxes if no-one else has done it yet before I come > > home tonight. I'm sure there are more people with a 68060 out there, not > > too sure about the 68020+68851. > > I would prefer that you run the attached program. It fixes a bug in > the function which tests whether the problem is in the L1 cache or > store buffer. The bug probably didn't affect the test, but it might > have. > > Ideally you could run the program Geert linked to as well? > Please remember to compile both with optimisation. OK, here are my results (I'll skip the 68060 because Roman has already run the program on that one): This is on a Plessey PME 68-22. It's sooooo fast... Sam, is there a Sun slower than this? 
Original program: fikkie:/tmp# ./jamie_test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed New program: fikkie:/tmp# time ./jamie_test2 (2048) [10000,10000,0] Test separation: 4096 bytes: pass (2048) [10000,10000,0] Test separation: 8192 bytes: pass (2048) [10000,10000,0] Test separation: 16384 bytes: pass (2048) [10000,10000,0] Test separation: 32768 bytes: pass (2048) [10000,10000,0] Test separation: 65536 bytes: pass (2048) [10000,10000,0] Test separation: 131072 bytes: pass (2048) [10000,10000,0] Test separation: 262144 bytes: pass (2048) [10000,10000,0] Test separation: 524288 bytes: pass (2048) [10000,10000,0] Test separation: 1048576 bytes: pass (2048) [10000,10000,0] Test separation: 2097152 bytes: pass (2048) [10000,10000,0] Test separation: 4194304 bytes: pass (2048) [10000,10000,0] Test separation: 8388608 bytes: pass (2048) [10000,10000,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 1m51.210s user 1m44.950s sys 0m4.930s fikkie:/tmp# cat /proc/cpuinfo CPU: 68020 MMU: 68851 FPU: 68881 Clocking: 15.6MHz BogoMips: 3.90 Calibration: 19520 loops fikkie:/tmp# And no, this board has no way of getting a better time resolution than the 100 Hz tick timer either ;-) Regards, Kars. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 20:42 ` Kars de Jong @ 2003-09-02 21:39 ` Jamie Lokier 2003-09-03 7:59 ` Geert Uytterhoeven 1 sibling, 0 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-02 21:39 UTC (permalink / raw) To: Kars de Jong Cc: Geert Uytterhoeven, Linux/m68k kernel mailing list, Linux Kernel Development Kars de Jong wrote: > And no, this board has no way of getting a better time resolution than > the 100 Hz tick timer either ;-) The coherency test is fine. That's just logic. The clock granularity got me wondering whether the timing measurement is meaningful on these machines. It's possible for the shared test to take 2000 microseconds and the unshared test to take 10 microseconds, and they can still show up as 10ms if they both cross a clock tick boundary. The minimum of 128 tests of each type is likely to report 0 until timing_loops is large enough to make all 128 consistently almost 10ms, according to the timing when each test starts. Then as we only care if there is an approximately 2:1 ratio or more, it is fine. That depends on the timing of each test not being synchronised with the clock ticks, or when they are, that not affecting the result. I'm not sure, but I have a feeling that the random shuffle makes it ok. Hmm. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
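The granularity argument in the message above can be illustrated with a small model. This is only a sketch under stated assumptions: a clock that advances in whole 10 ms ticks, and hypothetical helper names (coarse_clock, measured_us, min_measured_us) that do not come from Jamie's test program. The start offsets are arbitrary and stand in for tests that are not synchronised with the tick.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a clock that only advances every tick_us microseconds. */
static uint64_t coarse_clock(uint64_t true_us, uint64_t tick_us)
{
    return (true_us / tick_us) * tick_us;
}

/* Elapsed time as seen through the coarse clock for a test that really
 * starts at start_us and really runs for duration_us. */
static uint64_t measured_us(uint64_t start_us, uint64_t duration_us,
                            uint64_t tick_us)
{
    return coarse_clock(start_us + duration_us, tick_us)
         - coarse_clock(start_us, tick_us);
}

/* Minimum over several runs whose start times are spread out arbitrarily
 * (an assumption standing in for "not synchronised with the clock ticks"). */
static uint64_t min_measured_us(uint64_t duration_us, uint64_t tick_us,
                                unsigned runs)
{
    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < runs; i++) {
        uint64_t start = (uint64_t)i * 997u;   /* arbitrary start offset */
        uint64_t m = measured_us(start, duration_us, tick_us);
        if (m < best)
            best = m;
    }
    return best;
}
```

In this model a 10 us run and a 2000 us run both read as 10000 us when they straddle a tick, and the minimum over many runs stays at 0 until a single run lasts a full tick, which is consistent with the [10000,10000,0] figures in the reports.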
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 20:42 ` Kars de Jong 2003-09-02 21:39 ` Jamie Lokier @ 2003-09-03 7:59 ` Geert Uytterhoeven 2003-09-03 9:13 ` Jamie Lokier 2003-09-03 12:13 ` Jan-Benedict Glaw 1 sibling, 2 replies; 106+ messages in thread From: Geert Uytterhoeven @ 2003-09-03 7:59 UTC (permalink / raw) To: Kars de Jong Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development On 2 Sep 2003, Kars de Jong wrote: > fikkie:/tmp# ./jamie_test > Test separation: 4096 bytes: pass > Test separation: 8192 bytes: pass > Test separation: 16384 bytes: pass > Test separation: 32768 bytes: pass > Test separation: 65536 bytes: pass > Test separation: 131072 bytes: pass > Test separation: 262144 bytes: pass > Test separation: 524288 bytes: pass > Test separation: 1048576 bytes: pass > Test separation: 2097152 bytes: pass > Test separation: 4194304 bytes: pass > Test separation: 8388608 bytes: pass > Test separation: 16777216 bytes: pass > VM page alias coherency test: all sizes passed > > New program: > > fikkie:/tmp# time ./jamie_test2 > (2048) [10000,10000,0] Test separation: 4096 bytes: pass > (2048) [10000,10000,0] Test separation: 8192 bytes: pass > (2048) [10000,10000,0] Test separation: 16384 bytes: pass > (2048) [10000,10000,0] Test separation: 32768 bytes: pass > (2048) [10000,10000,0] Test separation: 65536 bytes: pass > (2048) [10000,10000,0] Test separation: 131072 bytes: pass > (2048) [10000,10000,0] Test separation: 262144 bytes: pass > (2048) [10000,10000,0] Test separation: 524288 bytes: pass > (2048) [10000,10000,0] Test separation: 1048576 bytes: pass > (2048) [10000,10000,0] Test separation: 2097152 bytes: pass > (2048) [10000,10000,0] Test separation: 4194304 bytes: pass > (2048) [10000,10000,0] Test separation: 8388608 bytes: pass > (2048) [10000,10000,0] Test separation: 16777216 bytes: pass > VM page alias coherency test: all sizes passed > > real 1m51.210s > user 1m44.950s > sys 0m4.930s > 
fikkie:/tmp# cat /proc/cpuinfo > CPU: 68020 > MMU: 68851 > FPU: 68881 > Clocking: 15.6MHz > BogoMips: 3.90 > Calibration: 19520 loops > fikkie:/tmp# So the store buffer is coherent on 68020 with external MMU, while it isn't on 68040 with internal MMU... Now all that's left is the 68030. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 7:59 ` Geert Uytterhoeven @ 2003-09-03 9:13 ` Jamie Lokier 2003-09-03 9:26 ` Geert Uytterhoeven 2003-09-03 12:13 ` Jan-Benedict Glaw 1 sibling, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-03 9:13 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Geert Uytterhoeven wrote: > So the store buffer is coherent on 68020 with external MMU, while it > isn't on 68040 with internal MMU... Does the 68020 even _have_ the equivalent of a store buffer? -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 9:13 ` Jamie Lokier @ 2003-09-03 9:26 ` Geert Uytterhoeven 2003-09-03 12:17 ` Roman Zippel 0 siblings, 1 reply; 106+ messages in thread From: Geert Uytterhoeven @ 2003-09-03 9:26 UTC (permalink / raw) To: Jamie Lokier Cc: Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development On Wed, 3 Sep 2003, Jamie Lokier wrote: > Geert Uytterhoeven wrote: > > So the store buffer is coherent on 68020 with external MMU, while it > > isn't on 68040 with internal MMU... > > Does the 68020 even _have_ the equivalent of a store buffer? Good question :-) After I sent the previous mail, I realized the '030 has 256 bytes I cache and 256 bytes D cache, while the '020 has 256 bytes I cache only. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 9:26 ` Geert Uytterhoeven @ 2003-09-03 12:17 ` Roman Zippel 2003-09-03 12:36 ` Geert Uytterhoeven 0 siblings, 1 reply; 106+ messages in thread From: Roman Zippel @ 2003-09-03 12:17 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Jamie Lokier, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Hi, On Wed, 3 Sep 2003, Geert Uytterhoeven wrote: > > Does the 68020 even _have_ the equivalent of a store buffer? > > Good question :-) > > After I sent the previous mail, I realized the '030 has 256 bytes I cache and > 256 bytes D cache, while the '020 has 256 bytes I cache only. BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060 caches are PIPT. bye, Roman ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 12:17 ` Roman Zippel @ 2003-09-03 12:36 ` Geert Uytterhoeven 2003-09-03 13:29 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Geert Uytterhoeven @ 2003-09-03 12:36 UTC (permalink / raw) To: Roman Zippel Cc: Jamie Lokier, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development On Wed, 3 Sep 2003, Roman Zippel wrote: > On Wed, 3 Sep 2003, Geert Uytterhoeven wrote: > > > Does the 68020 even _have_ the equivalent of a store buffer? > > > > Good question :-) > > > > After I sent the previous mail, I realized the '030 has 256 bytes I cache and > > 256 bytes D cache, while the '020 has 256 bytes I cache only. > > BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060 > caches are PIPT. That explains a bit. But the '060 stores are coherent, while the '040 stores aren't. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 12:36 ` Geert Uytterhoeven @ 2003-09-03 13:29 ` Jamie Lokier 2003-09-03 16:07 ` Nagendra Singh Tomar 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-03 13:29 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Geert Uytterhoeven wrote: > > BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060 > > caches are PIPT. > > That explains a bit. But the '060 stores are coherent, while the '040 stores > aren't. The L1 cache is coherent on the '040 according to the results. It's the store buffer snooping which fails. Presumably the CPU core is looking ahead at recent writes comparing just virtual addresses. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
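For readers following along, the store-then-load pattern being discussed can be sketched roughly as below. This is not Jamie's program (which also times the accesses and varies the separation between mappings); it is a minimal illustration that assumes a POSIX system where tmpfile() and mmap() with MAP_SHARED are available, and the function name alias_store_visible is made up for this note.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the same file page at two virtual addresses, store through one
 * mapping and immediately load through the other.
 * Returns 1 if every store was seen, 0 if a stale value was read,
 * -1 if the mappings could not be set up. */
static int alias_store_visible(void)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    if (f == NULL || page <= 0)
        return -1;

    int fd = fileno(f);
    if (ftruncate(fd, page) != 0) {
        fclose(f);
        return -1;
    }

    void *pa = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *pb = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pa == MAP_FAILED || pb == MAP_FAILED) {
        if (pa != MAP_FAILED) munmap(pa, (size_t)page);
        if (pb != MAP_FAILED) munmap(pb, (size_t)page);
        fclose(f);
        return -1;
    }

    volatile int *a = pa;   /* first alias  */
    volatile int *b = pb;   /* second alias of the same physical page */
    int ok = 1;
    for (int i = 1; i <= 1000; i++) {
        *a = i;             /* store via one virtual address */
        if (*b != i)        /* load via the other, straight away */
            ok = 0;
    }

    munmap(pa, (size_t)page);
    munmap(pb, (size_t)page);
    fclose(f);
    return ok;
}
```

On a machine whose store buffer forwards by virtual address only, the load through the second alias is where a stale value would be expected to show up; on coherent hardware the function should return 1.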
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 13:29 ` Jamie Lokier @ 2003-09-03 16:07 ` Nagendra Singh Tomar 2003-09-04 5:03 ` Davide Libenzi 2003-09-04 11:19 ` Alan Cox 0 siblings, 2 replies; 106+ messages in thread From: Nagendra Singh Tomar @ 2003-09-03 16:07 UTC (permalink / raw) To: Jamie Lokier Cc: Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Jamie, Just wondered if the store buffer is snooped in some architectures. In that case I believe the OS need not do anything for serialization (except for aliases, if they do not hit the same cache line). In x86 the store buffer is not snooped, which leads to all these serialization issues (other CPUs looking at a stale value of data which is in the store buffer of some other CPU). Please correct me if I have got anything wrong. Thanx, tomar On Wed, 3 Sep 2003, Jamie Lokier wrote: > Geert Uytterhoeven wrote: > > > BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060 > > > caches are PIPT. > > > > That explains a bit. But the '060 stores are coherent, while the '040 stores > > aren't. > > The L1 cache is coherent on the '040 according to the results. It's > the store buffer snooping which fails. Presumably the CPU core is > looking ahead at recent writes comparing just virtual addresses. > > -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 16:07 ` Nagendra Singh Tomar @ 2003-09-04 5:03 ` Davide Libenzi 2003-09-03 18:03 ` Nagendra Singh Tomar 2003-09-04 11:19 ` Alan Cox 1 sibling, 1 reply; 106+ messages in thread From: Davide Libenzi @ 2003-09-04 5:03 UTC (permalink / raw) To: Nagendra Singh Tomar Cc: Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote: > Jamie, > Just wondered if the store buffer is snooped in some > architectures. In that case I believe the OS need not do anything for > serialization (except for aliases, if they do not hit the same cache line). > In x86 store buffer is not snooped which leads to all these serialization > issues (other CPUs looking at stale value of data which is in the store > buffer of some other CPU). > Pl correct me if I have got anything wrong/ To avoid the so called 'load hazard' (that, BTW, triggers read over writes, that are not allowed in x86) you have two options. Snoop the write buffer or flush it upon L1 miss. Otherwise you might end up getting stale data from L2. - Davide ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-04 5:03 ` Davide Libenzi @ 2003-09-03 18:03 ` Nagendra Singh Tomar 2003-09-04 6:38 ` Davide Libenzi 0 siblings, 1 reply; 106+ messages in thread From: Nagendra Singh Tomar @ 2003-09-03 18:03 UTC (permalink / raw) To: Davide Libenzi Cc: Tomar, Nagendra, Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development On Thu, 4 Sep 2003, Davide Libenzi wrote: > On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote: > > > Jamie, > > Just wondered if the store buffer is snooped in some > > architectures. In that case I believe the OS need not do anything for > > serialization (except for aliases, if they do not hit the same cache line). > > In x86 store buffer is not snooped which leads to all these serialization > > issues (other CPUs looking at stale value of data which is in the store > > buffer of some other CPU). > > Pl correct me if I have got anything wrong/ > > To avoid the so called 'load hazard' (that, BTW, triggers read over > writes, that are not allowed in x86) you have two options. Snoop the write > buffer or flush it upon L1 miss. Otherwise you might end up getting stale > data from L2. I meant to ask if the store buffer is snooped by *other CPUs*. To maintain self coherence the local store buffer has to be anyway consulted by local loads to give the latest stored value. Thanx, tomar > - Davide ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 18:03 ` Nagendra Singh Tomar @ 2003-09-04 6:38 ` Davide Libenzi 0 siblings, 0 replies; 106+ messages in thread From: Davide Libenzi @ 2003-09-04 6:38 UTC (permalink / raw) To: Nagendra Singh Tomar Cc: Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote: > I meant to ask if the store buffer is snooped by *other CPUs*. To maintain > self coherence the local store buffer has to be anyway consulted by local > loads to give the latest stored value. There are CPUs (at least some version of Alpha, 21064 IIRC) that use flush upon L1 read miss, so they do not snoop their local WB. IIRC P5 has internal and external snooping while P6, using a write allocate L1, does not have external snooping. - Davide ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 16:07 ` Nagendra Singh Tomar 2003-09-04 5:03 ` Davide Libenzi @ 2003-09-04 11:19 ` Alan Cox 2003-09-05 21:24 ` Pavel Machek 1 sibling, 1 reply; 106+ messages in thread From: Alan Cox @ 2003-09-04 11:19 UTC (permalink / raw) To: nagendra_tomar Cc: Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development On Wed, 2003-09-03 at 17:07, Nagendra Singh Tomar wrote: > In x86 store buffer is not snooped which leads to all these serialization > issues (other CPUs looking at stale value of data which is in the store > buffer of some other CPU). x86 gives you coherency and store ordering (barring errata and special CPU modes) ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-04 11:19 ` Alan Cox @ 2003-09-05 21:24 ` Pavel Machek 2003-09-06 23:09 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Pavel Machek @ 2003-09-05 21:24 UTC (permalink / raw) To: Alan Cox Cc: nagendra_tomar, Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Hi! > > In x86 store buffer is not snooped which leads to all these serialization > > issues (other CPUs looking at stale value of data which is in the store > > buffer of some other CPU). > > x86 gives you coherency and store ordering (barring errata and special > CPU modes) Special CPU modes? You mean some special SSE stores? Pavel -- When do you have a heart between your knees? [Johanka's followup: and *two* hearts?] ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-05 21:24 ` Pavel Machek @ 2003-09-06 23:09 ` Jamie Lokier 2003-09-07 13:10 ` Pavel Machek 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-06 23:09 UTC (permalink / raw) To: Pavel Machek Cc: Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Pavel Machek wrote: > > x86 gives you coherency and store ordering (barring errata and special > > CPU modes) > > Special CPU modes? You mean some special SSE stores? Take a look at arch/i386/kernel/cpu/centaur.c, and CONFIG_X86_OOSTORE. You can change the memory settings to weakly ordered writes, which means that a plain write isn't suitable for spin_unlock. Presumably this mode is faster (though I don't see why, if Intel, AMD et al. can manage good memory performance without weak writes). -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
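The point about a plain write not being suitable for spin_unlock under weakly ordered stores can be shown with a userspace analogy. The sketch below uses C11 atomics and a made-up toy_spinlock type; it is not the kernel's spinlock code and says nothing about what CONFIG_X86_OOSTORE actually emits. The only claim is the general one: the unlocking store needs release semantics so that writes made inside the critical section become visible before the lock is seen as free.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy lock for illustration only (hypothetical, not the kernel's type). */
typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} toy_spinlock;

/* Try to take the lock once; returns 1 on success, 0 if already held. */
static int toy_trylock(toy_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is taken. */
static void toy_lock(toy_spinlock *l)
{
    while (!toy_trylock(l)) {
        /* busy-wait */
    }
}

/* Release store: earlier writes in the critical section may not be
 * reordered after this store.  With strongly ordered stores this can be
 * an ordinary store instruction; with weakly ordered stores the compiler
 * has to emit something stronger. */
static void toy_unlock(toy_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A caller would initialise the lock with atomic_init(&l.locked, 0) and bracket the critical section with toy_lock() and toy_unlock().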
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-06 23:09 ` Jamie Lokier @ 2003-09-07 13:10 ` Pavel Machek 2003-09-07 13:35 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Pavel Machek @ 2003-09-07 13:10 UTC (permalink / raw) To: Jamie Lokier Cc: Pavel Machek, Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Hi! > > > x86 gives you coherency and store ordering (barring errata and special > > > CPU modes) > > > > Special CPU modes? You mean some special SSE stores? > > Take a look at arch/i386/kernel/cpu/centaur.c, and CONFIG_X86_OOSTORE. > > You can change the memory settings to weakly ordered writes, which > means that a plain write isn't suitable for spin_unlock. Presumably > this mode is faster (though I don't see why, if Intel, AMD et al. can > manage good memory performance without weak writes). Wow, seems interesting, how much performance does it buy? [Maybe AMD and Intel just threw a lot of silicon at the problem and it went away. Centaur solution might be nicer, though -- spin_unlock is so uncommon that this seems like a nice optimization.] -- Horseback riding is like software... ...vgf orggre jura vgf serr. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-07 13:10 ` Pavel Machek @ 2003-09-07 13:35 ` Jamie Lokier 2003-09-07 13:40 ` Pavel Machek 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-07 13:35 UTC (permalink / raw) To: Pavel Machek Cc: Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Pavel Machek wrote: > Wow, seems interesting, how much performance does it buy? [Maybe AMD > and Intel just threw a lot of silicon at the problem and it went > away. Centaur solution might be nicer, though -- spin_unlock is so > uncommon that this seems like a nice optimization.] I didn't realise Centaur SMP systems existed, but I guess they must do for weak memory writes to mean anything. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-07 13:35 ` Jamie Lokier @ 2003-09-07 13:40 ` Pavel Machek 2003-09-07 13:53 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Pavel Machek @ 2003-09-07 13:40 UTC (permalink / raw) To: Jamie Lokier Cc: Pavel Machek, Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Hi! Perhaps weak ordering matters when you are writing to the MMIO, too? > > Wow, seems interesting, how much performance does it buy? [Maybe AMD > > and Intel just threw a lot of silicon at the problem and it went > > away. Centaur solution might be nicer, though -- spin_unlock is so > > uncommon that this seems like a nice optimization.] > > I didn't realise Centaur SMP systems existed, but I guess they must do > for weak memory writes to mean anything. > > -- Jamie -- Horseback riding is like software... ...vgf orggre jura vgf serr. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-07 13:40 ` Pavel Machek @ 2003-09-07 13:53 ` Jamie Lokier 2003-09-07 17:56 ` Alan Cox 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-07 13:53 UTC (permalink / raw) To: Pavel Machek Cc: Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development Pavel Machek wrote: > Perhaps weak ordering matters when you are writing to the MMIO, too? Perhaps, but the code in arch/i386/kernel/cpu/centaur.c seems to try hard to set weak ordering for RAM, not the whole address space. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-07 13:53 ` Jamie Lokier @ 2003-09-07 17:56 ` Alan Cox 0 siblings, 0 replies; 106+ messages in thread From: Alan Cox @ 2003-09-07 17:56 UTC (permalink / raw) To: Jamie Lokier Cc: Pavel Machek, nagendra_tomar, Geert Uytterhoeven, Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development On Sun, 2003-09-07 at 14:53, Jamie Lokier wrote: > Pavel Machek wrote: > > Perhaps weak ordering matters when you are writing to the MMIO, too? > > Perhaps, but the code in arch/i386/kernel/cpu/centaur.c seems to try > hard to set weak ordering for RAM, not the whole address space. There are three cases I know of where you get weak store ordering that is visible in some way #1 Pentium Pro due to an erratum, hence the need for lock in the spin_unlock #2 Centaur Winchip (where OOSTORE off is worth 10-30% performance on common tasks). A lot of that has to do with the nature of the CPU and the old socket 7 bus stuff. It's not SMP but we have to care about it for mmio not because mmio is itself out of order (we leave it in order) but because of DMA. We must ensure that our writes to ram finish -before- we kick off the hardware copying the data... #3 Weak store ordering via SSE type instructions, where it's intentional and an sfence is needed eventually ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 7:59 ` Geert Uytterhoeven 2003-09-03 9:13 ` Jamie Lokier @ 2003-09-03 12:13 ` Jan-Benedict Glaw 1 sibling, 0 replies; 106+ messages in thread From: Jan-Benedict Glaw @ 2003-09-03 12:13 UTC (permalink / raw) To: Linux/m68k kernel mailing list, Linux Kernel Development [-- Attachment #1: Type: text/plain, Size: 635 bytes --] On Wed, 2003-09-03 09:59:02 +0200, Geert Uytterhoeven <geert@linux-m68k.org> wrote in message <Pine.GSO.4.21.0309030958130.6985-100000@waterleaf.sonytel.be>: > On 2 Sep 2003, Kars de Jong wrote: > Now all that's left is the 68030. Maybe I get my Amiga 3000 installed these days... I think it has got an 68030. MfG, JBG -- Jan-Benedict Glaw jbglaw@lug-owl.de . +49-172-7608481 "Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg fuer einen Freien Staat voll Freier Bürger" | im Internet! | im Irak! ret = do_actions((curr | FREE_SPEECH) & ~(IRAQ_WAR_2 | DRM | TCPA)); [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 8:34 ` Geert Uytterhoeven 2003-09-01 9:09 ` Kars de Jong @ 2003-09-01 10:35 ` Sam Creasey 2003-09-01 10:48 ` Jamie Lokier 2003-09-03 8:00 ` Kars de Jong 2 siblings, 1 reply; 106+ messages in thread From: Sam Creasey @ 2003-09-01 10:35 UTC (permalink / raw) To: Geert Uytterhoeven; +Cc: Jamie Lokier, Linux/m68k, Linux Kernel Development On Mon, 1 Sep 2003, Geert Uytterhoeven wrote: > As you probably know the 68020 had an external MMU (68851, or Sun-3 or Apollo > MMU). Probably Motorola didn't bother to change the behavior when the MMU got > integrated in later generations (68030 and up). > > BTW, probably you want us to run your test program on other m68k boxes? Mine > got a 68040, that leaves us with: > - 68020+Sun-3 MMU 68020+Sun-3 MMU results attached below (this is for a 3/60, and it's not surprising that it passes, as there's no real cache in this configuration (the sun3/2xx did have external cache, but the onboard ethernet in my 3/210 is on the fritz, and it's not booting at the moment)). Note that this is the newer version of the program which Jamie just posted.
bash-2.03# time ./jamie-test2 (2048) [10000,10000,0] Test separation: 8192 bytes: pass (2048) [10000,10000,0] Test separation: 16384 bytes: pass (2048) [10000,10000,0] Test separation: 32768 bytes: pass (2048) [10000,10000,0] Test separation: 65536 bytes: pass (2048) [10000,10000,0] Test separation: 131072 bytes: pass (2048) [10000,10000,0] Test separation: 262144 bytes: pass (2048) [10000,10000,0] Test separation: 524288 bytes: pass (2048) [10000,10000,0] Test separation: 1048576 bytes: pass (2048) [10000,10000,0] Test separation: 2097152 bytes: pass (2048) [10000,10000,0] Test separation: 4194304 bytes: pass (2048) [10000,10000,0] Test separation: 8388608 bytes: pass (2048) [10000,10000,0] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 1m34.330s user 1m30.030s sys 0m4.070s bash-2.03# cat /proc/cpuinfo CPU: 68020 MMU: Sun-3 FPU: 68881 Clocking: 19.9MHz BogoMips: 4.97 Calibration: 24896 loops -- Sam ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 10:35 ` Sam Creasey @ 2003-09-01 10:48 ` Jamie Lokier 2003-09-01 12:23 ` Sam Creasey 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 10:48 UTC (permalink / raw) To: Sam Creasey; +Cc: Geert Uytterhoeven, Linux/m68k, Linux Kernel Development Sam Creasey wrote: > 68020+Sun-3 MMU results attached below (this is for a 3/60, and it's not > surprising that it passes, as there's no real cache in this configuration > (the sun3/2xx did have external cache, but the onboard ethernet in my > 3/210 is on the fritz, and it's not booting at the moment)). Note that > this is the newer version of the program which Jamie just posted. Thanks. > bash-2.03# time ./jamie-test2 > (2048) [10000,10000,0] Test separation: 8192 bytes: pass Mighty suspicious gettimeofday() you have there. > real 1m34.330s > user 1m30.030s > sys 0m4.070s Indeed, on other systems the test completes in a few seconds at most, not because of CPU speed, but because gettimeofday() returns high resolution time on them. Isn't there a way to read high resolution time on the 68020 Sun-3? -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 10:48 ` Jamie Lokier @ 2003-09-01 12:23 ` Sam Creasey 0 siblings, 0 replies; 106+ messages in thread From: Sam Creasey @ 2003-09-01 12:23 UTC (permalink / raw) To: Jamie Lokier; +Cc: Geert Uytterhoeven, Linux/m68k, Linux Kernel Development On Mon, 1 Sep 2003, Jamie Lokier wrote: > Sam Creasey wrote: > > > bash-2.03# time ./jamie-test2 > > (2048) [10000,10000,0] Test separation: 8192 bytes: pass > > Mighty suspicious gettimeofday() you have there. > > > real 1m34.330s > > user 1m30.030s > > sys 0m4.070s > > Indeed, on other systems the test completes in a few seconds at most, > not because of CPU speed, but because gettimeofday() returns high > resolution time on them. > > Isn't there a way to read high resolution time on the 68020 Sun-3? AFAICT, no. I've dug through the datasheets for the intersil RTC used, as well as the NetBSD code, and SunOS headers, and it seems that we're stuck with 1/100th second accuracy. Bummer. -- Sam ^ permalink raw reply [flat|nested] 106+ messages in thread
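The clock resolution being discussed here can be estimated from userspace. The following is a rough sketch, not part of the test program in this thread: it assumes a POSIX gettimeofday(), and the function name gettimeofday_step_us is invented for this note. It reports the smallest forward step seen between successive readings, which on a box with only a 100 Hz tick would be expected to come out near 10000 us.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Smallest non-zero forward step, in microseconds, observed between
 * successive gettimeofday() readings.  'samples' is the number of
 * forward steps to observe before returning; returns -1 if samples <= 0. */
static long gettimeofday_step_us(int samples)
{
    struct timeval prev, now;
    long best = -1;
    int seen = 0;

    gettimeofday(&prev, NULL);
    while (seen < samples) {
        gettimeofday(&now, NULL);
        long d = (long)(now.tv_sec - prev.tv_sec) * 1000000L
               + (long)(now.tv_usec - prev.tv_usec);
        if (d > 0) {                 /* the clock moved forward */
            if (best < 0 || d < best)
                best = d;
            prev = now;
            seen++;
        }
    }
    return best;
}
```

Calling gettimeofday_step_us(10) spins until ten forward steps have been seen, so on a tick-only clock it takes roughly ten ticks to return.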
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 8:34 ` Geert Uytterhoeven 2003-09-01 9:09 ` Kars de Jong 2003-09-01 10:35 ` Sam Creasey @ 2003-09-03 8:00 ` Kars de Jong 2003-09-03 8:05 ` Geert Uytterhoeven 2 siblings, 1 reply; 106+ messages in thread From: Kars de Jong @ 2003-09-03 8:00 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote: > BTW, probably you want us to run your test program on other m68k boxes? Mine > got a 68040, that leaves us with: > - 68030 Ah, I forgot, I've got one of these here too, a Motorola MVME147 board: sasscm:/tmp# time ./jamie_test2 Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: FAIL - cache not coherent Test separation: 32768 bytes: FAIL - cache not coherent Test separation: 65536 bytes: FAIL - cache not coherent Test separation: 131072 bytes: FAIL - cache not coherent Test separation: 262144 bytes: FAIL - cache not coherent Test separation: 524288 bytes: FAIL - cache not coherent Test separation: 1048576 bytes: FAIL - cache not coherent Test separation: 2097152 bytes: FAIL - cache not coherent Test separation: 4194304 bytes: FAIL - cache not coherent Test separation: 8388608 bytes: FAIL - cache not coherent Test separation: 16777216 bytes: FAIL - cache not coherent VM page alias coherency test: failed; will use copy buffers instead real 0m1.149s user 0m0.240s sys 0m0.670s sasscm:/tmp# cat /proc/cpuinfo CPU: 68030 MMU: 68030 FPU: 68882 Clocking: 19.6MHz BogoMips: 4.90 Calibration: 24512 loops Regards, Kars. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 8:00 ` Kars de Jong @ 2003-09-03 8:05 ` Geert Uytterhoeven 2003-09-03 9:24 ` Kars de Jong 0 siblings, 1 reply; 106+ messages in thread From: Geert Uytterhoeven @ 2003-09-03 8:05 UTC (permalink / raw) To: Kars de Jong Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development On 3 Sep 2003, Kars de Jong wrote: > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote: > > BTW, probably you want us to run your test program on other m68k boxes? Mine > > got a 68040, that leaves us with: > > - 68030 > > Ah, I forgot, I've got one of these here too, a Motorola MVME147 board: > > sasscm:/tmp# time ./jamie_test2 > Test separation: 4096 bytes: FAIL - cache not coherent I guess the Plessey PME 68-22 didn't have cache, since the test passed? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 8:05 ` Geert Uytterhoeven @ 2003-09-03 9:24 ` Kars de Jong 0 siblings, 0 replies; 106+ messages in thread From: Kars de Jong @ 2003-09-03 9:24 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development On Wed, 2003-09-03 at 10:05, Geert Uytterhoeven wrote: > On 3 Sep 2003, Kars de Jong wrote: > > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote: > > > BTW, probably you want us to run your test program on other m68k boxes? Mine > > > got a 68040, that leaves us with: > > > - 68030 > > > > Ah, I forgot, I've got one of these here too, a Motorola MVME147 board: > > > > sasscm:/tmp# time ./jamie_test2 > > Test separation: 4096 bytes: FAIL - cache not coherent > > I guess the Plessey PME 68-22 didn't have cache, since the test passed? No, no cache. Well. A very tiny instruction cache in the 68020 itself. Regards, Kars. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (11 preceding siblings ...) 2003-08-29 16:27 ` Geert Uytterhoeven @ 2003-08-29 16:31 ` Brian Jackson 2003-08-29 17:39 ` Matt Porter ` (9 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Brian Jackson @ 2003-08-29 16:31 UTC (permalink / raw) To: Jamie Lokier, linux-kernel On Friday 29 August 2003 12:35 am, Jamie Lokier wrote: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. > <snip> Didn't see a 512k cache athlon-xp yet skyline:/share/linux/projects/cachetest # sh go processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 10 model name : AMD Athlon(tm) XP 2800+ stepping : 0 cpu MHz : 2088.111 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 4168.08 Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) real 0m0.110s user 0m0.070s sys 0m0.030s --Brian Jackson -- OpenGFS -- http://opengfs.sourceforge.net Gentoo -- http://gentoo.brianandsara.net Home -- http://www.brianandsara.net ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (12 preceding siblings ...) 2003-08-29 16:31 ` Brian Jackson @ 2003-08-29 17:39 ` Matt Porter 2003-09-01 6:00 ` Jamie Lokier 2003-08-29 19:37 ` Thorsten Kranzkowski ` (8 subsequent siblings) 22 siblings, 1 reply; 106+ messages in thread From: Matt Porter @ 2003-08-29 17:39 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote: > Anyway, please lots of people run the program and post the output + > /proc/cpuinfo. Compile with optimisation, -O or -O2 is fine. (You > can add -DHAVE_SYSV_SHM too if you like): > > gcc -o test test.c -O2 > time ./test > cat /proc/cpuinfo PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI ----- 440gx-1:~/cachetest# gcc -o test test.c -O2 440gx-1:~/cachetest# time ./test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.193s user 0m0.140s sys 0m0.010s 440gx-1:~/cachetest# cat /proc/cpuinfo cpu : 440GX Rev. A revision : 24.80 (pvr 51b2 1850) bogomips : 624.23 vendor : IBM machine : PPC440GX EVB (Ocotea) 440gx-1:~/cachetest# -- Matt Porter mporter@kernel.crashing.org ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 17:39 ` Matt Porter @ 2003-09-01 6:00 ` Jamie Lokier 2003-09-01 11:17 ` Alan Cox 2003-09-01 17:22 ` Roland Dreier 0 siblings, 2 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 6:00 UTC (permalink / raw) To: Matt Porter; +Cc: linux-kernel Matt Porter wrote: > PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI The cache looks very coherent to me. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 6:00 ` Jamie Lokier @ 2003-09-01 11:17 ` Alan Cox 2003-09-01 17:22 ` Roland Dreier 1 sibling, 0 replies; 106+ messages in thread From: Alan Cox @ 2003-09-01 11:17 UTC (permalink / raw) To: Jamie Lokier; +Cc: Matt Porter, Linux Kernel Mailing List On Mon, 2003-09-01 at 07:00, Jamie Lokier wrote: > Matt Porter wrote: > > PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI > > The cache looks very coherent to me. The only x86 which will show the user non cache coherent behaviour (and then only in a really weird situation) is SMP Pentium Pro due to the store fence errata. The Winchip is non SMP so you won't see CPU<->CPU store ordering changes, although I guess mmap of mmio space might show you stuff if you really tried hard. The Geode has bus level magic so it's out of order, but if you ask then you get the right answer (kind of the zen question about falling trees implemented in silicon). ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 6:00 ` Jamie Lokier 2003-09-01 11:17 ` Alan Cox @ 2003-09-01 17:22 ` Roland Dreier 2003-09-02 2:16 ` Matt Porter 1 sibling, 1 reply; 106+ messages in thread From: Roland Dreier @ 2003-09-01 17:22 UTC (permalink / raw) To: Jamie Lokier; +Cc: Matt Porter, linux-kernel Matt> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache Matt> is PTPI Jamie> The cache looks very coherent to me. Matt (like me) is probably just used to thinking of the IBM PPC 440 chips as non-coherent because they are not cache coherent with respect to external bus masters (eg they don't snoop the PCI bus). Of course, this is a different type of coherency from what you are measuring. - Roland ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 17:22 ` Roland Dreier @ 2003-09-02 2:16 ` Matt Porter 2003-09-02 5:40 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Matt Porter @ 2003-09-02 2:16 UTC (permalink / raw) To: Roland Dreier; +Cc: Jamie Lokier, Matt Porter, linux-kernel On Mon, Sep 01, 2003 at 10:22:02AM -0700, Roland Dreier wrote: > Matt> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache > Matt> is PTPI > > Jamie> The cache looks very coherent to me. > > Matt (like me) is probably just used to thinking of the IBM PPC 440 > chips as non-coherent because they are not cache coherent with respect > to external bus masters (eg they don't snoop the PCI bus). Of course, > this is a different type of coherency from what you are measuring. Exactly. After reading some other subthreads I see the other version of "cache coherency" that Jamie is interested in. -Matt ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 2:16 ` Matt Porter @ 2003-09-02 5:40 ` Jamie Lokier 0 siblings, 0 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-02 5:40 UTC (permalink / raw) To: Matt Porter; +Cc: Roland Dreier, linux-kernel Matt Porter wrote: > Exactly. After reading some other subthreads I see the other version of > "cache coherency" that Jamie is interested in. Indeed, quite a lot of systems don't offer cache coherence with peripherals, other CPUs (if any) and in some cases even with other tasks on the same CPU. Isn't memory fun? :) -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
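Editorial aside: the kind of coherence measured in this thread is between two virtual mappings of the same memory within one task. The following is a minimal sketch of that idea, not the test program itself: it maps one page of an unlinked temporary file twice, stores through one mapping and loads through the other. The function name and the `/tmp/alias-XXXXXX` template are hypothetical. Unlike the real test, it lets the kernel choose both addresses, so it does not control the separation between the aliases and does no timing.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Map the same page of a temporary file at two virtual addresses,
   store through one alias and load through the other.
   Returns 1 if the store is visible through the other alias,
   0 if it is not, and -1 if the mappings could not be set up.  */
static int
alias_is_coherent (void)
{
  char path[] = "/tmp/alias-XXXXXX";
  long page = sysconf (_SC_PAGESIZE);
  volatile int *a, *b;
  int fd, ok;

  if (page <= 0)
    return -1;

  fd = mkstemp (path);
  if (fd == -1)
    return -1;
  unlink (path);                /* The file lives on through `fd'.  */

  if (ftruncate (fd, page) != 0)
    {
      close (fd);
      return -1;
    }

  /* Two independent MAP_SHARED mappings of the same file page.  */
  a = mmap (NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  b = mmap (NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close (fd);

  if (a == MAP_FAILED || b == MAP_FAILED)
    {
      if (a != MAP_FAILED)
        munmap ((void *) a, page);
      if (b != MAP_FAILED)
        munmap ((void *) b, page);
      return -1;
    }

  *a = 0;                       /* Bring the line in through alias A.  */
  *b = 0x12345678;              /* Store through alias B.  */
  ok = (*a == 0x12345678);      /* Is the store visible through A?  */

  munmap ((void *) a, page);
  munmap ((void *) b, page);
  return ok;
}
```

On hardware with coherent aliases the function should return 1. A single store and load is far weaker evidence than the repeated, timed accesses at controlled separations that the test program in this thread performs.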
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (13 preceding siblings ...) 2003-08-29 17:39 ` Matt Porter @ 2003-08-29 19:37 ` Thorsten Kranzkowski 2003-08-29 20:03 ` Sean Neakums ` (7 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Thorsten Kranzkowski @ 2003-08-29 19:37 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. Dual Alpha ev6: ds20:~/src/cachetest$ ./doit Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (4 pages) real 0m4.148s user 0m4.029s sys 0m0.075s cpu : Alpha cpu model : EV6 cpu variation : 7 cpu revision : 0 cpu serial number : system type : Tsunami system variation : Goldrush system revision : 0 system serial number : ay91560403 cycle frequency [Hz] : 500000000 timer frequency [Hz] : 1024.00 page size [bytes] : 8192 phys. address bits : 44 max. addr. 
space # : 255 BogoMIPS : 998.56 kernel unaligned acc : 0 (pc=0,va=0) user unaligned acc : 0 (pc=0,va=0) platform string : AlphaServer DS20 500 MHz cpus detected : 2 cpus active : 2 cpu active mask : 0000000000000003 Single Alpha ev4 (AXPpci33): Marvin:~/src/cachetest$ ./doit Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m1.442s user 0m0.853s sys 0m0.471s cpu : Alpha cpu model : LCA4 cpu variation : -4294967301 cpu revision : 0 cpu serial number : Linux_is_Great! system type : Noname system variation : 0 system revision : 0 system serial number : MILO-2.2-17 cycle frequency [Hz] : 166868457 timer frequency [Hz] : 1024.00 page size [bytes] : 8192 phys. address bits : 34 max. addr. 
space # : 63 BogoMIPS : 320.40 kernel unaligned acc : 56014443 (pc=fffffc0000ab65a4,va=fffffc0000b99105) user unaligned acc : 2695 (pc=2000031ff90,va=11fffef26) platform string : N/A cpus detected : 0 ordinary Pentium II: bash-2.03$ ./doit Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.342s user 0m0.290s sys 0m0.030s processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 3 model name : Pentium II (Klamath) stepping : 4 cpu MHz : 300.691 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov mmx bogomips : 599.65 bye, Thorsten -- | Thorsten Kranzkowski Internet: dl8bcu@dl8bcu.de | | Mobile: ++49 170 1876134 Snail: Kiebitzstr. 14, 49324 Melle, Germany | | Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8bcu@marvin.dl8bcu.ampr.org [44.130.8.19] | ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (14 preceding siblings ...) 2003-08-29 19:37 ` Thorsten Kranzkowski @ 2003-08-29 20:03 ` Sean Neakums 2003-08-29 20:14 ` Iulian Musat ` (6 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Sean Neakums @ 2003-08-29 20:03 UTC (permalink / raw) To: linux-kernel Jamie Lokier <jamie@shareable.org> writes: > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. 2-way Pentium III: $ time ./va Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.096s user 0m0.073s sys 0m0.023s $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 11 model name : Intel(R) Pentium(R) III CPU family 1133MHz stepping : 1 cpu MHz : 1129.879 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 2220.03 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 11 model name : Intel(R) Pentium(R) III CPU family 1133MHz stepping : 1 cpu MHz : 1129.879 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 2252.80 ^ permalink raw 
reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (15 preceding siblings ...) 2003-08-29 20:03 ` Sean Neakums @ 2003-08-29 20:14 ` Iulian Musat 2003-08-29 20:26 ` Paul J.Y. Lahaie ` (5 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Iulian Musat @ 2003-08-29 20:14 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel Jamie Lokier wrote: > Anyway, please lots of people run the program and post the output + > /proc/cpuinfo. Compile with optimisation, -O or -O2 is fine. (You > can add -DHAVE_SYSV_SHM too if you like): > > gcc -o test test.c -O2 > time ./test > cat /proc/cpuinfo 2 AMD Athlon 4 Itanium II (on an altix machine) 2 Pentium III 1 AMD XP 1 Pentium IV 2 AMD Athlon : ~~~~~~~~~~~~~~~~~~~~~~~~ gcc -o test test.c -O2 time ./test Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) real 0m0.088s user 0m0.080s sys 0m0.004s cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1526.385 cache size : 256 KB Physical processor ID : -2084402944 Number of siblings : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 3038.00 processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD 
Athlon(tm) Processor stepping : 2 cpu MHz : 1526.385 cache size : 256 KB Physical processor ID : 410321912 Number of siblings : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 3046.10 ~~~~~~~~~~~~~~~~~~~~~~~~ 4 Itanium II (on an altix machine) ~~~~~~~~~~~~~~~~~~~~~~~~ gcc -o test test.c -O2 time ./test Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.095s user 0m0.065s sys 0m0.028s cat /proc/cpuinfo processor : 0 vendor : GenuineIntel arch : IA-64 family : Itanium 2 model : 0 revision : 7 archrev : 0 features : branchlong cpu number : 0 cpu regs : 4 cpu MHz : 900.000000 itc MHz : 900.000000 BogoMIPS : 1346.37 processor : 1 vendor : GenuineIntel arch : IA-64 family : Itanium 2 model : 0 revision : 7 archrev : 0 features : branchlong cpu number : 0 cpu regs : 4 cpu MHz : 900.000000 itc MHz : 900.000000 BogoMIPS : 1346.37 processor : 2 vendor : GenuineIntel arch : IA-64 family : Itanium 2 model : 0 revision : 7 archrev : 0 features : branchlong cpu number : 0 cpu regs : 4 cpu MHz : 900.000000 itc MHz : 900.000000 BogoMIPS : 1342.17 processor : 3 vendor : GenuineIntel arch : IA-64 family : Itanium 2 model : 0 revision : 7 archrev : 0 features : branchlong cpu number : 0 cpu regs : 4 cpu MHz : 900.000000 itc MHz : 900.000000 BogoMIPS : 1342.17 ~~~~~~~~~~~~~~~~~~~~~~~~ 2 Pentium III ~~~~~~~~~~~~~~~~~~~~~~~~ gcc -o test test.c -O2 time ./test Test separation: 4096 bytes: pass 
Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.154s user 0m0.109s sys 0m0.020s cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 8 model name : Pentium III (Coppermine) stepping : 3 cpu MHz : 846.353 cache size : 256 KB Physical processor ID : 0 Number of siblings : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1682.99 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 8 model name : Pentium III (Coppermine) stepping : 3 cpu MHz : 846.353 cache size : 256 KB Physical processor ID : 0 Number of siblings : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1691.09 ~~~~~~~~~~~~~~~~~~~~~~~~ 1 AMD XP ~~~~~~~~~~~~~~~~~~~~~~~~ gcc -o test test.c -O2 time ./test Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias 
coherency test: minimum fast spacing: 32768 (8 pages) real 0m0.077s user 0m0.060s sys 0m0.010s cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) XP 2100+ stepping : 2 cpu MHz : 1746.168 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 3486.51 ~~~~~~~~~~~~~~~~~~~~~~~~ 1 Pentium IV ~~~~~~~~~~~~~~~~~~~~~~~~ gcc -o test test.c -O2 time ./test Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.221s user 0m0.180s sys 0m0.025s cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 0 model name : Intel(R) Pentium(R) 4 CPU 1700MHz stepping : 10 cpu MHz : 1694.928 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 3365.99 ~~~~~~~~~~~~~~~~~~~~~~~~ -iulian ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (16 preceding siblings ...) 2003-08-29 20:14 ` Iulian Musat @ 2003-08-29 20:26 ` Paul J.Y. Lahaie 2003-09-01 8:15 ` Russell King 2003-08-29 22:35 ` Kenneth Johansson ` (4 subsequent siblings) 22 siblings, 1 reply; 106+ messages in thread From: Paul J.Y. Lahaie @ 2003-08-29 20:26 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 2055 bytes --] Ran it on a few systems here. Corel NetWinder (275MHz StrongARM) Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: FAIL - cache not coherent Test separation: 32768 bytes: FAIL - cache not coherent Test separation: 65536 bytes: FAIL - cache not coherent Test separation: 131072 bytes: FAIL - cache not coherent Test separation: 262144 bytes: FAIL - cache not coherent Test separation: 524288 bytes: FAIL - cache not coherent Test separation: 1048576 bytes: FAIL - cache not coherent Test separation: 2097152 bytes: FAIL - cache not coherent Test separation: 4194304 bytes: FAIL - cache not coherent Test separation: 8388608 bytes: FAIL - cache not coherent Test separation: 16777216 bytes: FAIL - cache not coherent VM page alias coherency test: failed; will use copy buffers instead cat /proc/cpuinfo Processor : StrongARM-110 rev 3 (v4l) BogoMIPS : 185.95 Features : swp half 26bit fastmult Hardware : Rebel-NetWinder Revision : 52ff Serial : 00000000000008bf HP zx6000 (2xItanium 2) time ./test Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias 
coherency test: all sizes passed real 0m7.455s user 0m7.412s sys 0m0.040s cat /proc/cpuinfo processor : 0 vendor : GenuineIntel arch : IA-64 family : Itanium 2 model : 0 revision : 7 archrev : 0 features : branchlong cpu number : 0 cpu regs : 4 cpu MHz : 900.000000 itc MHz : 900.000000 BogoMIPS : 1346.37 [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 20:26 ` Paul J.Y. Lahaie @ 2003-09-01 8:15 ` Russell King 2003-09-01 10:12 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Russell King @ 2003-09-01 8:15 UTC (permalink / raw) To: Paul J.Y. Lahaie; +Cc: Jamie Lokier, linux-kernel This looks like an old kernel on your NetWinder. Later 2.4 kernels should get this right (by marking the pages uncacheable in user space.) However, when I tried this program, it seemed to have some unexpected results, sometimes claiming that its too slow, sometimes that the store buffer isn't coherent, and sometimes saying that the cache isn't coherent. Oddly, davem's cache aliasing test program works every time. It's something which I need to look into, but I don't know when I'm going to find the time to delve into the memory management stuff. On Fri, Aug 29, 2003 at 04:26:28PM -0400, Paul J.Y. Lahaie wrote: > Corel NetWinder (275MHz StrongARM) > Test separation: 4096 bytes: FAIL - cache not coherent > Test separation: 8192 bytes: FAIL - cache not coherent > Test separation: 16384 bytes: FAIL - cache not coherent > Test separation: 32768 bytes: FAIL - cache not coherent > Test separation: 65536 bytes: FAIL - cache not coherent > Test separation: 131072 bytes: FAIL - cache not coherent > Test separation: 262144 bytes: FAIL - cache not coherent > Test separation: 524288 bytes: FAIL - cache not coherent > Test separation: 1048576 bytes: FAIL - cache not coherent > Test separation: 2097152 bytes: FAIL - cache not coherent > Test separation: 4194304 bytes: FAIL - cache not coherent > Test separation: 8388608 bytes: FAIL - cache not coherent > Test separation: 16777216 bytes: FAIL - cache not coherent > VM page alias coherency test: failed; will use copy buffers instead > > cat /proc/cpuinfo > Processor : StrongARM-110 rev 3 (v4l) > BogoMIPS : 185.95 > Features : swp half 26bit fastmult > > Hardware : Rebel-NetWinder > Revision : 52ff > Serial : 
00000000000008bf -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 8:15 ` Russell King @ 2003-09-01 10:12 ` Jamie Lokier 2003-09-01 11:30 ` Geert Uytterhoeven ` (2 more replies) 0 siblings, 3 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 10:12 UTC (permalink / raw) To: Russell King, Paul J.Y. Lahaie, linux-kernel Russell King wrote: > This looks like an old kernel on your NetWinder. Later 2.4 kernels > should get this right (by marking the pages uncacheable in user space.) How do they know which pages to mark uncacheable? Surely not all MAP_SHARED|MAP_FIXED mappings are uncacheable? > However, when I tried this program, it seemed to have some unexpected > results, sometimes claiming that its too slow, sometimes that the > store buffer isn't coherent, and sometimes saying that the cache > isn't coherent. If it says the store buffer isn't coherent, that means the main test for coherence failed (test_page_alias), but a second test (test_l1_only), which is designed to allow any CPU delayed stores to drain, is showing the same memory to be coherent. There is a bug in test_l1_only which I just noticed. It's unlikely, but if `dummy' happens to have the same L1 cache address as both words being tested, and it's a 2-way (or less) set-associative cache, then it will inadvertently flush the cache and say "store buffer not coherent" when it means to say "cache not coherent". If the duplicate mapping is uncacheable, it should always say it's too slow, however if _all_ MAP_FIXED|MAP_SHARED mappings are uncacheable then it compares the timings and will think there is no penalty for the duplicate mapping. > On Fri, Aug 29, 2003 at 04:26:28PM -0400, Paul J.Y. Lahaie wrote: > > Corel NetWinder (275MHz StrongARM) > > Test separation: 4096 bytes: FAIL - cache not coherent All the 3 results I have for ARM say that they are all incoherent. Those results are all for SA-110s of different speeds. 
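Editorial aside: the `dummy' collision described above can be expressed as set-index arithmetic. This is a sketch under assumptions: the helper names, and the way size and line size passed in, are hypothetical and do not come from the test program. It only encodes the stated condition, namely that `dummy' selects the same cache set as both tested words and the cache is 2-way or less.

```c
#include <stddef.h>
#include <stdint.h>

/* Set selected by an address in a cache whose ways each cover
   `way_size' bytes with lines of `line_size' bytes.  `way_size' is
   assumed to be a multiple of `line_size'.  */
static size_t
cache_set_index (uintptr_t addr, size_t way_size, size_t line_size)
{
  return (size_t) (addr % way_size) / line_size;
}

/* Nonzero if accessing `dummy' could evict the tested words: all
   three addresses select the same set, and the set has no more than
   two ways to hold them.  */
static int
dummy_may_flush (uintptr_t dummy, uintptr_t word1, uintptr_t word2,
                 size_t way_size, size_t line_size,
                 unsigned associativity)
{
  size_t set = cache_set_index (dummy, way_size, line_size);

  return associativity <= 2
         && set == cache_set_index (word1, way_size, line_size)
         && set == cache_set_index (word2, way_size, line_size);
}
```

For example, with a hypothetical 32768-byte way and 64-byte lines, words at 0x40000000 and 0x40008000 and a `dummy' at 0x08050000 all select set 0, so a 2-way cache cannot hold all three lines at once.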
Please try the program below, which is the same as before but with test_l1_only hopefully improved, and it prints some more helpful numbers. Thanks, -- Jamie ========================================== /* Version 3! This code maps shared memory to multiple addresses and tests it for cache coherency and performance. Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #include <assert.h> #include <stdlib.h> #include <string.h> #include <limits.h> #include <errno.h> #include <fcntl.h> #include <unistd.h> #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/signal.h> #include <sys/mman.h> #include <sys/time.h> #if HAVE_SYSV_SHM #include <sys/ipc.h> #include <sys/shm.h> #endif //#include "pagealias.h" /* Helpers to temporarily block all signals. These are used for when a race condition might leave a temporary file that should have been deleted -- we do our best to prevent this possibility. */ static void block_signals (sigset_t * save_state) { sigset_t all_signals; sigfillset (&all_signals); sigprocmask (SIG_BLOCK, &all_signals, save_state); } static void unblock_signals (sigset_t * restore_state) { sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0); } /* Open a new shared memory file, either using the POSIX.4 `shm_open' function, or using a regular temporary file in /tmp. 
Immediately after opening the file, it is unlinked from the global namespace using `shm_unlink' or `unlink'. On success, the value returned is a file descriptor. Otherwise, -1 is returned and `errno' is set. The descriptor can be closed using simply `close'. */ /* Note: `shm_open' requires link argument `-lposix4' on Suns. On GNU/Linux with Glibc, it requires `-lrt'. Unfortunately, Glibc's -lrt insists on linking to pthreads, which we may not want to use because that enables thread locking overhead in other functions. So we implement a direct method of opening shm on Linux. */ /* If this is changed, change the size of `buffer' below too. */ #if HAVE_SHM_OPEN #define SHM_DIR_PREFIX "/" /* `shm_open' arg needs "/" for portability. */ #elif defined (__linux__) #include <sys/statfs.h> #define SHM_DIR_PREFIX "/dev/shm/" #else #undef SHM_DIR_PREFIX #endif static int open_shared_memory_file (int use_tmp_file) { char * ptr, buffer [19]; int fd, i; unsigned long number; sigset_t save_signals; struct timeval tv; #if !HAVE_SHM_OPEN && defined (__linux__) struct statfs sfs; if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0 || sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */)) { errno = ENOSYS; return -1; } #endif loop: /* Print a randomised path name into `buffer'. The string depends on the directory and whether we are using POSIX.4 shared memory or a regular temporary file. RANDOM is a 5-digit, base-62 representation of a pseudo-random number. The string is used as a candidate in the search for an unused shared segment or file name. */ #ifdef SHM_DIR_PREFIX strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-"); #else strcpy (buffer, "/tmp/shm-"); #endif ptr = buffer + strlen (buffer); gettimeofday (&tv, (struct timezone *) 0); number = (unsigned long) random (); number += (unsigned long) getpid (); number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec; for (i = 0; i < 5; i++) { /* Don't use character arithmetic, as not all systems are ASCII. 
*/ *ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62]; number /= 62; } *ptr = '\0'; /* Block signals between the open and unlink, to really minimise the chance of accidentally leaving an unwanted file around. */ block_signals (&save_signals); #if HAVE_SHM_OPEN if (!use_tmp_file) { fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600); if (fd != -1) shm_unlink (buffer); } else #endif /* HAVE_SHM_OPEN */ { fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600); if (fd != -1) unlink (buffer); } unblock_signals (&save_signals); /* If we failed due to a name collision or a signal, try again. */ if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR)) goto loop; return fd; } /* Allocate a region of address space `size' bytes long, so that the region will not be allocated for any other purpose. It is freed with `munmap'. Returns the mapped base address on success. Otherwise, MAP_FAILED is returned and `errno' is set. */ static size_t system_page_size; #if !defined (MAP_ANONYMOUS) && defined (MAP_ANON) #define MAP_ANONYMOUS MAP_ANON #endif #ifndef MAP_NORESERVE #define MAP_NORESERVE 0 #endif #ifndef MAP_FILE #define MAP_FILE 0 #endif #ifndef MAP_VARIABLE #define MAP_VARIABLE 0 #endif #ifndef MAP_FAILED #define MAP_FAILED ((void *) -1) #endif #ifndef PROT_NONE #define PROT_NONE PROT_READ #endif static void * map_address_space (void * optional_address, size_t size, int access) { void * addr; #ifdef MAP_ANONYMOUS addr = mmap (optional_address, size, access ? (PROT_READ | PROT_WRITE) : PROT_NONE, (MAP_PRIVATE | MAP_ANONYMOUS | (optional_address ? MAP_FIXED : MAP_VARIABLE) | (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0); #else /* not defined MAP_ANONYMOUS */ int save_errno, zero_fd = open ("/dev/zero", O_RDONLY); if (zero_fd == -1) return MAP_FAILED; addr = mmap (optional_address, size, access ? (PROT_READ | PROT_WRITE) : PROT_NONE, (MAP_PRIVATE | MAP_FILE | (optional_address ? MAP_FIXED : MAP_VARIABLE) | (access ? 
0 : MAP_NORESERVE)), zero_fd, (off_t) 0); save_errno = errno; close (zero_fd); errno = save_errno; #endif /* not defined MAP_ANONYMOUS */ return addr; } /* Set up a page alias mapping using mmap() on POSIX shared memory or on a temporary regular file. Returns the mapped base address on success. Otherwise, 0 is returned and `errno' is set. */ static void * page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file) { void * base_addr, * addr; int fd, i, save_errno; struct stat st; fd = open_shared_memory_file (use_tmp_file); if (fd == -1) goto fail; /* First, resize the shared memory file to the desired size. */ if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size) goto close_fail; /* Map an anonymous region `separation + size' bytes long. This is how we allocate sufficient contiguous address space. We over-map this with the aliased buffer. */ if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED) goto close_fail; /* Map the same shared memory repeatedly, at different addresses. */ for (i = 0; i < 2; i++) { addr = mmap ((char *) base_addr + (i ? separation : 0), size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED, fd, (off_t) 0); if (addr == MAP_FAILED) goto unmap_fail; if (addr != (char *) base_addr + (i ? separation : 0)) { /* `mmap' ignored MAP_FIXED! Should never happen. */ munmap (addr, size); save_errno = EINVAL; goto unmap_fail_se; } } if (close (fd) != 0) goto unmap_fail; /* Success! */ return base_addr; /* Failure. */ unmap_fail: save_errno = errno; unmap_fail_se: munmap (base_addr, separation + size); errno = save_errno; close_fail: save_errno = errno; close (fd); errno = save_errno; fail: return 0; } /* Set up a page alias mapping using SYSV IPC shared memory. Returns the mapped base address on success. Otherwise, 0 is returned and `errno' is set. 
*/ #if HAVE_SYSV_SHM static void * page_alias_using_sysv_shm (size_t size, size_t separation) { void * base_addr, * addr; sigset_t save_signals; int shmid, i, save_errno; /* Map an anonymous region `separation + size' bytes long. This is how we allocate sufficient contiguous address space. We over-map this with the aliased buffer. */ if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED) goto fail; /* Block signals between the shmget() and IPC_RMID, to minimise the chance of accidentally leaving an unwanted shared segment around. */ block_signals (&save_signals); shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600); if (shmid == -1) goto unmap_fail; /* Map the same shared memory repeatedly, at different addresses. */ for (i = 0; i < 2; i++) { /* `shmat' is tried twice. The first time it can fail if the local implementation of `shmat' refuses to map over a region mapped with `mmap'. In that case, we punch a hole using `munmap' and do it again. If the local `shmat' has this property, the `shmat' calls to fixed addresses might collide with a concurrent thread which is also doing mappings, and will fail. At least it is a safe failure. On the other hand, if the local `shmat' can map over already-mapped regions (in the same way that `mmap' does), it is essential that we do actually use an already-mapped region, so that collisions with a concurrent thread can't possibly result in both of us grabbing the same address range with no indication of error. */ addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0); if (addr == (void *) -1 && errno == EINVAL) { munmap ((char *) base_addr + (i ? separation : 0), size); addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0); } /* Check for errors. */ if (addr == (void *) -1) { save_errno = errno; if (i == 1) shmdt (base_addr); goto remove_shm_fail_se; } else if (addr != (char *) base_addr + (i ? separation : 0)) { /* `shmat' ignored the requested address! 
*/ if (i == 1) shmdt (base_addr); save_errno = EINVAL; goto remove_shm_fail_se; } } if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0) goto remove_shm_fail; unblock_signals (&save_signals); /* Success! */ return base_addr; /* Failure. */ remove_shm_fail: save_errno = errno; remove_shm_fail_se: while (--i >= 0) shmdt ((char *) base_addr + (i ? separation : 0)); shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0); errno = save_errno; unmap_fail: save_errno = errno; unblock_signals (&save_signals); munmap (base_addr, separation + size); errno = save_errno; fail: return 0; } #endif /* HAVE_SYSV_SHM */ /* Map a page-aliased ring buffer. Shared memory of size `size' is mapped twice, with the difference between the two addresses being `separation', which must be at least `size'. The total address range used is `separation + size' bytes long. On success, *METHOD is filled with a number which must be passed to `__page_alias_unmap', and the mapped base address is returned. Otherwise, 0 is returned and `errno' is set. */ static void * __page_alias_map (size_t size, size_t separation, int * method) { void * addr; if (((size | separation) & (system_page_size - 1)) != 0 || size > separation) { errno = EINVAL; return 0; } /* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */ #ifdef SHM_DIR_PREFIX *method = 0; if ((addr = page_alias_using_mmap (size, separation, 0)) != 0) return addr; #endif #if HAVE_SYSV_SHM *method = 1; if ((addr = page_alias_using_sysv_shm (size, separation)) != 0) return addr; #endif *method = 2; return page_alias_using_mmap (size, separation, 1); } /* Unmap a page-aliased ring buffer previously allocated by `__page_alias_map'. `address' is the base address, and `size' and `separation' are the arguments previously passed to `__page_alias_map'. `method' is the value previously stored in *METHOD. Returns 0 on success. Otherwise, -1 is returned and `errno' is set. 
*/ static int __page_alias_unmap (void * address, size_t size, size_t separation, int method) { #if HAVE_SYSV_SHM if (method == 1) { shmdt (address); shmdt ((char *) address + separation); if (separation > size) munmap ((char *) address + size, separation - size); return 0; } #endif return munmap (address, separation + size); } /* Map a page-aliased ring buffer. `size' is the size of the buffer to create; it will be mapped twice to cover a total address range `size * 2' bytes long. On success, *METHOD is filled with a number which must be passed to `page_alias_unmap', and the mapped base address is returned. Otherwise, 0 is returned and `errno' is set. */ void * page_alias_map (size_t size, int * method) { return __page_alias_map (size, size, method); } /* Unmap a page-aliased ring buffer previously allocated by `page_alias_map'. `address' is the base address, and `size' is the size of the buffer (which is half of the total mapped address range). `method' is a value previously stored in *METHOD by `page_alias_map'. Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */ int page_alias_unmap (void * address, size_t size, int method) { return __page_alias_unmap (address, size, size, method); } /* Map some memory which is not aliased, for timing comparisons against aliased pages. We use a combination of mappings similar to page_alias_*(), in case there are resource limitations which would prevent malloc() or a single mmap() working for the larger address range tests. */ static void * page_no_alias (size_t size, size_t separation) { void * base_addr, * addr; int i, save_errno; if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED) goto fail; /* Map anonymous memory at the different addresses. */ for (i = 0; i < 2; i++) { addr = map_address_space ((char *) base_addr + (i ? separation : 0), size, 1); if (addr == MAP_FAILED) goto unmap_fail; if (addr != (char *) base_addr + (i ? separation : 0)) { /* `mmap' ignored MAP_FIXED! Should never happen. 
*/ munmap (addr, size); save_errno = EINVAL; goto unmap_fail_se; } } /* Success! */ return base_addr; /* Failure. */ unmap_fail: save_errno = errno; unmap_fail_se: munmap (base_addr, separation + size); errno = save_errno; fail: return 0; } /* This should be a word size that the architecture can read and write fast in a single instruction. In principle, C's `int' is the natural word size, but in practice it isn't on 64-bit machines. */ #define WORD long /* These GCC-specific asm statements force values into registers, and also act as compiler memory barriers. These are used to force a group of write/write/read instructions as close together as possible, to maximise the detection of store buffer conditions. Despite being asm statements, these will work with any of GCC's target architectures, provided they have >= 4 registers. */ #if __GNUC__ >= 3 #define __noinline __attribute__ ((__noinline__)) #else #define __noinline #endif #ifdef __GNUC__ #define force_into_register(var) \ __asm__ ("" : "=r" (var) : "0" (var) : "memory") #define force_into_registers(var1, var2, var3, var4) \ __asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \ : "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory") #else #define force_into_register(var) do {} while (0) #define force_into_registers(var1, var2, var3, var4) do {} while (0) #endif /* This function tries to test whether a CPU snoops its store buffer for reads within a few instructions, and ignores virtual to physical address translations when doing that. In principle a CPU might do this even if its L1 cache is physically tagged or indexed, although I have not seen such a system. (A CPU which uses store buffer snooping and with an off-board MMU, which the CPU is unaware of, could have this property). It isn't possible to do this test perfectly; we do our best. The `force_into_register' macros ensure that the write/write/read sequence is as compact as the compiler can make it. 
*/ static WORD __noinline test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2) { register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2; register WORD __reg1 = 1, __reg2 = 0; force_into_registers (__reg1, __reg2, __regptr1, __regptr2); *__regptr1 = __reg1; *__regptr2 = __reg2; __reg1 = *__regptr1; force_into_register (__reg1); return __reg1; } /* This function tests whether writes to one page are seen in another page at a different virtual address, and whether they are nearly as fast as normal writes. The accesses are timed by the caller of this function. Alternate writes go to alternate pages, so that if aliasing is implemented using page faults, it will clearly show up in the timings. */ static int __noinline test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops) { WORD fail = 0; while (--timing_loops >= 0) fail |= test_store_buffer_snoop (ptr1, ptr2); return fail != 0; } /* This function tests L1 cache coherency without checking for store buffer snoop coherency. To do this, we add enough stores that the writes to *PTR1 are flushed (or drain due to the time delay) from the store buffer before we read from *PTR1. The result of this function is not important: it is only used in a diagnostic message. */ static int __noinline test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2) { int i, j; WORD fail = 0; for (i = 0; i < 10; i++) { *ptr1 = 1; /* This loop of volatile writes creates a short time delay. The delay gives the store to *PTR1 time to flush from the store buffer and/or the many writes flush the store buffer. The loop writes to *PTR2 because if we pick another fixed address and write to it, that would be testing 3 cache lines (PTR1, PTR2 and the fixed address) and the fixed address _might_ happen to collide with PTR1 or PTR2 in the L1 cache. If the L1 cache is 2-way set-associative, that would flush it every time, possibly making it appear coherent when it isn't. 
*/ for (j = 0; j < 1000; j++) *ptr2 = 0; fail |= *ptr1; } return fail != 0; } /* Thoroughly test a pair of aliased pages with a fixed address separation, to see if they really behave like memory appearing at two locations, and efficiently. We search through different values of `separation' searching for a suitable "cache colour" on this machine. */ static inline const char * test_one_separation (size_t separation) { void * buffers [2]; long timings [3]; int i, method, timing_loops = 64; /* We measure timings of 3 different tests, each 128 times to find the minimum. 0: Writes and reads to aliased pages. 1: Writes and reads to non-aliased pages, to compare with 0. 2: Doing nothing, to measure the time for `gettimeofday' itself. The measurements are done in a mixed up order. If we did 128 measurements of type 0, then 128 of type 1, then 128 of type 2, the results could be misleading due to synchronisation with other processes occurring on the machine. */ /* A previously generated random shuffle of bit-pairs. Each pair is a number from the set {0,1,2}. Each number occurs exactly 128 times. 
*/ static const unsigned char pattern [96] = { 0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56, 0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49, 0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99, 0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25, 0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19, 0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15, 0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89, 0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85, }; buffers [0] = __page_alias_map (system_page_size, separation, &method); if (buffers [0] == 0) return "alias map failed"; buffers [1] = page_no_alias (system_page_size, separation); if (buffers [1] == 0) { __page_alias_unmap (buffers [0], system_page_size, separation, method); return "non-alias map failed"; } retry: timings [2] = timings [1] = timings [0] = LONG_MAX; for (i = 0; i < 384; i++) { struct timeval time_before, time_after; long time_delta; int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3; /* Test 2 touches no memory; avoid indexing past the end of `buffers'. */ volatile WORD * ptr1 = (volatile WORD *) buffers [which_test < 2 ? which_test : 0]; volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation); /* Test whether writes to one page appear immediately in the other, and time how long the memory accesses take. */ gettimeofday (&time_before, (struct timezone *) 0); if (which_test < 2) fail = test_page_alias (ptr1, ptr2, timing_loops); gettimeofday (&time_after, (struct timezone *) 0); if (fail && which_test == 0) { /* Test whether the failure is due to a store buffer bypass which ignores virtual address translation. */ int l1_fail = test_l1_only (ptr1, ptr2); __page_alias_unmap (buffers [0], system_page_size, separation, method); munmap (buffers [1], separation + system_page_size); return l1_fail ? 
"cache not coherent" : "store buffer not coherent"; } time_delta = ((time_after.tv_usec - time_before.tv_usec) + 1000000 * (time_after.tv_sec - time_before.tv_sec)); /* Find the smallest time taken for each test. Ignore negative glitches due to Linux's tendency to jump the clock backwards. */ if (time_delta >= 0 && time_delta < timings [which_test]) timings [which_test] = time_delta; } /* Remove the cost of `gettimeofday()' itself from measurements. */ timings [0] -= timings [2]; timings [1] -= timings [2]; /* Keep looping until at least one measurement becomes significant. A very fast CPU will show measurements of zero microseconds for smaller values of `timing_loops'. Also loop until the cost of `gettimeofday()' becomes insignificant. When the program is run under `strace', the latter is large, and this is needed to stabilise the results. */ if (timings [0] <= 10 * (1 + timings [2]) && timings [1] <= 10 * (1 + timings [2])) { timing_loops <<= 1; goto retry; } __page_alias_unmap (buffers [0], system_page_size, separation, method); munmap (buffers [1], separation + system_page_size); printf ("(%d) [%ld,%ld,%ld] ", timing_loops, timings [0], timings [1], timings [2]); /* Reject page aliasing if it is much slower than accessing a single, definitely cached page directly. */ if (timings [0] > 2 * timings [1]) return "too slow"; /* Success! Passed all tests for these parameters. */ return 0; } size_t page_alias_smallest_size; void page_alias_init (void) { size_t size; #ifdef _SC_PAGESIZE system_page_size = sysconf (_SC_PAGESIZE); #elif defined (_SC_PAGE_SIZE) system_page_size = sysconf (_SC_PAGE_SIZE); #else system_page_size = getpagesize (); #endif for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2) { const char * reason = test_one_separation (size); printf ("Test separation: %lu bytes: %s%s\n", (unsigned long) size, reason ? "FAIL - " : "pass", reason ? 
reason : ""); /* This logic searches for the smallest _contiguous_ range of page sizes for which `page_alias_test' passes. */ if (reason == 0 && page_alias_smallest_size == 0) page_alias_smallest_size = size; else if (reason != 0 && page_alias_smallest_size != 0) { /* Fail, indicating that page-aliasing is not reliable, because there's a maximum size. We don't support that as it seems quite unlikely given our model of cache colouring. */ page_alias_smallest_size = 0; break; } } printf ("VM page alias coherency test: "); if (page_alias_smallest_size == 0) printf ("failed; will use copy buffers instead\n"); else if (page_alias_smallest_size == system_page_size) printf ("all sizes passed\n"); else printf ("minimum fast spacing: %lu (%lu page%s)\n", (unsigned long) page_alias_smallest_size, (unsigned long) (page_alias_smallest_size / system_page_size), (page_alias_smallest_size == system_page_size) ? "" : "s"); } //#ifdef TEST_PAGEALIAS int main () { page_alias_init (); return 0; } //#endif ^ permalink raw reply [flat|nested] 106+ messages in thread
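The core trick in the program above (reserve address space, then map the same pages twice with MAP_FIXED) can be shown in a much smaller form. The following is only a sketch under some assumptions: a system with MAP_ANONYMOUS and a writable /tmp, and a `size' that is a multiple of the page size. The names `ring_map_sketch' and `ring_write_sketch' are made up for illustration and do not appear in the posted code. Unlike the original, it does not block signals, does not try shm_open() or SysV shm, and does no coherency or timing checks.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS and mkstemp on glibc */
#endif
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical helper (not part of the posted program): map `size' bytes
   of an unlinked temporary file twice, back to back.  Returns NULL on
   failure. */
static char *ring_map_sketch(size_t size)
{
    char path[] = "/tmp/ring-XXXXXX";
    char *base;
    int fd, i;

    fd = mkstemp(path);
    if (fd == -1)
        return NULL;
    unlink(path);                       /* the file lives on via `fd' only */
    if (ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        return NULL;
    }

    /* Reserve 2 * size bytes of contiguous address space first. */
    base = mmap(NULL, 2 * size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Over-map both halves with the same file pages. */
    for (i = 0; i < 2; i++) {
        if (mmap(base + i * size, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
            munmap(base, 2 * size);
            close(fd);
            return NULL;
        }
    }
    close(fd);
    return base;
}

/* Copy `len' bytes (len <= size) into the ring at offset `pos' with one
   memcpy; bytes that run past the end land at the start via the alias. */
static void ring_write_sketch(char *ring, size_t size, size_t pos,
                              const void *src, size_t len)
{
    memcpy(ring + pos % size, src, len);
}
```

On a machine where the aliases are coherent, a write that straddles the end of the buffer shows up at the beginning with no explicit wrap-around logic. Whether that is safe and fast on a given CPU is exactly what the posted test program tries to find out. The whole region is released with a single munmap() of `2 * size' bytes.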
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 10:12 ` Jamie Lokier @ 2003-09-01 11:30 ` Geert Uytterhoeven 2003-09-01 14:17 ` Russell King 2003-09-04 17:37 ` Maciej W. Rozycki 2 siblings, 0 replies; 106+ messages in thread From: Geert Uytterhoeven @ 2003-09-01 11:30 UTC (permalink / raw) To: Jamie Lokier Cc: Russell King, Paul J.Y. Lahaie, Linux Kernel Development, Linux/m68k On Mon, 1 Sep 2003, Jamie Lokier wrote: > There is a bug in test_l1_only which I just noticed. It's unlikely, > but if `dummy' happens to have the same L1 cache address as both words > being tested, and it's a 2-way (or less) set-associative cache, then > it will inadvertently flush the cache and say "store buffer not > coherent" when it means to say "cache not coherent". > > Please try the program below, which is the same as before but with > test_l1_only hopefully improved, and it prints some more helpful > numbers. Results for 68040 with the new version: cassandra:/tmp# time ./test2 Test separation: 4096 bytes: FAIL - store buffer not coherent Test separation: 8192 bytes: FAIL - store buffer not coherent Test separation: 16384 bytes: FAIL - store buffer not coherent Test separation: 32768 bytes: FAIL - store buffer not coherent Test separation: 65536 bytes: FAIL - store buffer not coherent Test separation: 131072 bytes: FAIL - store buffer not coherent Test separation: 262144 bytes: FAIL - store buffer not coherent Test separation: 524288 bytes: FAIL - store buffer not coherent Test separation: 1048576 bytes: FAIL - store buffer not coherent Test separation: 2097152 bytes: FAIL - store buffer not coherent Test separation: 4194304 bytes: FAIL - store buffer not coherent Test separation: 8388608 bytes: FAIL - store buffer not coherent Test separation: 16777216 bytes: FAIL - store buffer not coherent VM page alias coherency test: failed; will use copy buffers instead real 0m0.454s user 0m0.090s sys 0m0.210s cassandra:/tmp# cat /proc/cpuinfo CPU: 68040 MMU: 68040 
FPU: 68040 Clocking: 24.8MHz BogoMips: 16.53 Calibration: 82688 loops cassandra:/tmp# New m68k binary at http://home.tvd.be/cr26864/Linux/m68k/jamie_test2.gz Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 10:12 ` Jamie Lokier 2003-09-01 11:30 ` Geert Uytterhoeven @ 2003-09-01 14:17 ` Russell King 2003-09-01 14:51 ` Russell King 2003-09-01 16:52 ` Jamie Lokier 2003-09-04 17:37 ` Maciej W. Rozycki 2 siblings, 2 replies; 106+ messages in thread From: Russell King @ 2003-09-01 14:17 UTC (permalink / raw) To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel On Mon, Sep 01, 2003 at 11:12:24AM +0100, Jamie Lokier wrote: > Russell King wrote: > > This looks like an old kernel on your NetWinder. Later 2.4 kernels > > should get this right (by marking the pages uncacheable in user space.) > > How do they know which pages to mark uncacheable? Surely not all > MAP_SHARED|MAP_FIXED mappings are uncacheable? By looking at the mappings present in the process. If a process maps the same file using MAP_SHARED _and_ we fault the same page of data into two or more mappings, we turn off the cache for those pages. We actually only turn off the cache and leave the write buffer (aka your store buffer) turned on for these regions, which should be sufficient for it to remain coherent between different virtual addresses. I've been doing some further investigation, and I'm now of the opinion that "SA110" StrongARM chips have buggy write buffers, because: - if I turn off the cache, leaving the write buffer on, this program works on StrongARM-1110 CPUs but not some StrongARM-110 CPUs. - if I turn off the cache and write buffer on these twice-mapped pages, StrongARM-110 behaves as expected. I've tested on several silicon revisions of StrongARM-110's: - H appears buggy (reports as rev. 2) - K appears fine (reports as rev. 2) - S appears buggy (reports as rev. 3) Unfortunately, the written documentation makes zero mention of the exact write buffer behaviour. 
The best that I have to go on for the StrongARM-110 is a block diagram which indicates that the write buffer uses physical addresses, and that the D-cache contains the physical address which the line was fetched from for writeback (via the write buffer.) So it seems your test program finds problems which DaveM's aliastest program fails to detect... Gah. ;( I guess it's time to devise a kernel test and alter our behaviour on ARM accordingly. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 14:17 ` Russell King @ 2003-09-01 14:51 ` Russell King 2003-09-01 19:09 ` Guennadi Liakhovetski 2003-09-01 16:52 ` Jamie Lokier 1 sibling, 1 reply; 106+ messages in thread From: Russell King @ 2003-09-01 14:51 UTC (permalink / raw) To: linux-kernel; +Cc: Jamie Lokier, Paul J.Y. Lahaie Ok, here's the results for a SA1110 machine (ie, with non-broken write buffer): Linux assabet2 2.6.0-test4 #1313 Thu Aug 28 21:05:05 BST 2003 armv4l unknown Processor : StrongARM-1110 rev 8 (v4l) BogoMIPS : 147.04 Features : swp half 26bit fastmult CPU implementer : 0x69 CPU architecture: 4 CPU variant : 0x0 CPU part : 0xb11 CPU revision : 8 Hardware : Intel-Assabet Revision : 0000 Serial : 0000000000000000 (64) [21,6,1] Test separation: 4096 bytes: FAIL - too slow (64) [21,6,1] Test separation: 8192 bytes: FAIL - too slow (64) [21,6,1] Test separation: 16384 bytes: FAIL - too slow (64) [21,6,1] Test separation: 32768 bytes: FAIL - too slow (64) [21,6,1] Test separation: 65536 bytes: FAIL - too slow (64) [21,6,1] Test separation: 131072 bytes: FAIL - too slow (64) [21,6,1] Test separation: 262144 bytes: FAIL - too slow (64) [21,6,1] Test separation: 524288 bytes: FAIL - too slow (64) [21,7,1] Test separation: 1048576 bytes: FAIL - too slow (64) [21,7,1] Test separation: 2097152 bytes: FAIL - too slow (64) [21,6,1] Test separation: 4194304 bytes: FAIL - too slow (64) [21,6,1] Test separation: 8388608 bytes: FAIL - too slow (64) [21,7,1] Test separation: 16777216 bytes: FAIL - too slow VM page alias coherency test: failed; will use copy buffers instead -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 14:51 ` Russell King @ 2003-09-01 19:09 ` Guennadi Liakhovetski 0 siblings, 0 replies; 106+ messages in thread From: Guennadi Liakhovetski @ 2003-09-01 19:09 UTC (permalink / raw) To: Russell King; +Cc: linux-kernel, Jamie Lokier, Paul J.Y. Lahaie On Processor : Intel XScale-PXA250 rev 3 (v5l) BogoMIPS : 397.31 Features : swp half thumb fastmult edsp CPU implementor : 0x69 CPU architecture: 5TE CPU variant : 0x0 CPU part : 0x290 CPU revision : 3 Cache type : undefined 5 Cache clean : undefined 5 Cache lockdown : undefined 5 Cache unified : Harvard I size : 32768 I assoc : 32 I line length : 32 I sets : 32 D size : 32768 D assoc : 32 D line length : 32 D sets : 32 and Processor : StrongARM-1100 rev 9 (v4l) BogoMIPS : 127.38 Features : swp half 26bit fastmult version 3 of the test consistently reports "Too slow". Guennadi --- Guennadi Liakhovetski ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 14:17 ` Russell King 2003-09-01 14:51 ` Russell King @ 2003-09-01 16:52 ` Jamie Lokier 2003-09-01 17:11 ` Russell King 1 sibling, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 16:52 UTC (permalink / raw) To: Paul J.Y. Lahaie, linux-kernel Russell King wrote: > On Mon, Sep 01, 2003 at 11:12:24AM +0100, Jamie Lokier wrote: > > Russell King wrote: > > > This looks like an old kernel on your NetWinder. Later 2.4 kernels > > > should get this right (by marking the pages uncacheable in user space.) > > > > How do they know which pages to mark uncacheable? Surely not all > > MAP_SHARED|MAP_FIXED mappings are uncacheable? > > By looking at the mappings present in the process. If a process maps the > same file using MAP_SHARED _and_ we fault the same page of data into two > or more mappings, we turn off the cache for those pages. 1. That's not necessary when the virtual addresses are separated by some multiple, is it? 2. The other architectures with incoherent caches set SHMLBA to the multiple, and they don't do anything special in update_mmu_cache(), so MAP_FIXED can create incoherent mappings. Is there any special reason why ARM is different? > I've tested on several silicon revisions of StrongARM-110's: > - H appears buggy (reports as rev. 2) > - K appears fine (reports as rev. 2) > - S appears buggy (reports as rev. 3) It's possible that all of them are buggy, but the write buffer test doesn't manage to get writes into the buffer with the exact timing needed to trigger it. Unfortunately, while the write buffer test does pretty much guarantee a store/store/load instruction sequence, because it's generic it can't guarantee how those are executed in a superscalar or out of order pipeline. > So it seems your test program finds problems which DaveM's aliastest > program fails to detect... Gah. 
;( Well, it's good to know it was useful :/ Thanks, -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
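The "separated by some multiple" idea in the message above amounts to a congruence check on the two virtual addresses. A minimal sketch follows, assuming the alignment value is a power of two (as SHMLBA is on the platforms I am aware of); the function name `same_cache_colour' is made up for illustration. It says nothing about whether such a multiple exists on a given CPU; that depends on how the cache is indexed and tagged.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if `a' and `b' are congruent modulo `lba'.
   `lba' is assumed to be a power of two, e.g. the platform's SHMLBA. */
static int same_cache_colour(const void *a, const void *b, size_t lba)
{
    return (((uintptr_t) a ^ (uintptr_t) b) & ((uintptr_t) lba - 1)) == 0;
}
```

For example, with a 16 kB alignment, addresses 0x10000 and 0x20000 are the same colour, while 0x10000 and 0x11000 are not.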
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 16:52 ` Jamie Lokier @ 2003-09-01 17:11 ` Russell King 2003-09-02 5:34 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Russell King @ 2003-09-01 17:11 UTC (permalink / raw) To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel On Mon, Sep 01, 2003 at 05:52:39PM +0100, Jamie Lokier wrote: > Russell King wrote: > > By looking at the mappings present in the process. If a process maps the > > same file using MAP_SHARED _and_ we fault the same page of data into two > > or more mappings, we turn off the cache for those pages. > > 1. That's not necessary when the virtual addresses are separated > by some multiple, is it? Incorrect - with a VIVT, you have alias hell. There is no multiple which makes it safe. > > I've tested on several silicon revisions of StrongARM-110's: > > - H appears buggy (reports as rev. 2) > > - K appears fine (reports as rev. 2) > > - S appears buggy (reports as rev. 3) > > It's possible that all of them are buggy, but the write buffer test > doesn't manage to get writes into the buffer with the exact timing > needed to trigger it. Well, I've just generated a kernel test which does more or less the same thing (write to one mapping, write to other, read from first.) This indicates the same result. If you take a moment to think about what should be going on - - first write gets translated to physical address, and the address with the data is placed in the write buffer. - second write gets translated to the same physical address, and the address and data is placed into the write buffer such that we store the first write then the second write to the same physical memory. - reading from the first mapping should return the second writes value no matter what. But it doesn't in some cases. 
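The three steps above can be sketched in plain C. This is only an illustration of the access pattern, not the kernel test mentioned in this message; the name `store_store_load' is made up, and plain `volatile' accesses give no control over how closely the compiler packs the three instructions (the posted program uses GCC asm barriers for that).

```c
/* Hypothetical helper: write 1 through `a', write 0 through `b', then
   read back through `a'.  If `a' and `b' refer to the same physical
   word and the hardware is coherent, the result is 0. */
static long store_store_load(volatile long *a, volatile long *b)
{
    *a = 1;      /* first write, via the first mapping */
    *b = 0;      /* second write, via the second mapping */
    return *a;   /* should observe the second write */
}
```

When both pointers are aliases of one word, a nonzero result means the read saw the stale first write. When they point at unrelated words, the function returns 1, which is the expected non-aliased behaviour.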
> Unfortunately, while the write buffer test does > pretty much guarantee a store/store/load instruction sequence, because > it's generic it can't guarantee how those are executed in a > superscalar or out of order pipeline. ARM doesn't do any of those tricks. > > So it seems your test program finds problems which DaveM's aliastest > > program fails to detect... Gah. ;( > > Well, it's good to know it was useful :/ Well, we now have a kernel test to detect the problem, which alters our behaviour appropriately. Thanks. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 17:11 ` Russell King @ 2003-09-02 5:34 ` Jamie Lokier 2003-09-02 8:15 ` Russell King 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-02 5:34 UTC (permalink / raw) To: Paul J.Y. Lahaie, linux-kernel Russell King wrote: > > 1. That's not necessary when the virtual addresses are separated > > by some multiple, is it? > > Incorrect - with a VIVT, you have alias hell. There is no multiple > which makes it safe. Ok. I guess I was thinking of VIPT, but by now I am just guessing :) > > > I've tested on several silicon revisions of StrongARM-110's: > > > - H appears buggy (reports as rev. 2) > > > - K appears fine (reports as rev. 2) > > > - S appears buggy (reports as rev. 3) > > > > It's possible that all of them are buggy, but the write buffer test > > doesn't manage to get writes into the buffer with the exact timing > > needed to trigger it. > > Well, I've just generated a kernel test which does more or less the > same thing (write to one mapping, write to other, read from first.) > This indicates the same result. > > If you take a moment to think about what should be going on - > > - first write gets translated to physical address, and the address with > the data is placed in the write buffer. > - second write gets translated to the same physical address, and the > address and data is placed into the write buffer such that we store > the first write then the second write to the same physical memory. > - reading from the first mapping should return the second writes value > no matter what. That is an incomplete explanation, because it should never be possible for reads to access data from the write buffer which isn't the most recent. That would break ordinary programs which don't have alias mappings. 
> > Unfortunately, while the write buffer test does > > pretty much guarantee a store/store/load instruction sequence, because > > it's generic it can't guarantee how those are executed in a > > superscalar or out of order pipeline. > > ARM doesn't do any of those tricks. Don't some of the ARMs execute two instructions concurrently, like the original Pentium? The simple test is only valid if a store/store/load sequence is guaranteed to pass through the buggy part of the pipeline in exactly the same way, no matter which programs it appears in. > > > So it seems your test program finds problems which DaveM's aliastest > > > program fails to detect... Gah. ;( > > > > Well, it's good to know it was useful :/ > > Well, we now have a kernel test to detect the problem, which alters our > behaviour appropriately. Thanks. Fwiw, PA-RISC shows a similar problem. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 5:34 ` Jamie Lokier @ 2003-09-02 8:15 ` Russell King 2003-09-02 11:57 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Russell King @ 2003-09-02 8:15 UTC (permalink / raw) To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel On Tue, Sep 02, 2003 at 06:34:15AM +0100, Jamie Lokier wrote: > Russell King wrote: > > If you take a moment to think about what should be going on - > > > > - first write gets translated to physical address, and the address with > > the data is placed in the write buffer. > > - second write gets translated to the same physical address, and the > > address and data is placed into the write buffer such that we store > > the first write then the second write to the same physical memory. > > - reading from the first mapping should return the second writes value > > no matter what. > > That is an incomplete explanation, because it should never be possible > for reads to access data from the write buffer which isn't the most > recent. Umm, that's what I said. > > ARM doesn't do any of those tricks. > > Don't some of the ARMs execute two instructions concurrently, like > the original Pentium? Nope - they're all single issue CPUs, and, if non-buggy, they guarantee that stores never bypass loads. (In a later architecture revision, this is controllable.) Remember - ARM CPUs aren't a high spec desktop CPU. They're an embedded CPU where power consumption matters. Superscalar/multiple issue/high performance isn't viable in many such embedded environments. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 8:15 ` Russell King @ 2003-09-02 11:57 ` Jamie Lokier 2003-09-02 18:52 ` Russell King 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-02 11:57 UTC (permalink / raw) To: Paul J.Y. Lahaie, linux-kernel Russell King wrote: > > > If you take a moment to think about what should be going on - > > > > > > - first write gets translated to physical address, and the address with > > > the data is placed in the write buffer. > > > - second write gets translated to the same physical address, and the > > > address and data is placed into the write buffer such that we store > > > the first write then the second write to the same physical memory. > > > - reading from the first mapping should return the second writes value > > > no matter what. > > > > That is an incomplete explanation, because it should never be possible > > for reads to access data from the write buffer which isn't the most > > recent. > > Umm, that's what I said. You say that "reading from the first mapping _should_ return the second write value no matter what", but that there's a bug in the write buffer and it isn't doing that. I'm saying that the bug can't be that, because such a bug would affect normal applications. > > Don't some of the ARMs execute two instructions concurrently, like > > the original Pentium? > > Nope - they're all single issue CPUs, and, if non-buggy, they guarantee > that stores never bypass loads. (In a later architecture revision, this > is controllable.) > > Remember - ARM CPUs aren't a high spec desktop CPU. They're an embedded > CPU where power consumption matters. Superscalar/multiple issue/high > performance isn't viable in many such embedded environments. Fair enough. I recall someone mentioning a dual issue ARM once upon a time, that's all. 
-- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 11:57 ` Jamie Lokier @ 2003-09-02 18:52 ` Russell King 2003-09-02 23:59 ` Larry McVoy 0 siblings, 1 reply; 106+ messages in thread From: Russell King @ 2003-09-02 18:52 UTC (permalink / raw) To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel On Tue, Sep 02, 2003 at 12:57:31PM +0100, Jamie Lokier wrote: > You say that "reading from the first mapping _should_ return the > second write value no matter what", but that there's a bug in the > write buffer and it isn't doing that. > > I'm saying that the bug can't be that, because such a bug would affect > normal applications. I know of no other explanation which fits with the information I have available to me here. If you'd care to speculate further, you may, but I see further speculation as being rather academic, unless it comes from one of the people who designed the chip. All this is, however, immaterial - the facts are that the write buffer is buggy, this test detects it, and we can take fairly easy measures to ensure we fix it up. Multiple mappings of the same object rarely occur in my experience, so the resulting performance loss caused by working around the cache and writebuffer is something we can live with. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 18:52 ` Russell King @ 2003-09-02 23:59 ` Larry McVoy 2003-09-03 7:31 ` Russell King 0 siblings, 1 reply; 106+ messages in thread From: Larry McVoy @ 2003-09-02 23:59 UTC (permalink / raw) To: Jamie Lokier, Paul J.Y. Lahaie, linux-kernel On Tue, Sep 02, 2003 at 07:52:22PM +0100, Russell King wrote: > Multiple mappings of the same object rarely occur in my experience, so > the resulting performance loss caused by working around the cache and > writebuffer is something we can live with. Multiple *writable* mappings. Don't forget about libc et al. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-02 23:59 ` Larry McVoy @ 2003-09-03 7:31 ` Russell King 2003-09-03 7:41 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Russell King @ 2003-09-03 7:31 UTC (permalink / raw) To: Larry McVoy, Jamie Lokier, Paul J.Y. Lahaie, linux-kernel On Tue, Sep 02, 2003 at 04:59:00PM -0700, Larry McVoy wrote: > On Tue, Sep 02, 2003 at 07:52:22PM +0100, Russell King wrote: > > Multiple mappings of the same object rarely occur in my experience, so > > the resulting performance loss caused by working around the cache and > > writebuffer is something we can live with. > > Multiple *writable* mappings. Don't forget about libc et al. I mean in the same group of threads with the same struct mm, not the whole system. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 7:31 ` Russell King @ 2003-09-03 7:41 ` Jamie Lokier 2003-09-03 18:05 ` Russell King 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-03 7:41 UTC (permalink / raw) To: Larry McVoy, Paul J.Y. Lahaie, linux-kernel Russell King wrote: > > > Multiple mappings of the same object rarely occur in my experience, so > > > the resulting performance loss caused by working around the cache and > > > writebuffer is something we can live with. > > > > Multiple *writable* mappings. Don't forget about libc et al. > > I mean in the same group of threads with the same struct mm, not the whole > system. Larry means that it's perfectly normal for libc to map the same file more than once: you have the code section and the data section. I don't know if ARM's ELF is like the x86, but on the x86 the final partial page of code or read-only data will be mapped twice, as the latter part of the page can contain writable data. This avoids wasting up to a page's worth of bytes in the ELF file. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 7:41 ` Jamie Lokier @ 2003-09-03 18:05 ` Russell King 2003-09-04 22:20 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Russell King @ 2003-09-03 18:05 UTC (permalink / raw) To: Jamie Lokier; +Cc: Larry McVoy, Paul J.Y. Lahaie, linux-kernel On Wed, Sep 03, 2003 at 08:41:34AM +0100, Jamie Lokier wrote: > Russell King wrote: > > > > Multiple mappings of the same object rarely occur in my experience, so > > > > the resulting performance loss caused by working around the cache and > > > > writebuffer is something we can live with. > > > > > > Multiple *writable* mappings. Don't forget about libc et al. > > > > I mean in the same group of threads with the same struct mm, not the whole > > system. > > Larry means that it's perfectly normal for libc to map the same file > more than once: you have the code section and the data section. Code is read-only, data is read-write and is copy on write. Therefore it's a different scenario. Practical tests indicate that the vast majority of applications do not trip the test. You're right in theory, but I don't particularly care about theory when it's real life which matters. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-03 18:05 ` Russell King @ 2003-09-04 22:20 ` Jamie Lokier 0 siblings, 0 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-04 22:20 UTC (permalink / raw) To: Larry McVoy, Paul J.Y. Lahaie, linux-kernel Russell King wrote: > > Larry means that it's perfectly normal for libc to map the same file > > more than once: you have the code section and the data section. > > Code is read-only, data is read-write and is copy on write. Therefore > it's a different scenario. Yes, a thinko on my part :) -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 10:12 ` Jamie Lokier 2003-09-01 11:30 ` Geert Uytterhoeven 2003-09-01 14:17 ` Russell King @ 2003-09-04 17:37 ` Maciej W. Rozycki 2 siblings, 0 replies; 106+ messages in thread From: Maciej W. Rozycki @ 2003-09-04 17:37 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Mon, 1 Sep 2003, Jamie Lokier wrote: > Please try the program below, which is the same as before but with > test_l1_only hopefully improved, and it prints some more helpful > numbers. A few MIPS systems: 1. An R3400-based DECstation 5000/240 -- the CPU has a 64kB I-cache and a 64kB D-cache, both are direct mapped, PIPT: $ uname -a Linux 3maxp 2.4.21 #3 Thu Aug 14 04:14:33 CEST 2003 mips unknown unknown GNU/Linux $ time ./test (256) [155,155,7] Test separation: 4096 bytes: pass (256) [155,155,7] Test separation: 8192 bytes: pass (256) [155,155,7] Test separation: 16384 bytes: pass (256) [155,155,7] Test separation: 32768 bytes: pass (256) [155,155,7] Test separation: 65536 bytes: pass (256) [155,155,7] Test separation: 131072 bytes: pass (256) [155,155,7] Test separation: 262144 bytes: pass (256) [155,155,7] Test separation: 524288 bytes: pass (256) [155,155,7] Test separation: 1048576 bytes: pass (256) [155,155,7] Test separation: 2097152 bytes: pass (256) [155,155,7] Test separation: 4194304 bytes: pass (256) [155,155,7] Test separation: 8388608 bytes: pass (256) [155,155,7] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed 1.01user 0.27system 0:01.33elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (135major+44minor)pagefaults 0swaps $ cat /proc/cpuinfo system type : Digital DECstation 5000/2x0 processor : 0 cpu model : R3000A V3.0 FPU V4.0 BogoMIPS : 39.90 wait instruction : no microsecond timers : no tlb_entries : 64 extra interrupt vector : no hardware watchpoint : no VCED exceptions : not available VCEI exceptions : not available 2. 
An R4400SC-based DECstation 5000/260 -- the CPU has a 16kB primary I-cache and a 16kB primary D-cache, both are direct mapped, VIPT, and a 1024kB secondary joint (I+D) cache, direct mapped, PIPT: $ uname -a Linux 4maxp64 2.4.21 #19 Mon Aug 25 00:16:25 CEST 2003 mips64 unknown unknown GNU/Linux $ time ./test (64) [331,17,3] Test separation: 4096 bytes: FAIL - too slow (64) [331,17,3] Test separation: 8192 bytes: FAIL - too slow (128) [38,63,3] Test separation: 16384 bytes: pass (128) [38,63,3] Test separation: 32768 bytes: pass (128) [38,63,3] Test separation: 65536 bytes: pass (128) [38,63,3] Test separation: 131072 bytes: pass (128) [38,63,3] Test separation: 262144 bytes: pass (128) [38,63,3] Test separation: 524288 bytes: pass (128) [38,63,3] Test separation: 1048576 bytes: pass (128) [38,63,3] Test separation: 2097152 bytes: pass (128) [38,63,3] Test separation: 4194304 bytes: pass (128) [38,63,3] Test separation: 8388608 bytes: pass (128) [38,63,3] Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) 0.34user 0.14system 0:00.53elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (135major+250minor)pagefaults 0swaps $ cat /proc/cpuinfo system type : Digital DECstation 5000/2x0 processor : 0 cpu model : R4400SC V4.0 FPU V0.0 BogoMIPS : 59.86 wait instruction : no microsecond timers : yes tlb_entries : 48 extra interrupt vector : no hardware watchpoint : yes VCED exceptions : 464662 VCEI exceptions : 667534 3. 
A MIPS 5Kc-based Malta -- the CPU has a 16kB I-cache and a 16kB D-cache, both are 4-way set associative, VIPT: $ uname -a Linux malta 2.4.21 #5 Sun Aug 3 21:51:32 CEST 2003 mips unknown unknown GNU/Linux $ time ./test (128) [25,23,1] Test separation: 4096 bytes: pass (128) [25,23,1] Test separation: 8192 bytes: pass (128) [25,23,1] Test separation: 16384 bytes: pass (128) [25,23,1] Test separation: 32768 bytes: pass (256) [49,46,1] Test separation: 65536 bytes: pass (128) [25,23,1] Test separation: 131072 bytes: pass (128) [25,23,1] Test separation: 262144 bytes: pass (256) [49,46,1] Test separation: 524288 bytes: pass (256) [49,46,1] Test separation: 1048576 bytes: pass (256) [49,46,1] Test separation: 2097152 bytes: pass (256) [48,45,2] Test separation: 4194304 bytes: pass (256) [49,46,1] Test separation: 8388608 bytes: pass (128) [25,23,1] Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed 0.22user 0.06system 0:00.30elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (135major+44minor)pagefaults 0swaps $ cat /proc/cpuinfo system type : MIPS Malta processor : 0 cpu model : MIPS 5Kc V0.1 BogoMIPS : 159.74 wait instruction : yes microsecond timers : yes tlb_entries : 32 extra interrupt vector : yes hardware watchpoint : yes VCED exceptions : not available VCEI exceptions : not available The slowdown for the R4400SC processor is surely the result of Virtual Coherency Exceptions (reported in cpuinfo for both primary caches) -- the secondary cache (S-cache) remembers a few bits of the virtual address (VA) and if there is a hit in the S-cache, but the VA bits don't match, an exception is taken to write back and invalidate the old entry from the respective primary cache (P-cache) and reset the VA bits to the new value. Then a reexecution of the faulting instruction does a refill to the P-cache from the S-cache. 
This problem doesn't happen for the two other processors as neither has an S-cache and also the R3400's P-cache is PIPT. We avoid the hit resulting from cache aliasing for MIPS by aligning maps appropriately. Maciej -- + Maciej W. Rozycki, Technical University of Gdansk, Poland + +--------------------------------------------------------------+ + e-mail: macro@ds2.pg.gda.pl, PGP key available + ^ permalink raw reply [flat|nested] 106+ messages in thread
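As a rough illustration of what "aligning maps appropriately" means, the sketch below picks an address that has the same cache colour as an existing mapping, so that both index the same lines of a virtually indexed cache. It is not the MIPS kernel's code: the function names and the `span` parameter are assumptions for the example, with `span` standing for the aliasing distance (16384 bytes would match the "minimum fast spacing" reported for the R4400SC above).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: two virtual addresses have the same cache colour
 * when they agree in all address bits below `span` (a power of two).
 */
int same_colour(uintptr_t a, uintptr_t b, uintptr_t span)
{
    assert(span != 0 && (span & (span - 1)) == 0);
    return ((a ^ b) & (span - 1)) == 0;
}

/*
 * Hypothetical helper: return the lowest address >= hint that has the
 * same colour as `existing`.  Overflow near the top of the address
 * space is not handled in this sketch.
 */
uintptr_t same_colour_at_or_above(uintptr_t hint, uintptr_t existing,
                                  uintptr_t span)
{
    assert(span != 0 && (span & (span - 1)) == 0);

    uintptr_t colour = existing & (span - 1);       /* bits that must match    */
    uintptr_t addr = (hint & ~(span - 1)) | colour; /* same span-sized block   */

    if (addr < hint)
        addr += span;                               /* move to the next block  */
    return addr;
}
```

For example, with an existing mapping at 0x40002000 and a span of 0x4000, a hint of 0x50001000 is rounded up to 0x50002000, and a hint of 0x50003000 to 0x50006000.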
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (17 preceding siblings ...) 2003-08-29 20:26 ` Paul J.Y. Lahaie @ 2003-08-29 22:35 ` Kenneth Johansson 2003-08-29 23:47 ` Kurt Wall ` (3 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Kenneth Johansson @ 2003-08-29 22:35 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, 2003-08-29 at 07:35, Jamie Lokier wrote: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.473s user 0m0.280s sys 0m0.100s >cat /proc/cpuinfo cpu : 405CR clock : 200MHz revision : 1.69 (pvr 4011 0145) bogomips : 199.88 machine : Ericsson ELN 2XX plb bus clock : 100MHz ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (18 preceding siblings ...) 2003-08-29 22:35 ` Kenneth Johansson @ 2003-08-29 23:47 ` Kurt Wall 2003-09-01 0:24 ` Paul Mundt ` (2 subsequent siblings) 22 siblings, 0 replies; 106+ messages in thread From: Kurt Wall @ 2003-08-29 23:47 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel Quoth Jamie Lokier: > Dear All, > > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. [snip] ----- system one --- $ time ./mmap Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.475s user 0m0.250s sys 0m0.020s $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 5 model name : Pentium II (Deschutes) stepping : 2 cpu MHz : 349.200 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : 696.32 ----- ----- system two --- [kwall]$ time ./mmap Test separation: 4096 bytes: pass Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: 
pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: all sizes passed real 0m0.134s user 0m0.120s sys 0m0.010s ]$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 8 model name : Pentium III (Coppermine) stepping : 3 cpu MHz : 801.830 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1599.07 ----- ---- system three ----- $ time ./mmap Test separation: 4096 bytes: FAIL - too slow Test separation: 8192 bytes: FAIL - too slow Test separation: 16384 bytes: FAIL - too slow Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 32768 (8 pages) real 0m0.101s user 0m0.090s sys 0m0.010s root@advent:~# cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 4 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1210.825 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow bogomips : 2418.27 ----- Now, that was interesting. The AMD is my fastest machine... Kurt -- "I have the world's largest collection of seashells. I keep it scattered around the beaches of the world ... Perhaps you've seen it. -- Steven Wright ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (19 preceding siblings ...) 2003-08-29 23:47 ` Kurt Wall @ 2003-09-01 0:24 ` Paul Mundt 2003-09-01 0:37 ` Jamie Lokier 2003-09-01 1:13 ` dean gaudet 2003-09-02 10:08 ` Jan Rychter 22 siblings, 1 reply; 106+ messages in thread From: Paul Mundt @ 2003-09-01 0:24 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 2085 bytes --] On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote: > I'd appreciate if folks would run the program below on various > machines, especially those whose caches aren't automatically coherent > at the hardware level. > sh (VIPT cache): Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: FAIL - cache not coherent Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 16384 (4 pages) $ cat /proc/cpuinfo machine : Sega Dreamcast processor : 0 cpu family : sh4 cpu type : SH7750 cache size : 8K-bytes/16K-bytes bogomips : 199.06 cpu clock : 199.49MHz bus clock : 99.74MHz module clock : 49.87MHz and on sh64 (which is sort of VIPT/VIVT, as it looks at physical tags if there's no match on virtual): Test separation: 4096 bytes: FAIL - cache not coherent Test separation: 8192 bytes: pass Test separation: 16384 bytes: pass Test separation: 32768 bytes: pass Test separation: 65536 bytes: pass Test separation: 131072 bytes: pass Test separation: 262144 bytes: pass Test separation: 524288 bytes: pass Test separation: 1048576 bytes: pass Test separation: 2097152 bytes: pass Test 
separation: 4194304 bytes: pass Test separation: 8388608 bytes: pass Test separation: 16777216 bytes: pass VM page alias coherency test: minimum fast spacing: 8192 (2 pages) -sh-2.05b$ cat /proc/cpuinfo machine : Hitachi Cayman processor : 0 cpu family : SH-5 cpu type : SH5-101 icache size : 32K-bytes dcache size : 32K-bytes itlb entries : 64 dtlb entries : 64 cpu clock : 314.73MHz bus clock : 157.36MHz module clock : 26.22MHz bogomips : 313.75 [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 0:24 ` Paul Mundt @ 2003-09-01 0:37 ` Jamie Lokier 2003-09-01 1:00 ` Paul Mundt 0 siblings, 1 reply; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 0:37 UTC (permalink / raw) To: linux-kernel Paul Mundt wrote: > sh (VIPT cache): > > Test separation: 4096 bytes: FAIL - cache not coherent > Test separation: 8192 bytes: FAIL - cache not coherent > Test separation: 16384 bytes: pass A VIVT cache can do that, but I think a VIPT cache should always be coherent. Do I misunderstand? -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 0:37 ` Jamie Lokier @ 2003-09-01 1:00 ` Paul Mundt 2003-09-01 1:58 ` Jamie Lokier 0 siblings, 1 reply; 106+ messages in thread From: Paul Mundt @ 2003-09-01 1:00 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 515 bytes --] On Mon, Sep 01, 2003 at 01:37:50AM +0100, Jamie Lokier wrote: > > sh (VIPT cache): > > > > Test separation: 4096 bytes: FAIL - cache not coherent > > Test separation: 8192 bytes: FAIL - cache not coherent > > Test separation: 16384 bytes: pass > > A VIVT cache can do that, but I think a VIPT cache should always be coherent. > Do I misunderstand? > There's nothing stating that VIPT == automatic coherency, as is obviously the case for sh, where we are completely VIPT, but are also non coherent. [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-09-01 1:00 ` Paul Mundt @ 2003-09-01 1:58 ` Jamie Lokier 0 siblings, 0 replies; 106+ messages in thread From: Jamie Lokier @ 2003-09-01 1:58 UTC (permalink / raw) To: linux-kernel Paul Mundt wrote: > On Mon, Sep 01, 2003 at 01:37:50AM +0100, Jamie Lokier wrote: > > > sh (VIPT cache): > > > > > > Test separation: 4096 bytes: FAIL - cache not coherent > > > Test separation: 8192 bytes: FAIL - cache not coherent > > > Test separation: 16384 bytes: pass > > > > A VIVT cache can do that, but I think a VIPT cache should always be coherent. > > Do I misunderstand? > > > There's nothing stating that VIPT == automatic coherency, > as is obviously the case for sh, where we are completely VIPT, but > are also non coherent. Ah. A VIPT cache needn't be coherent with itself if it isn't coherent w.r.t. external devices. Thanks. -- Jamie ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this 2003-08-29 5:35 Jamie Lokier ` (20 preceding siblings ...) 2003-09-01 0:24 ` Paul Mundt @ 2003-09-01 1:13 ` dean gaudet 2003-09-01 4:29 ` Jamie Lokier 2003-09-02 10:08 ` Jan Rychter 22 siblings, 1 reply; 106+ messages in thread From: dean gaudet @ 2003-09-01 1:13 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel On Fri, 29 Aug 2003, Jamie Lokier wrote: > I already got a surprise (to me): my Athlon MP is much slower > accessing multiple mappings which are within 32k of each other, than > mappings which are further apart, although it is coherent. The L1 > data cache is 64k. (The explanation is easy: virtually indexed, > physically tagged cache moves data among cache lines, possibly via L2). opteron has 64KiB / 2-way L1 which means 15-bits of indexing... which totally predicts the 32KiB spacing i saw someone else post about. tm8000 also has some virtual aliasing and your test detects it properly... but i'm probably not supposed to say anything about that :) there's a real oddity i found on p4 just yesterday. i was doing some pointer-chasing experiments, and i set up two 8192B shared mappings to the same file, for example: 0x50000000 => /var/tmp/foo offset 0 0x50002000 => /var/tmp/foo offset 0 then i set up a 4 element cycle: 0x50000000 => 0x50001004 => 0x50002008 => 0x5000300c => 0x50000000 when i do this it seems to trip up a p4 badly ... i'm seeing 3000 cycles per load on a 2.4GHz p4, and 300 cycles per load on a 2.4GHz xeon. the crazy thing is that small variations in the experiment (such as longer cycles) make the oddity go away! i've placed my hack here <http://arctic.org/~dean/noah/chase.c>. > This suggests scope for improving x86 kernel performance in the areas > of kmap() and shared library / executable mappings, by good choice of > _virtual_ addresses. This doesn't require a cache colouring > page allocator, so maybe it's a new avenue? 
i was trying to use wli's pgcl patch to test out larger clustering, but it still has some perf problems which i never got enough time to dig into further :) this approach might be better than just colouring. here's what i've found tripping up virtual aliasing on processors which have this "feature": - shared use empty_zero_page trips up virtual aliasing for things like BSS -- especially if the program for some reason doesn't typically have to write before reading. this is pretty easy to fix (there's even an example fix in the mips architecture, i believe R4000 or something) - kernel and user mappings differ in the virtual index bits. this means CoW will trip up virtual aliases amongst other things. i imagine it means network checksum calculation on write(2) data will trip up virtual aliases. this is more of a pain to fix in a way which is nice on SMP. - physical pages change their virtual index bits each alloc/free. mind you overall i'm not sure that i'm seeing any perf loss due to this sort of thing... -dean ^ permalink raw reply [flat|nested] 106+ messages in thread
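The arithmetic behind "64KiB / 2-way L1 which means 15-bits of indexing" can be written out as below. This is a sketch under stated assumptions: it only applies to virtually indexed caches, the function names are made up for the example, and the 4096-byte page size is inferred from the test output in this thread ("16384 (4 pages)").

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes covered by one way of the cache.  Addresses that differ by a
 * multiple of this value select the same set.
 */
size_t alias_span(size_t cache_bytes, unsigned ways)
{
    assert(ways > 0 && cache_bytes % ways == 0);
    return cache_bytes / ways;
}

/* Number of low address bits (line offset plus set index) below the span. */
unsigned index_bits(size_t span)
{
    unsigned bits = 0;

    while (span > 1) {
        span >>= 1;
        bits++;
    }
    return bits;
}

/*
 * Virtual aliasing is only possible when the index uses address bits
 * above the page offset, i.e. when one way is larger than a page.
 */
int can_alias(size_t cache_bytes, unsigned ways, size_t page_bytes)
{
    return alias_span(cache_bytes, ways) > page_bytes;
}
```

With the figures quoted in this thread: 65536 bytes, 2-way gives a 32768-byte span and 15 bits, matching the 32KiB spacing seen on Athlon; a 16384-byte direct mapped cache (R4400SC) gives a 16384-byte span, matching its reported minimum spacing; a 16384-byte 4-way cache (MIPS 5Kc) gives 4096 bytes, one page, which is consistent with all separations passing there.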
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  1:13 ` dean gaudet
@ 2003-09-01  4:29   ` Jamie Lokier
  0 siblings, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  4:29 UTC (permalink / raw)
  To: dean gaudet; +Cc: linux-kernel

dean gaudet wrote:
> On Fri, 29 Aug 2003, Jamie Lokier wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent.  The L1
> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
>
> opteron has 64KiB / 2-way L1 which means 15-bits of indexing... which
> totally predicts the 32KiB spacing i saw someone else post about.

Aha, thanks!  All Athlons are the same with 64KiB L1 and 32KiB
threshold, and K6 is the same but with 16KiB threshold instead.

> there's a real oddity i found on p4 just yesterday.  i was doing some
> pointer-chasing experiments, and i set up two 8192B shared mappings to the
> same file, for example:
>
> 	0x50000000 => /var/tmp/foo offset 0
> 	0x50002000 => /var/tmp/foo offset 0
>
> then i set up a 4 element cycle:
>
> 	0x50000000 => 0x50001004 => 0x50002008 => 0x5000300c => 0x50000000
>
> when i do this it seems to trip up a p4 badly ... i'm seeing 3000 cycles
> per load on a 2.4GHz p4, and 300 cycles per load on a 2.4GHz xeon.  the
> crazy thing is that small variations in the experiment (such as longer
> cycles) make the oddity go away!

I have no idea of the explanation, unless P4 is doing the same as the
Athlon, 3000 cycles is the cost of an L1/L2 miss, and P4 has virtual
aliasing in both L1 and L2.

Hmm.  I would certainly like to detect that if it occurs with typical
instruction streams, otherwise it'll clobber my application's
performance on a P4.

I don't have a P4 to test on, btw.  If you can investigate further
that would be very good.

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread
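The setup dean describes (two adjacent shared mappings of the same file, then a four-element pointer cycle) could be reproduced roughly as below. This is a sketch under stated assumptions, not dean's program: map_twice, build_cycle and chase are hypothetical names, the window address is chosen by the kernel rather than fixed at 0x50000000, the page size is assumed to be 4096, and each hop is offset by sizeof(void *) (4 bytes on the 32-bit machines in the thread, 8 on a 64-bit build). No timing is done here; a real measurement would wrap chase() in a cycle counter.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN  8192   /* size of each mapping, as in the message */
#define PAGE_LEN 4096   /* assumed page size; hops are one page apart */

/* Map the first MAP_LEN bytes of fd twice, back to back.  Returns the base
 * of the 2 * MAP_LEN window, or NULL on failure. */
static char *map_twice(int fd)
{
    /* Reserve one contiguous window first, so the MAP_FIXED calls below
     * only replace address space that this function owns. */
    char *base = mmap(NULL, 2 * MAP_LEN, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap (reserve)");
        return NULL;
    }
    for (int i = 0; i < 2; i++) {
        void *p = mmap(base + i * MAP_LEN, MAP_LEN, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap (file)");
            munmap(base, 2 * MAP_LEN);
            return NULL;
        }
    }
    return base;
}

/* Store the cycle base+0 => +0x1000+s => +0x2000+2s => +0x3000+3s => base+0,
 * where s = sizeof(void *).  Returns the first element. */
static void **build_cycle(char *base)
{
    void **slot[4];

    assert(base != NULL);
    for (int i = 0; i < 4; i++)
        slot[i] = (void **)(base + i * PAGE_LEN + i * sizeof(void *));
    for (int i = 0; i < 4; i++)
        *slot[i] = (void *)slot[(i + 1) % 4];
    return slot[0];
}

/* Follow the chain for the given number of dependent loads. */
static void **chase(void **p, long hops)
{
    while (hops-- > 0)
        p = (void **)*p;
    return p;
}
```

A caller would create a file of at least MAP_LEN bytes (for example with tmpfile() and ftruncate()), pass its descriptor to map_twice(), and then time chase(build_cycle(base), n) for a large n.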
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (21 preceding siblings ...)
  2003-09-01  1:13 ` dean gaudet
@ 2003-09-02 10:08 ` Jan Rychter
  22 siblings, 0 replies; 106+ messages in thread
From: Jan Rychter @ 2003-09-02 10:08 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1609 bytes --]

> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
>
> It searches for that address multiple which an application can use to
> get coherent multiple mappings of shared memory, with good performance.

From a Sharp Zaurus C-760. Not very interesting, I'm afraid:

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: FAIL - too slow
Test separation: 65536 bytes: FAIL - too slow
Test separation: 131072 bytes: FAIL - too slow
Test separation: 262144 bytes: FAIL - too slow
Test separation: 524288 bytes: FAIL - too slow
Test separation: 1048576 bytes: FAIL - too slow
Test separation: 2097152 bytes: FAIL - too slow
Test separation: 4194304 bytes: FAIL - too slow
Test separation: 8388608 bytes: FAIL - too slow
Test separation: 16777216 bytes: FAIL - too slow
VM page alias coherency test: failed; will use copy buffers instead

Processor       : Intel XScale-PXA255 rev 6 (v5l)
BogoMIPS        : 397.31
Features        : swp half thumb fastmult edsp
CPU implementor : 0x69
CPU architecture: 5TE
CPU variant     : 0x0
CPU part        : 0x2d0
CPU revision    : 6
Cache type      : undefined 5
Cache clean     : undefined 5
Cache lockdown  : undefined 5
Cache unified   : harvard
I size          : 16384
I assoc         : 16
I line length   : 32
I sets          : 32
D size          : 16384
D assoc         : 16
D line length   : 32
D sets          : 32

Hardware        : SHARP Shepherd
Revision        : 0000
Serial          : 0000000000000000

--J.

[-- Attachment #2: Type: application/pgp-signature, Size: 188 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread
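The cache fields reported above can be cross-checked with the same arithmetic dean applied to the Opteron ("64KiB / 2-way L1 which means 15-bits of indexing"). The sketch below uses hypothetical helper names and assumes 4 KiB pages; it describes only a virtually indexed, physically tagged cache. For the XScale figures it gives a 1024-byte way, smaller than a page, so this arithmetic does not by itself explain the failures reported here; Russell King's earlier remark that ARM L1 is mostly VIVT may be the more relevant factor.

```c
#include <assert.h>
#include <stddef.h>

/* Do the reported fields multiply out to the reported cache size? */
static int geom_consistent(size_t size, size_t assoc, size_t line,
                           size_t sets)
{
    return assoc * line * sets == size;
}

/* Bytes covered by one way: the span of addresses the index can tell apart. */
static size_t way_size(size_t size, size_t assoc)
{
    assert(assoc != 0);
    return size / assoc;
}

/* log2 of the way size, i.e. index bits plus line-offset bits. */
static unsigned index_bits(size_t way)
{
    unsigned bits = 0;

    while (way > 1) {
        way >>= 1;
        bits++;
    }
    return bits;
}

/* Number of distinct page colours; 1 means the index lies entirely inside
 * the page offset, so index bits alone cannot create aliases. */
static size_t page_colours(size_t way, size_t page)
{
    assert(page != 0);
    return way > page ? way / page : 1;
}
```

For the Opteron figure, way_size(65536, 2) is 32768 and index_bits(32768) is 15, matching the 32KiB spacing discussed earlier; for the D cache listed above, way_size(16384, 16) is 1024 and page_colours(1024, 4096) is 1.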
end of thread, other threads:[~2003-09-07 17:57 UTC | newest]