* system call for process information? @ 2001-03-12 17:08 Guennadi Liakhovetski 2001-03-12 18:27 ` Alexander Viro 0 siblings, 1 reply; 36+ messages in thread From: Guennadi Liakhovetski @ 2001-03-12 17:08 UTC (permalink / raw) To: linux-kernel Hello I asked this question on kernel-newbies - no reply, hope to be luckier here:-) I need to collect some info on processes. One way is to read /proc tree. But isn't there a system call (ioctl) for this? And what are those task[], task_struct, etc. about? Thanks Guennadi ___ Dr. Guennadi V. Liakhovetski Department of Applied Mathematics University of Sheffield, U.K. email: G.Liakhovetski@sheffield.ac.uk ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-12 17:08 system call for process information? Guennadi Liakhovetski @ 2001-03-12 18:27 ` Alexander Viro 2001-03-12 21:21 ` Guennadi Liakhovetski 2001-03-14 19:53 ` Szabolcs Szakacsits 0 siblings, 2 replies; 36+ messages in thread From: Alexander Viro @ 2001-03-12 18:27 UTC (permalink / raw) To: Guennadi Liakhovetski; +Cc: linux-kernel On Mon, 12 Mar 2001, Guennadi Liakhovetski wrote: > Hello > > I asked this question on kernel-newbies - no reply, hope to be luckier > here:-) > > I need to collect some info on processes. One way is to read /proc > tree. But isn't there a system call (ioctl) for this? And what are those Occam's Razor. Why invent new syscall when read() works? > task[], task_struct, etc. about? What branch? (2.0, 2.2, 2.4?) Cheers, Al ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-12 18:27 ` Alexander Viro @ 2001-03-12 21:21 ` Guennadi Liakhovetski 2001-03-13 2:56 ` Nathan Paul Simons 2001-03-14 19:53 ` Szabolcs Szakacsits 1 sibling, 1 reply; 36+ messages in thread From: Guennadi Liakhovetski @ 2001-03-12 21:21 UTC (permalink / raw) To: Alexander Viro; +Cc: linux-kernel On Mon, 12 Mar 2001, Alexander Viro wrote: > On Mon, 12 Mar 2001, Guennadi Liakhovetski wrote: > > > I need to collect some info on processes. One way is to read /proc > > tree. But isn't there a system call (ioctl) for this? And what are those > > Occam's Razor. Why invent new syscall when read() works? CPU utilisation. Each new application has to calculate it (ps, top, qps, kps, various sysmons, procmons, etc.). Wouldn't it be worth it having a syscall for that? Wouldn't it be more optimal? > > task[], task_struct, etc. about? > > What branch? (2.0, 2.2, 2.4?) Well, what I mean was - don't these structures contain the information I am looking for? Let's start from the end - 2.4, then what's the difference with 2.2 and finally 2.0? Thanks Guennadi ___ Dr. Guennadi V. Liakhovetski Department of Applied Mathematics University of Sheffield, U.K. email: G.Liakhovetski@sheffield.ac.uk ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-12 21:21 ` Guennadi Liakhovetski @ 2001-03-13 2:56 ` Nathan Paul Simons 2001-03-13 3:20 ` Alexander Viro 2001-03-13 21:05 ` Albert D. Cahalan 0 siblings, 2 replies; 36+ messages in thread From: Nathan Paul Simons @ 2001-03-13 2:56 UTC (permalink / raw) To: Guennadi Liakhovetski; +Cc: Alexander Viro, linux-kernel On Mon, Mar 12, 2001 at 09:21:37PM +0000, Guennadi Liakhovetski wrote: > CPU utilisation. Each new application has to calculate it (ps, top, qps, > kps, various sysmons, procmons, etc.). Wouldn't it be worth it having a > syscall for that? Wouldn't it be more optimal? No, it wouldn't be worth it because you're talking about sacrificing simplicity and kernel speed in favor of functionality. This has been known to lead to "bloat-ware". Every new syscall you add takes just a little bit more time and space in the kernel, and when only a small number of programs will be using it, it's really not worth it. This time and space may not be large, but once you get _your_ syscall added, why can't everyone else get theirs added as well? And so, after making about a thousand specialized syscalls standard in the kernel, you end up with IRIX (from what I've heard). Don't even get me started about opening security holes, and increasing code complexity. Please do a search for every other syscall that has ever been proposed on this list, read them all and the arguments for them, then think long and hard about why yours should be accepted. Because I'm sure that I'm not the only person who's going to want a good explanation as to why this syscall is essential. ps - CPU time is cheap, that's why they don't charge for it anymore. Programmer time is _not_. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-13 2:56 ` Nathan Paul Simons @ 2001-03-13 3:20 ` Alexander Viro 2001-03-13 9:55 ` Guennadi Liakhovetski 2001-03-13 21:05 ` Albert D. Cahalan 1 sibling, 1 reply; 36+ messages in thread From: Alexander Viro @ 2001-03-13 3:20 UTC (permalink / raw) To: Nathan Paul Simons; +Cc: Guennadi Liakhovetski, linux-kernel On Mon, 12 Mar 2001, Nathan Paul Simons wrote: > On Mon, Mar 12, 2001 at 09:21:37PM +0000, Guennadi Liakhovetski wrote: > > CPU utilisation. Each new application has to calculate it (ps, top, qps, > > kps, various sysmons, procmons, etc.). Wouldn't it be worth it having a > > syscall for that? Wouldn't it be more optimal? The first rule of optimization: don't. I.e. optimizing something that is not a bottleneck is pointless. > No, it wouldn't be worth it because you're talking about > sacrificing simplicity and kernel speed in favor of functionality. Or, in that case, in favour of nothing. It doesn't add any functionality. > This has been known to lead to "bloat-ware". Every new syscall you > add takes just a little bit more time and space in the kernel, and > when only a small number of programs will be using it, it's really > not worth it. This time and space may not be large, but once you > get _your_ syscall added, why can't everyone else get theirs added > as well? And so, after making about a thousand specialized syscalls > standard in the kernel, you end up with IRIX (from what I've heard). In that case there is much simpler argument. If your program checks the system load so often that converting results from ASCII to integers takes noticeable time - all you are measuring is the load created by that program. In other words, any program that would get any speedup from such syscall is absolutely worthless, since the load created by measurement will drown the load you are trying to measure. End of story. It's not only unnecessary and tasteless, it's useless.
Cheers, Al ^ permalink raw reply [flat|nested] 36+ messages in thread
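A minimal sketch of the read()-based approach discussed above, in C. It pulls a process's user and system CPU time out of /proc/<pid>/stat. The function names parse_stat_cpu and read_proc_cpu are hypothetical, and the field positions (utime and stime as fields 14 and 15, counted in clock ticks) are an assumption taken from the proc(5) man page for current kernels; older branches may lay the line out differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse utime and stime out of one /proc/<pid>/stat line.
 * Assumes the proc(5) layout: pid (comm) state ppid ... utime stime ...
 * Returns 0 on success, -1 on malformed input. */
int parse_stat_cpu(const char *line, unsigned long *utime, unsigned long *stime)
{
    assert(line != NULL && utime != NULL && stime != NULL);

    /* The command name may itself contain spaces or ')', so resume
     * parsing after the LAST ')' on the line. */
    const char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;

    /* Skip state, ppid, pgrp, session, tty_nr, tpgid, flags,
     * minflt, cminflt, majflt, cmajflt; then read utime and stime. */
    int n = sscanf(p + 1,
                   " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
                   utime, stime);
    return n == 2 ? 0 : -1;
}

/* Read /proc/<pid>/stat and extract utime/stime. Returns 0 or -1. */
int read_proc_cpu(int pid, unsigned long *utime, unsigned long *stime)
{
    char path[64];
    char buf[1024];

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    size_t len = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[len] = '\0';
    return parse_stat_cpu(buf, utime, stime);
}
```

Calling read_proc_cpu(getpid(), &u, &s) from a monitoring loop costs one open, one read and one sscanf per sample, which is the conversion overhead the argument above is about.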
* Re: system call for process information? 2001-03-13 3:20 ` Alexander Viro @ 2001-03-13 9:55 ` Guennadi Liakhovetski 0 siblings, 0 replies; 36+ messages in thread From: Guennadi Liakhovetski @ 2001-03-13 9:55 UTC (permalink / raw) To: Alexander Viro; +Cc: Nathan Paul Simons, linux-kernel Hi Alexander, Nathan and all! Thanks for your great answers! First of all - I was not REALLY proposing to include this system call in the kernel - I just wanted to hear some pro and contra - so, thanks again for your explanations! I started yesterday sketching the required functions, will have to retreat to reading top & ps sources, btw, apart from these 2 obvious sources, what else would you suggest to look through for a good implementation of CPU-utilization calculator as well as other process (multithreaded, SMP,...) statistics? Portable (POSIX), maybe some documentation, not just sources? Thanks Guennadi On Mon, 12 Mar 2001, Alexander Viro wrote: > On Mon, 12 Mar 2001, Nathan Paul Simons wrote: > > > On Mon, Mar 12, 2001 at 09:21:37PM +0000, Guennadi Liakhovetski wrote: > > > CPU utilisation. Each new application has to calculate it (ps, top, qps, > > > kps, various sysmons, procmons, etc.). Wouldn't it be worth it having a > > > syscall for that? Wouldn't it be more optimal? > > The first rule of optimization: don't. I.e. optimizing something that > is not a bottleneck is pointless. > > > No, it wouldn't be worth it because you're talking about > > sacrificing simplicity and kernel speed in favor of functionality. > > Or, in that case, in favour of nothing. It doesn't add any functionality. > > > This has been known to lead to "bloat-ware". Every new syscall you > > add takes just a little bit more time and space in the kernel, and > > when only a small number of programs will be using it, it's really > > not worth it. This time and space may not be large, but once you > > get _your_ syscall added, why can't everyone else get theirs added > > as well?
And so, after making about a thousand specialized syscalls > > standard in the kernel, you end up with IRIX (from what I've heard). > > In that case there is much simpler argument. > > If your program checks the system load so often that converting results > from ASCII to integers takes noticeable time - all you are measuring > is the load created by that program. In other words, any program that > would get any speedup from such syscall is absolutely worthless, since > the load created by measurement will drown the load you are trying > to measure. > > End of story. It's not only unnecessary and tasteless, it's > useless. > Cheers, > Al > > ___ Dr. Guennadi V. Liakhovetski Department of Applied Mathematics University of Sheffield, U.K. email: G.Liakhovetski@sheffield.ac.uk ^ permalink raw reply [flat|nested] 36+ messages in thread
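For the CPU-utilization calculator asked about above, one common approach (not necessarily what top or ps do internally) is to sample a process's accumulated CPU ticks twice and divide the difference by the ticks that could have elapsed in that interval. The sketch below shows only that arithmetic; cpu_percent and clock_ticks_per_sec are hypothetical names, and the tick counts are assumed to come from whatever source the monitor already reads.

```c
#include <assert.h>
#include <unistd.h>

/* Clock ticks per second as reported by POSIX sysconf(). */
long clock_ticks_per_sec(void)
{
    return sysconf(_SC_CLK_TCK);
}

/* Percentage of one CPU used between two samples.
 * ticks_before / ticks_after: accumulated utime + stime of the process.
 * elapsed_sec: wall-clock time between the two samples.
 * A multithreaded process on SMP can legitimately exceed 100. */
double cpu_percent(unsigned long ticks_before, unsigned long ticks_after,
                   double elapsed_sec, long ticks_per_sec)
{
    assert(elapsed_sec > 0.0 && ticks_per_sec > 0);
    assert(ticks_after >= ticks_before);

    double used = (double)(ticks_after - ticks_before);
    return 100.0 * used / ((double)ticks_per_sec * elapsed_sec);
}
```

For example, a process that accumulated 50 ticks over one second at 100 ticks per second was using half of one CPU.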
* Re: system call for process information? 2001-03-13 2:56 ` Nathan Paul Simons 2001-03-13 3:20 ` Alexander Viro @ 2001-03-13 21:05 ` Albert D. Cahalan 2001-03-13 22:02 ` Nathan Paul Simons 2001-03-13 22:52 ` Rik van Riel 1 sibling, 2 replies; 36+ messages in thread From: Albert D. Cahalan @ 2001-03-13 21:05 UTC (permalink / raw) To: npsimons; +Cc: Guennadi Liakhovetski, Alexander Viro, linux-kernel Nathan Paul Simons writes: > On Mon, Mar 12, 2001 at 09:21:37PM +0000, Guennadi Liakhovetski wrote: >> CPU utilisation. Each new application has to calculate it (ps, top, qps, >> kps, various sysmons, procmons, etc.). Wouldn't it be worth it having a >> syscall for that? Wouldn't it be more optimal? > > No, it wouldn't be worth it because you're talking about > sacrificing simplicity and kernel speed in favor of functionality. > This has been known to lead to "bloat-ware". Every new syscall you Bloat removal: being able to run without /proc mounted. We don't have "kernel speed". We have kernel-mode screwing around with text formatting. > add takes just a little bit more time and space in the kernel, and > when only a small number of programs will be using it, it's really > not worth it. This time and space may not be large, but once you > get _your_ syscall added, why can't everyone else get theirs added > as well? And so, after making about a thousand specialized syscalls > standard in the kernel, you end up with IRIX (from what I've heard). This isn't just for him. Many people have wanted it. > Don't even get me started about opening security holes, and > increasing code complexity. Please do a search for every other I'll get you started. Compare: 1. variable-length ASCII strings with undefined ad-hoc syntax 2. array of fixed-size (64-bit) values > ps - CPU time is cheap, that's why they don't charge for it anymore. > Programmer time is _not_. Parsing costs programmer time. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-13 21:05 ` Albert D. Cahalan @ 2001-03-13 22:02 ` Nathan Paul Simons 2001-03-13 22:50 ` Albert D. Cahalan 2001-03-13 22:52 ` Rik van Riel 1 sibling, 1 reply; 36+ messages in thread From: Nathan Paul Simons @ 2001-03-13 22:02 UTC (permalink / raw) To: Albert D. Cahalan; +Cc: Guennadi Liakhovetski, Alexander Viro, linux-kernel On Tue, Mar 13, 2001 at 04:05:13PM -0500, Albert D. Cahalan wrote: > Bloat removal: being able to run without /proc mounted. > > We don't have "kernel speed". We have kernel-mode screwing around > with text formatting. Or calculating things that really should be taken care of in user space, such as CPU utilization. > This isn't just for him. Many people have wanted it. Yes, but how many people would actually *use* it? How many programs out of the thousands out there would benefit from this? If it's more than 50 widely used packages, I'd be more than happy to see something that speeds them all up added to the kernel. > 1. variable-length ASCII strings with undefined ad-hoc syntax Use enumerated string functions, always. > 2. array of fixed-size (64-bit) values It's an array? That can still be overflowed by sloppy programming. When it comes right down to it, I'd rather have something that could potentially die badly be run on the user side, rather than the kernel side. > Parsing costs programmer time. But it's fairly easy to do in any number of programming languages besides C which can't be easily used in the kernel. Not to mention parsing libraries for C that fit much better on the user side because they would make the kernel huge and slow if compiled into it. Last but not least, I don't want to waste time in kernel scanning through a list of syscalls a mile long, half of them I don't ever use. Or having a kernel that's so big that you can't fit it on embedded systems anymore. And once you start adding every "nifty" syscall that comes along, that's what will happen. 
So again, I say give us all a really good reason for this syscall, or just hack it into your own kernels and let us have our speedy, small vanilla kernels. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-13 22:02 ` Nathan Paul Simons @ 2001-03-13 22:50 ` Albert D. Cahalan 0 siblings, 0 replies; 36+ messages in thread From: Albert D. Cahalan @ 2001-03-13 22:50 UTC (permalink / raw) To: npsimons Cc: Albert D. Cahalan, Guennadi Liakhovetski, Alexander Viro, linux-kernel Nathan Paul Simons writes: > On Tue, Mar 13, 2001 at 04:05:13PM -0500, Albert D. Cahalan wrote: >> Bloat removal: being able to run without /proc mounted. >> >> We don't have "kernel speed". We have kernel-mode screwing around >> with text formatting. > > Or calculating things that really should be taken care of in > user space, such as CPU utilization. That can not be done reliably in user space. I know this; the "top" program used to try. >> This isn't just for him. Many people have wanted it. > > Yes, but how many people would actually *use* it? How many > programs out of the thousands out there would benefit from this? > If it's more than 50 widely used packages, I'd be more than happy > to see something that speeds them all up added to the kernel. Oh please. How many programs use the mount() system call? One? Most system calls are rarely used. This is OK. >> 1. variable-length ASCII strings with undefined ad-hoc syntax > > Use enumerated string functions, always. > >> 2. array of fixed-size (64-bit) values > > It's an array? That can still be overflowed by sloppy > programming. No it can't. You fill it like this: tmp[0] = p->pid; tmp[1] = p->uid; /* ... */ Throw in some pretty symbolic names if you like. It's effectively a struct, but a real struct would tempt people to use non-64-bit values. Using an array enforces uniform 64-bit usage. Good design involves NOT tempting people to write irregular hacks. > When it comes right down to it, I'd rather have > something that could potentially die badly be run on the user > side, rather than the kernel side. Good. Thus you'd like the new system call in place of our current pile of crud. 
Unfortunately the crud will need to remain for at least a decade of transition time. >> Parsing costs programmer time. > > But it's fairly easy to do in any number of programming > languages besides C which can't be easily used in the kernel. > Not to mention parsing libraries for C that fit much better on > the user side because they would make the kernel huge and slow > if compiled into it. Huh? The kernel need not parse its own ASCII output. The kernel natively maintains information in a binary format. The proposed system call would not parse /proc output!!! > Last but not least, I don't want to waste time in kernel > scanning through a list of syscalls a mile long, half of them > I don't ever use. Well, tough luck. Learn to use an editor with search ability. Even "less" and Netscape can search. > Or having a kernel that's so big that you > can't fit it on embedded systems anymore. The proposed system call was implemented for an embedded system. This allowed operation without the /proc filesystem, which is some serious bloat. > And once you start > adding every "nifty" syscall that comes along, that's what > will happen. So again, I say give us all a really good reason > for this syscall, or just hack it into your own kernels and > let us have our speedy, small vanilla kernels. If you think /proc is speedy and small... ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-13 21:05 ` Albert D. Cahalan 2001-03-13 22:02 ` Nathan Paul Simons @ 2001-03-13 22:52 ` Rik van Riel 2001-03-14 1:53 ` Martin Dalecki 2001-03-14 1:59 ` system call for process information? john slee 1 sibling, 2 replies; 36+ messages in thread From: Rik van Riel @ 2001-03-13 22:52 UTC (permalink / raw) To: Albert D. Cahalan Cc: npsimons, Guennadi Liakhovetski, Alexander Viro, linux-kernel On Tue, 13 Mar 2001, Albert D. Cahalan wrote: > Bloat removal: being able to run without /proc mounted. > > We don't have "kernel speed". We have kernel-mode screwing around > with text formatting. Sounds like you might want to maintain an external patch for the embedded folks... regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-13 22:52 ` Rik van Riel @ 2001-03-14 1:53 ` Martin Dalecki 2001-03-14 2:28 ` Rik van Riel 2001-03-14 1:59 ` system call for process information? john slee 1 sibling, 1 reply; 36+ messages in thread From: Martin Dalecki @ 2001-03-14 1:53 UTC (permalink / raw) To: Rik van Riel Cc: Albert D. Cahalan, npsimons, Guennadi Liakhovetski, Alexander Viro, linux-kernel Rik van Riel wrote: > > On Tue, 13 Mar 2001, Albert D. Cahalan wrote: > > > Bloat removal: being able to run without /proc mounted. > > > > We don't have "kernel speed". We have kernel-mode screwing around > > with text formatting. > > Sounds like you might want to maintain an external patch > for the embedded folks... Not the embedded folks!!! The server folks laugh hysterically all times they go via ssh to a thrashing busy box to see what's wrong and then they see top or ps auxe under linux never finishing their job: > > regards, > > Rik > -- > Virtual memory is like a game you can't win; > However, without VM there's truly nothing to lose... > > http://www.surriel.com/ > http://www.conectiva.com/ http://distro.conectiva.com.br/ -- - phone: +49 214 8656 283 - job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!) - langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort: ru_RU.KOI8-R ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-14 1:53 ` Martin Dalecki @ 2001-03-14 2:28 ` Rik van Riel 2001-03-14 8:24 ` george anzinger 0 siblings, 1 reply; 36+ messages in thread From: Rik van Riel @ 2001-03-14 2:28 UTC (permalink / raw) To: Martin Dalecki Cc: Albert D. Cahalan, npsimons, Guennadi Liakhovetski, Alexander Viro, linux-kernel On Wed, 14 Mar 2001, Martin Dalecki wrote: > Not the embedded folks!!! The server folks laugh hysterically all > times they go via ssh to a thrashing busy box to see what's wrong and > then they see top or ps auxe under linux never finishing their job: That's a separate issue. I guess the pagefault path should have _2_ locks. One mmap_sem protecting read-only access to the address space and another one for write access to the address space (to stop races with swapout, other page faults, ...). At the point where the pagefault sleeps on IO, it could release the read-only lock, so vmstat, top, etc can get the statistics they need. Only during the time the pagefaulting code is actually messing with the address space could it block read access (to prevent others from seeing an inconsistent state). regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-14 2:28 ` Rik van Riel @ 2001-03-14 8:24 ` george anzinger 2001-03-14 19:19 ` Rik van Riel 0 siblings, 1 reply; 36+ messages in thread From: george anzinger @ 2001-03-14 8:24 UTC (permalink / raw) To: Rik van Riel Cc: Martin Dalecki, Albert D. Cahalan, npsimons, Guennadi Liakhovetski, Alexander Viro, linux-kernel Rik van Riel wrote: > > On Wed, 14 Mar 2001, Martin Dalecki wrote: > > > Not the embedded folks!!! The server folks laugh hysterically all > > times they go via ssh to a thrashing busy box to see what's wrong and > > then they see top or ps auxe under linux never finishing their job: > > That's a separate issue. > > I guess the pagefault path should have _2_ locks. > > One mmap_sem protecting read-only access to the address space > and another one for write access to the address space (to stop > races with swapout, other page faults, ...). > > At the point where the pagefault sleeps on IO, it could release > the read-only lock, so vmstat, top, etc can get the statistics > they need. Only during the time the pagefaulting code is actually > messing with the address space could it block read access (to > prevent others from seeing an inconsistent state). > Is it REALLY necessary to prevent them from seeing an inconsistent state? Seems to me that in the total picture (i.e. system wide) they will never see a consistent state, so why be concerned with a small corner of the system. Let them figure it out, possibly by consistency checks, if they care. It just seems unhealthy to demand consistency at the cost of delays that will only make other data even more inconsistent. And if the delay is _forever_ from a tool that may be used to diagnose system problems... I would rather a tool that repeatedly showed the same inconsistent state than one that hangs because it can not get a consistent one. George ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-14 8:24 ` george anzinger @ 2001-03-14 19:19 ` Rik van Riel 2001-03-14 16:27 ` george anzinger 2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel 0 siblings, 2 replies; 36+ messages in thread From: Rik van Riel @ 2001-03-14 19:19 UTC (permalink / raw) To: george anzinger Cc: Martin Dalecki, Albert D. Cahalan, npsimons, Guennadi Liakhovetski, Alexander Viro, linux-kernel On Wed, 14 Mar 2001, george anzinger wrote: > Is it REALLY necessary to prevent them from seeing an > inconsistent state? Seems to me that in the total picture (i.e. > system wide) they will never see a consistent state, so why be > concerned with a small corner of the system. You're right. All we need to make sure of is that the address space we want to print info about doesn't go away while we're reading the stats ... (I think ... but we'll need to look at the procfs code in more detail) regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-14 19:19 ` Rik van Riel @ 2001-03-14 16:27 ` george anzinger 2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel 1 sibling, 0 replies; 36+ messages in thread From: george anzinger @ 2001-03-14 16:27 UTC (permalink / raw) To: Rik van Riel Cc: Martin Dalecki, Albert D. Cahalan, npsimons, Guennadi Liakhovetski, Alexander Viro, linux-kernel Rik van Riel wrote: > > On Wed, 14 Mar 2001, george anzinger wrote: > > > Is it REALLY necessary to prevent them from seeing an > > inconsistent state? Seems to me that in the total picture (i.e. > > system wide) they will never see a consistent state, so why be > > concerned with a small corner of the system. > > You're right. All we need to make sure of is that the address > space we want to print info about doesn't go away while we're > reading the stats ... > > (I think ... but we'll need to look at the procfs code in more > detail) > For what it's worth: On the last system I worked on we had a status program that maintained a screen with interesting things such as context switches per sec, disc i/o/sec, lan traffic/sec, ready queue length, next task (printed as current task) and... well a whole 26X80 screen full of stuff. The program gathered all the data by reading system tables as quickly as possible and THEN did the formatting/ screen update. Having to deal with preformatted data would have a.) widened the capture window and b.) been a real drag to reformat and move to the right screen location. We allowed programs that had the savvy to have read only access to the kernel area to make this as fast as possible. George ^ permalink raw reply [flat|nested] 36+ messages in thread
* changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-14 19:19 ` Rik van Riel 2001-03-14 16:27 ` george anzinger @ 2001-03-15 12:24 ` Rik van Riel 2001-03-16 9:49 ` Stephen C. Tweedie 1 sibling, 1 reply; 36+ messages in thread From: Rik van Riel @ 2001-03-15 12:24 UTC (permalink / raw) To: george anzinger; +Cc: Alexander Viro, linux-mm, bcrl, linux-kernel On Wed, 14 Mar 2001, Rik van Riel wrote: > On Wed, 14 Mar 2001, george anzinger wrote: > > > Is it REALLY necessary to prevent them from seeing an > > inconsistent state? Seems to me that in the total picture (i.e. > > system wide) they will never see a consistent state, so why be > > concerned with a small corner of the system. > > You're right. Mmmm, I've looked at the code today and it turned out that we're NOT right ;) The mmap_sem is used in procfs to prevent the list of VMAs from changing. In the page fault code it seems to be used to prevent other page faults to happen at the same time with the current page fault (and to prevent VMAs from changing while a page fault is underway). Maybe we should change the mmap_sem into a R/W semaphore ? Since page faults seem to be the "common cause" of blocking procfs access *and* since both page faults and procfs only need to prevent the VMA list from changing, a read lock would help here. Write locks would be used in the code where we actually want to change the VMA list and page faults would use an extra lock to protect against each other (possibly a per-pagetable lock so multithreaded apps can pagefault in different memory regions at the same time ???). regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel @ 2001-03-16 9:49 ` Stephen C. Tweedie 2001-03-16 11:50 ` Rik van Riel 0 siblings, 1 reply; 36+ messages in thread From: Stephen C. Tweedie @ 2001-03-16 9:49 UTC (permalink / raw) To: Rik van Riel Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel Hi, On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote: > On Wed, 14 Mar 2001, Rik van Riel wrote: > The mmap_sem is used in procfs to prevent the list of VMAs > from changing. In the page fault code it seems to be used > to prevent other page faults to happen at the same time with > the current page fault (and to prevent VMAs from changing > while a page fault is underway). The page table spinlock should be quite sufficient to let us avoid races in the page fault code. We've had to deal with this before there was ever a mmap_sem anyway: in ancient times, every page fault had to do things like check to see if the pte had changed after IO was complete and once the BKL had been retaken. We can do the same with the page fault spinlock without much pain. > Maybe we should change the mmap_sem into a R/W semaphore ? Definitely. > Write locks would be used in the code where we actually want > to change the VMA list and page faults would use an extra lock > to protect against each other (possibly a per-pagetable lock Why do we need another lock? The critical section where we do the final update on the pte _already_ takes the page table spinlock to avoid races against the swapper. Cheers, Stephen ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-16 9:49 ` Stephen C. Tweedie @ 2001-03-16 11:50 ` Rik van Riel 2001-03-16 12:53 ` Stephen C. Tweedie 0 siblings, 1 reply; 36+ messages in thread From: Rik van Riel @ 2001-03-16 11:50 UTC (permalink / raw) To: Stephen C. Tweedie Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel On Fri, 16 Mar 2001, Stephen C. Tweedie wrote: > On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote: > > On Wed, 14 Mar 2001, Rik van Riel wrote: > > > The mmap_sem is used in procfs to prevent the list of VMAs > > from changing. In the page fault code it seems to be used > > to prevent other page faults to happen at the same time with > > the current page fault (and to prevent VMAs from changing > > while a page fault is underway). > > The page table spinlock should be quite sufficient to let us avoid > races in the page fault code. > > Write locks would be used in the code where we actually want > > to change the VMA list and page faults would use an extra lock > > to protect against each other (possibly a per-pagetable lock > > Why do we need another lock? The critical section where we do the > final update on the pte _already_ takes the page table spinlock to > avoid races against the swapper. The problem is that mmap_sem seems to be protecting the list of VMAs, so taking _only_ the page_table_lock could let a VMA change under us while a page fault is underway ... Then again, I guess just making mmap_sem a R/W lock should fix our problems ... and maybe even make it possible (in 2.5?) to let multithreaded programs have pagefaults at the same time, instead of having all threads queue up behind mmap_sem. regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-16 11:50 ` Rik van Riel @ 2001-03-16 12:53 ` Stephen C. Tweedie 2001-03-18 7:23 ` Rik van Riel 0 siblings, 1 reply; 36+ messages in thread From: Stephen C. Tweedie @ 2001-03-16 12:53 UTC (permalink / raw) To: Rik van Riel Cc: Stephen C. Tweedie, george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel Hi, On Fri, Mar 16, 2001 at 08:50:25AM -0300, Rik van Riel wrote: > On Fri, 16 Mar 2001, Stephen C. Tweedie wrote: > > > > Write locks would be used in the code where we actually want > > > to change the VMA list and page faults would use an extra lock > > > to protect against each other (possibly a per-pagetable lock > > > > Why do we need another lock? The critical section where we do the > > final update on the pte _already_ takes the page table spinlock to > > avoid races against the swapper. > > The problem is that mmap_sem seems to be protecting the list > of VMAs, so taking _only_ the page_table_lock could let a VMA > change under us while a page fault is underway ... Right, I'm not suggesting removing that: making the mmap_sem read/write is fine, but yes, we still need that semaphore. But as for the "page faults would use an extra lock to protect against each other" bit --- we already have another lock, the page table lock, which can be used in this way, so ANOTHER lock should be unnecessary. --Stephen ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-16 12:53 ` Stephen C. Tweedie @ 2001-03-18 7:23 ` Rik van Riel 2001-03-18 9:56 ` Mike Galbraith 0 siblings, 1 reply; 36+ messages in thread From: Rik van Riel @ 2001-03-18 7:23 UTC (permalink / raw) To: Stephen C. Tweedie Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel On Fri, 16 Mar 2001, Stephen C. Tweedie wrote: > Right, I'm not suggesting removing that: making the mmap_sem > read/write is fine, but yes, we still need that semaphore. Initial patch (against 2.4.2-ac20) is available at http://www.surriel.com/patches/ > But as for the "page faults would use an extra lock to protect against > each other" bit --- we already have another lock, the page table lock, > which can be used in this way, so ANOTHER lock should be unnecessary. Tomorrow I'll take a look at the various ->nopage functions and do_swap_page to see if these functions would be able to take simultaneous faults at the same address (from multiple threads). If not, either we'll need to modify these functions, or we could add a (few?) extra lock to prevent these functions from faulting at the same address at the same time in multiple threads. regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-18 7:23 ` Rik van Riel @ 2001-03-18 9:56 ` Mike Galbraith 2001-03-18 10:46 ` Rik van Riel 0 siblings, 1 reply; 36+ messages in thread From: Mike Galbraith @ 2001-03-18 9:56 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-mm, linux-kernel On Sun, 18 Mar 2001, Rik van Riel wrote: > On Fri, 16 Mar 2001, Stephen C. Tweedie wrote: > > > Right, I'm not suggesting removing that: making the mmap_sem > > read/write is fine, but yes, we still need that semaphore. > > Initial patch (against 2.4.2-ac20) is available at > http://www.surriel.com/patches/ > > > But as for the "page faults would use an extra lock to protect against > > each other" bit --- we already have another lock, the page table lock, > > which can be used in this way, so ANOTHER lock should be unnecessary. > > Tomorrow I'll take a look at the various ->nopage > functions and do_swap_page to see if these functions > would be able to take simultaneous faults at the same > address (from multiple threads). If not, either we'll > need to modify these functions, or we could add a (few?) > extra lock to prevent these functions from faulting at > the same address at the same time in multiple threads. Hi Rik, I gave this patch a try, and the initial results are extremely encouraging. Not only do I have vmstat (SCHED_RR) info in realtime with zero delays :)) I also have a _nice_ throughput improvement. There are some worrisome warnings below along with the compile changes I made here, but for an initial patch, things look pretty darn wonderful. 
Cheers,

	-Mike

--- ./include/linux/sched.h.org	Sun Mar 18 10:20:42 2001
+++ ./include/linux/sched.h	Sun Mar 18 10:27:48 2001
@@ -238,7 +238,7 @@
 	mm_users:	ATOMIC_INIT(2),			\
 	mm_count:	ATOMIC_INIT(1),			\
 	map_count:	1,				\
-	mmap_sem:	__MUTEX_INITIALIZER(name.mmap_sem), \
+	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem, RW_LOCK_BIAS), \
 	page_table_lock: SPIN_LOCK_UNLOCKED,		\
 	mmlist:		LIST_HEAD_INIT(name.mmlist),	\
 }
--- ./include/linux/mm.h.org	Sun Mar 18 09:56:55 2001
+++ ./include/linux/mm.h	Sun Mar 18 10:27:59 2001
@@ -533,13 +533,13 @@
 	if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
 	    ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
-	spin_lock(&mm->page_table_lock);
+	spin_lock(&vma->vm_mm->page_table_lock);
 	vma->vm_start = address;
 	vma->vm_pgoff -= grow;
 	vma->vm_mm->total_vm += grow;
 	if (vma->vm_flags & VM_LOCKED)
 		vma->vm_mm->locked_vm += grow;
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(&vma->vm_mm->page_table_lock);
 	return 0;
 }

...

VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 196k freed Adding Swap: 265064k swap-space (priority 2) VM: Bad swap entry 00011e00 VM: Bad swap entry 00058d00 Unused swap offset entry in swap_dup 00058d00 Unused swap offset entry in swap_dup 00011e00 VM: Bad swap entry 00011e00 VM: Bad swap entry 00058d00 Unused swap offset entry in swap_dup 00058d00 VM: Bad swap entry 00058d00 Unused swap offset entry in swap_dup 00011e00 Unused swap offset entry in swap_dup 00058d00 VM: Bad swap entry 00011e00 VM: Bad swap entry 00058d00 Unused swap offset entry in swap_dup 00011e00 Unused swap offset entry in swap_dup 00058d00 VM: Bad swap entry 00011e00 VM: Bad swap entry 00058d00 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 Unused swap offset entry in swap_dup 006ef700 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_count 00011e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 008f4e00 Unused swap offset entry in swap_dup 006ef700 VM: Bad swap 
entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 008f4e00 Unused swap offset entry in swap_dup 006ef700 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 Unused swap offset entry in swap_dup 00011e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 00011e00 Unused swap offset entry in swap_count 00011e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 006ef700 Unused swap offset entry in swap_dup 008f4e00 VM: Bad swap entry 006ef700 VM: Bad swap entry 008f4e00 Unused swap offset entry in swap_dup 008f4e00 Unused swap offset entry in swap_dup 006ef700 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-18 9:56 ` Mike Galbraith @ 2001-03-18 10:46 ` Rik van Riel 2001-03-18 12:33 ` Mike Galbraith 0 siblings, 1 reply; 36+ messages in thread From: Rik van Riel @ 2001-03-18 10:46 UTC (permalink / raw) To: Mike Galbraith; +Cc: linux-mm, linux-kernel On Sun, 18 Mar 2001, Mike Galbraith wrote: > I gave this patch a try, and the initial results are extremely encouraging. > Not only do I have vmstat (SCHED_RR) info in realtime with zero delays :)) > I also have a _nice_ throughput improvement. There are some worrisome > warnings below along with the compile changes I made here, but for an > initial patch, things look pretty darn wonderful. [snip compile fixes .. integrated] > VFS: Mounted root (ext2 filesystem) readonly. > Freeing unused kernel memory: 196k freed > Adding Swap: 265064k swap-space (priority 2) > VM: Bad swap entry 00011e00 > VM: Bad swap entry 00058d00 > Unused swap offset entry in swap_dup 00058d00 > Unused swap offset entry in swap_dup 00011e00 > VM: Bad swap entry 00011e00 > VM: Bad swap entry 00058d00 Heh, I guess do_swap_page isn't too happy when multiple threads of the same program take a page fault at the same address at the same time. I take it you were testing something like mysql, jvm or apache2 ? regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-18 10:46 ` Rik van Riel @ 2001-03-18 12:33 ` Mike Galbraith 0 siblings, 0 replies; 36+ messages in thread From: Mike Galbraith @ 2001-03-18 12:33 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-mm, linux-kernel On Sun, 18 Mar 2001, Rik van Riel wrote: > > VFS: Mounted root (ext2 filesystem) readonly. > > Freeing unused kernel memory: 196k freed > > Adding Swap: 265064k swap-space (priority 2) > > VM: Bad swap entry 00011e00 > > VM: Bad swap entry 00058d00 > > Unused swap offset entry in swap_dup 00058d00 > > Unused swap offset entry in swap_dup 00011e00 > > VM: Bad swap entry 00011e00 > > VM: Bad swap entry 00058d00 > > Heh, I guess do_swap_page isn't too happy when multiple threads > of the same program take a page fault at the same address at the > same time. > > I take it you were testing something like mysql, jvm or apache2 ? No, this was make -j30 bzImage. (nscd was running though...) -Mike ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-13 22:52 ` Rik van Riel 2001-03-14 1:53 ` Martin Dalecki @ 2001-03-14 1:59 ` john slee 1 sibling, 0 replies; 36+ messages in thread From: john slee @ 2001-03-14 1:59 UTC (permalink / raw) To: linux-kernel; +Cc: Rik van Riel On Tue, Mar 13, 2001 at 07:52:41PM -0300, Rik van Riel wrote: > On Tue, 13 Mar 2001, Albert D. Cahalan wrote: > > > Bloat removal: being able to run without /proc mounted. > > > > We don't have "kernel speed". We have kernel-mode screwing around > > with text formatting. > > Sounds like you might want to maintain an external patch > for the embedded folks... or perhaps a patch to remove the non-procfs stuff from proc - leaving just /proc/[0-9]+ and /proc/self... that way top/ps still mostly work without patches and you don't have all the other stuff that you don't need (perhaps make a separate kernfs?). i *am* aware of the previous flamewars over this :-) but it does appear to me to be a more generally useful compromise in the anti-bloat case. j. (who likes proc as it is now) ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-12 18:27 ` Alexander Viro 2001-03-12 21:21 ` Guennadi Liakhovetski @ 2001-03-14 19:53 ` Szabolcs Szakacsits 2001-03-14 19:55 ` Alexander Viro 1 sibling, 1 reply; 36+ messages in thread From: Szabolcs Szakacsits @ 2001-03-14 19:53 UTC (permalink / raw) To: Alexander Viro; +Cc: Guennadi Liakhovetski, linux-kernel On Mon, 12 Mar 2001, Alexander Viro wrote: > On Mon, 12 Mar 2001, Guennadi Liakhovetski wrote: > > I need to collect some info on processes. One way is to read /proc > > tree. But isn't there a system call (ioctl) for this? And what are those > Occam's Razor. Why invent new syscall when read() works? read() doesn't really work for this purpose, it blocks way too many times to be very annoying. When finally data arrives it's useless. Szaka ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-14 19:53 ` Szabolcs Szakacsits @ 2001-03-14 19:55 ` Alexander Viro 2001-03-14 20:23 ` Szabolcs Szakacsits 0 siblings, 1 reply; 36+ messages in thread From: Alexander Viro @ 2001-03-14 19:55 UTC (permalink / raw) To: Szabolcs Szakacsits; +Cc: Guennadi Liakhovetski, linux-kernel On Wed, 14 Mar 2001, Szabolcs Szakacsits wrote: > > On Mon, 12 Mar 2001, Alexander Viro wrote: > > On Mon, 12 Mar 2001, Guennadi Liakhovetski wrote: > > > I need to collect some info on processes. One way is to read /proc > > > tree. But isn't there a system call (ioctl) for this? And what are those > > Occam's Razor. Why invent new syscall when read() works? > > read() doesn't really work for this purpose, it blocks way too many > times to be very annoying. When finally data arrives it's useless. Huh? Take code of your non-blocking syscall. Make it ->read() for relevant file on /proc or wherever else you want it. See read() not blocking... Whether code blocks or not depends on the code, not on the calling conventions. And definitely not on ASCII vs. binary - conversion between these formats _is_ doable without blocking operations. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-14 19:55 ` Alexander Viro @ 2001-03-14 20:23 ` Szabolcs Szakacsits 2001-03-14 20:21 ` Alexander Viro 0 siblings, 1 reply; 36+ messages in thread From: Szabolcs Szakacsits @ 2001-03-14 20:23 UTC (permalink / raw) To: Alexander Viro; +Cc: Guennadi Liakhovetski, linux-kernel On Wed, 14 Mar 2001, Alexander Viro wrote: > On Wed, 14 Mar 2001, Szabolcs Szakacsits wrote: > > read() doesn't really work for this purpose, it blocks way too many > > times to be very annoying. When finally data arrives it's useless. > Huh? Take code of your non-blocking syscall. Make it ->read() for > relevant file on /proc or wherever else you want it. See read() not > blocking... Sorry I should have quoted "blocks". Problem isn't with blocking but *no* data, no information. In the end you can conclude you know *nothing* about what happened in the last t time interval - this can be seconds or minutes, even with an RT, mlocked, etc. process when the load is around 0. Szaka ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: system call for process information? 2001-03-14 20:23 ` Szabolcs Szakacsits @ 2001-03-14 20:21 ` Alexander Viro 0 siblings, 0 replies; 36+ messages in thread From: Alexander Viro @ 2001-03-14 20:21 UTC (permalink / raw) To: Szabolcs Szakacsits; +Cc: Guennadi Liakhovetski, linux-kernel On Wed, 14 Mar 2001, Szabolcs Szakacsits wrote: > > On Wed, 14 Mar 2001, Alexander Viro wrote: > > On Wed, 14 Mar 2001, Szabolcs Szakacsits wrote: > > > read() doesn't really work for this purpose, it blocks way too many > > > times to be very annoying. When finally data arrives it's useless. > > Huh? Take code of your non-blocking syscall. Make it ->read() for > > relevant file on /proc or wherever else you want it. See read() not > > blocking... > > Sorry I should have quoted "blocks". Problem isn't with blocking but > *no* data, no information. In the end you can conclude you know > *nothing* about what happened in the last t time interval - this can be seconds > or minutes, even with an RT, mlocked, etc. process when the load is around 0. And how will a new syscall avoid the same problems you have with read()? Again, they can share the payload code - it's a matter of calling conventions and layout of the output. _That_ part doesn't take long. If reading is too slow - too bad, changing the syscall number won't help. ^ permalink raw reply [flat|nested] 36+ messages in thread
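Al Viro's point that a syscall and a ->read() method "can share the payload code" can be sketched as follows. This is a user-space illustration under assumptions, not anything from the kernel: struct proc_sample, collect_sample(), sample_as_text() and sample_as_binary() are hypothetical names, and the counter values are placeholders rather than real accounting data.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record; the fields are placeholders for per-process counters. */
struct proc_sample {
    int pid;
    unsigned long utime;
    unsigned long stime;
};

/* Shared payload: the only part that would touch process state. */
void collect_sample(int pid, struct proc_sample *out)
{
    assert(out != NULL);
    out->pid = pid;
    out->utime = 120;   /* placeholder value, not real data */
    out->stime = 30;    /* placeholder value, not real data */
}

/* Text front-end, in the spirit of ->read() on a /proc file. */
int sample_as_text(int pid, char *buf, size_t len)
{
    struct proc_sample s;

    collect_sample(pid, &s);
    return snprintf(buf, len, "%d %lu %lu\n", s.pid, s.utime, s.stime);
}

/* Binary front-end, in the spirit of a dedicated syscall filling a struct. */
int sample_as_binary(int pid, void *buf, size_t len)
{
    struct proc_sample s;

    if (len < sizeof s)
        return -1;      /* caller's buffer is too small */
    collect_sample(pid, &s);
    memcpy(buf, &s, sizeof s);
    return (int)sizeof s;
}
```

Both front-ends spend their time in collect_sample(); the formatting step is a bounded, non-blocking conversion. If collecting the data stalls, it stalls equally behind either calling convention.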
* Re: changing mm->mmap_sem (was: Re: system call for process information?) @ 2001-03-18 9:34 Manfred Spraul 2001-03-18 10:56 ` Rik van Riel 2001-03-19 12:54 ` Stephen C. Tweedie 0 siblings, 2 replies; 36+ messages in thread From: Manfred Spraul @ 2001-03-18 9:34 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel, Stephen C. Tweedie > > The problem is that mmap_sem seems to be protecting the list > of VMAs, so taking _only_ the page_table_lock could let a VMA > change under us while a page fault is underway ... No, that can't happen. VMA changes only happen if both the mmap_sem and the page table lock are acquired. (check insert_vm() at the end of mm/mmap.c) The page fault path uses the mmap_sem, kswapd uses page_table_lock. << from your patch: --- linux-2.4.2-ac20-vm/mm/vmscan.c.orig Sat Mar 17 11:30:49 2001 +++ linux-2.4.2-ac20-vm/mm/vmscan.c Sat Mar 17 20:53:10 2001 @@ -231,6 +231,7 @@ * Find the proper vm-area after freezing the vma chain * and ptes. */ + down_read(&mm->mmap_sem); spin_lock(&mm->page_table_lock); >>>> Why do you acquire the mmap semaphore in swapout_mm()? The old rule was that kswapd should never sleep on the mmap semaphore. Isn't there a deadlock if mmap sem is already acquired? I don't remember the details. > > The problem is that mmap_sem seems to be protecting the list > of VMAs, so taking _only_ the page_table_lock could let a VMA > change under us while a page fault is underway ... I remember that the pmd_alloc() and pte_alloc() functions need additional locking. -- Manfred ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-18 9:34 changing mm->mmap_sem (was: Re: system call for process information?) Manfred Spraul @ 2001-03-18 10:56 ` Rik van Riel 2001-03-19 12:54 ` Stephen C. Tweedie 1 sibling, 0 replies; 36+ messages in thread From: Rik van Riel @ 2001-03-18 10:56 UTC (permalink / raw) To: Manfred Spraul; +Cc: linux-kernel, Stephen C. Tweedie On Sun, 18 Mar 2001, Manfred Spraul wrote: > > The problem is that mmap_sem seems to be protecting the list > > of VMAs, so taking _only_ the page_table_lock could let a VMA > > change under us while a page fault is underway ... > > No, that can't happen. > VMA changes only happen if both the mmap_sem and the page table lock are > acquired. (check insert_vm() at the end of mm/mmap.c) > The page fault path uses the mmap_sem, kswapd uses page_table_lock. You're right here, I missed this "little detail"... > << from your patch: > --- linux-2.4.2-ac20-vm/mm/vmscan.c.orig Sat Mar 17 11:30:49 2001 > +++ linux-2.4.2-ac20-vm/mm/vmscan.c Sat Mar 17 20:53:10 2001 > @@ -231,6 +231,7 @@ > * Find the proper vm-area after freezing the vma chain > * and ptes. > */ > + down_read(&mm->mmap_sem); > spin_lock(&mm->page_table_lock); > >>>> > > Why do you acquire the mmap semaphore in swapout_mm()? The old rule was > that kswapd should never sleep on the mmap semaphore. Isn't there a > deadlock if mmap sem is already acquired? I don't remember the details. You're right, kswapd shouldn't do this. I have this removed from my code right now... > > The problem is that mmap_sem seems to be protecting the list > > of VMAs, so taking _only_ the page_table_lock could let a VMA > > change under us while a page fault is underway ... > > I remember that the pmd_alloc() and pte_alloc() functions need > additional locking. Isn't this what the page_table_lock is for ? (too bad they're not using it...)
regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-18 9:34 changing mm->mmap_sem (was: Re: system call for process information?) Manfred Spraul 2001-03-18 10:56 ` Rik van Riel @ 2001-03-19 12:54 ` Stephen C. Tweedie 1 sibling, 0 replies; 36+ messages in thread From: Stephen C. Tweedie @ 2001-03-19 12:54 UTC (permalink / raw) To: Manfred Spraul; +Cc: Rik van Riel, linux-kernel, Stephen C. Tweedie Hi, On Sun, Mar 18, 2001 at 10:34:38AM +0100, Manfred Spraul wrote: > > The problem is that mmap_sem seems to be protecting the list > > of VMAs, so taking _only_ the page_table_lock could let a VMA > > change under us while a page fault is underway ... > > No, that can't happen. It can. Page faults often need to block, so they have to be able to drop the page_table_lock. Holding the mmap_sem is all that keeps the vma intact until the IO is complete. Cheers, Stephen ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <Pine.LNX.4.33.0103181407520.1426-100000@mikeg.weiden.de>]
* Re: changing mm->mmap_sem (was: Re: system call for process information?) [not found] <Pine.LNX.4.33.0103181407520.1426-100000@mikeg.weiden.de> @ 2001-03-18 14:43 ` Rik van Riel 2001-03-18 18:13 ` Linus Torvalds 0 siblings, 1 reply; 36+ messages in thread From: Rik van Riel @ 2001-03-18 14:43 UTC (permalink / raw) To: Mike Galbraith; +Cc: sct, linux-kernel On Sun, 18 Mar 2001, Mike Galbraith wrote: > > No, this was make -j30 bzImage. (nscd was running though...) > > I rebooted, shut down nscd prior to testing and did 5 builds in a row > without a single gripe. Started nscd for sixth run and instantly the > kernel griped. Yup.. threaded apps pushing swap. OK, I'll write some code to prevent multiple threads from stepping all over each other when they pagefault at the same address. What would be the preferred method of fixing this ? - fixing do_swap_page and all ->nopage functions - hacking handle_mm_fault to make sure no overlapping pagefaults will be served at the same time regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-18 14:43 ` Rik van Riel @ 2001-03-18 18:13 ` Linus Torvalds 0 siblings, 0 replies; 36+ messages in thread From: Linus Torvalds @ 2001-03-18 18:13 UTC (permalink / raw) To: linux-kernel In article <Pine.LNX.4.21.0103181122480.13050-100000@imladris.rielhome.conectiva>, Rik van Riel <riel@conectiva.com.br> wrote: > >OK, I'll write some code to prevent multiple threads from >stepping all over each other when they pagefault at the >same address. > >What would be the preferred method of fixing this ? > >- fixing do_swap_page and all ->nopage functions There is no need to fix the "nopage" functions. They never see the page table directly anyway. So the only thing that _should_ be needed is to make sure that do_no_page(), do_swap_page() and do_anonymous_page() will re-acquire the mm->page_table_lock and undo their work if it turns out that the page table entry is no longer empty. (do_wp_page() should already be ok in this regard - it already does this exactly because present pagetable entries can already race with kswapd. What we're adding is that _nonpresent_ page table entries can race with multiple invocations of concurrent page faults) >- hacking handle_mm_fault to make sure no overlapping > pagefaults will be served at the same time No. The whole reason the rw_semaphores were done in the first place was to allow page faults to happen concurrently to allow threaded applications to scale up even when faulting. Linus ^ permalink raw reply [flat|nested] 36+ messages in thread
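The "re-acquire the lock and undo the work if the entry is no longer empty" rule from the message above can be sketched in user space. This is a minimal sketch under assumptions: struct slot and fault_in() are hypothetical names, a pthread mutex stands in for mm->page_table_lock, a pointer stands in for one page table entry, and calloc() stands in for the work that may block.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for one page table entry and the lock guarding it. */
struct slot {
    pthread_mutex_t lock;   /* analogue of mm->page_table_lock */
    void *entry;            /* NULL means "not present" */
};

/*
 * Returns 1 if this caller installed the page, 0 if another caller got
 * there first and this caller undid its own work, -1 if allocation failed.
 */
int fault_in(struct slot *s, size_t page_size)
{
    void *page;

    assert(s != NULL);

    /* The expensive part runs without the lock, so faults can overlap. */
    page = calloc(1, page_size);
    if (page == NULL)
        return -1;

    /* Retake the lock and check whether the entry is still empty. */
    pthread_mutex_lock(&s->lock);
    if (s->entry != NULL) {
        /* Lost the race: drop the duplicate and use the winner's page. */
        pthread_mutex_unlock(&s->lock);
        free(page);
        return 0;
    }
    s->entry = page;
    pthread_mutex_unlock(&s->lock);
    return 1;
}
```

The cost of this design is the occasional duplicated allocation, which is the "double work" trade-off raised in the follow-up. The benefit is that no fault ever waits on another fault for a different address.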
[parent not found: <200103181813.KAA22153@penguin.transmeta.com>]
* Re: changing mm->mmap_sem (was: Re: system call for process information?) [not found] <200103181813.KAA22153@penguin.transmeta.com> @ 2001-03-18 20:59 ` Rik van Riel 2001-03-19 1:21 ` Linus Torvalds 0 siblings, 1 reply; 36+ messages in thread From: Rik van Riel @ 2001-03-18 20:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Sun, 18 Mar 2001, Linus Torvalds wrote: > In article <Pine.LNX.4.21.0103181122480.13050-100000@imladris.rielhome.conectiva>, > Rik van Riel <riel@conectiva.com.br> wrote: > > > >OK, I'll write some code to prevent multiple threads from > >stepping all over each other when they pagefault at the > >same address. > > > >What would be the preferred method of fixing this ? > > > >- fixing do_swap_page and all ->nopage functions > > There is no need to fix the "nopage" functions. They never see the > page table directly anyway. > > So the only thing that _should_ be needed is to make sure that > do_no_page(), do_swap_page() and do_anonymous_page() will re-acquire > the mm->page_table_lock and undo their work if it turns out that the > page table entry is no longer empty. ... in which case concurrency is maximised, but there is a possibility of doing double work... > >- hacking handle_mm_fault to make sure no overlapping > > pagefaults will be served at the same time > > No. The whole reason the rw_semaphores were done in the first place > was to allow page faults to happen concurrently to allow threaded > applications to scale up even when faulting. Indeed, having threaded apps do multiple page faults at the same time is the main goal of this patch. However, I don't see how it would be good for scalability to have multiple threads fault in the same page at the same time, when they could just wait for one of them to do the work. Only faults for different addresses would proceed, not faults for the same address... regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... 
http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-18 20:59 ` Rik van Riel @ 2001-03-19 1:21 ` Linus Torvalds 2001-03-19 2:59 ` Rik van Riel 0 siblings, 1 reply; 36+ messages in thread From: Linus Torvalds @ 2001-03-19 1:21 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel On Sun, 18 Mar 2001, Rik van Riel wrote: > > Indeed, having threaded apps do multiple page faults at the > same time is the main goal of this patch. However, I don't > see how it would be good for scalability to have multiple > threads fault in the same page at the same time, when they > could just wait for one of them to do the work. But they will. That's what lock_page() etc are there for - there's no need for the VM to synchronize because we already have the synchronization primitives at a lower level. And there isn't any other lock that could work anyway. It's either the whole MM or a page. There's nothing in between. Linus ^ permalink raw reply [flat|nested] 36+ messages in thread
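Linus's remark that same-page faults already wait on each other through lock_page() can be illustrated with a per-page lock in user space. This is a sketch under assumptions, not the kernel's page cache code: struct page_stub and get_page_contents() are hypothetical names, a pthread mutex stands in for the page lock, and a counter stands in for the I/O that fills the page.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a page with its own lock. */
struct page_stub {
    pthread_mutex_t lock;   /* analogue of lock_page()/unlock_page() */
    int uptodate;           /* set once the contents are valid */
    int reads_issued;       /* how many times the expensive fill ran */
};

/*
 * Every faulting thread calls this. The first one to get the page lock
 * performs the fill; later callers wait on the lock, then find the page
 * already up to date and skip the work.
 */
void get_page_contents(struct page_stub *p)
{
    int rc = pthread_mutex_lock(&p->lock);
    assert(rc == 0);

    if (!p->uptodate) {
        p->reads_issued++;  /* placeholder for the real I/O */
        p->uptodate = 1;
    }

    rc = pthread_mutex_unlock(&p->lock);
    assert(rc == 0);
}
```

Because the lock belongs to the page rather than to the whole address space, threads faulting on different pages never contend here, while threads faulting on the same page end up waiting for one of them to do the work.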
* Re: changing mm->mmap_sem (was: Re: system call for process information?) 2001-03-19 1:21 ` Linus Torvalds @ 2001-03-19 2:59 ` Rik van Riel 0 siblings, 0 replies; 36+ messages in thread From: Rik van Riel @ 2001-03-19 2:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Sun, 18 Mar 2001, Linus Torvalds wrote: > On Sun, 18 Mar 2001, Rik van Riel wrote: > > > > Indeed, having threaded apps do multiple page faults at the > > same time is the main goal of this patch. However, I don't > > see how it would be good for scalability to have multiple > > threads fault in the same page at the same time, when they > > could just wait for one of them to do the work. > > But they will. > > That's what lock_page() etc are there for - there's no need for the VM > to synchronize because we already have the synchronization primitives > at a lower level. Indeed. I'll go multithread the do_no_page and do_swap_page functions tomorrow (maybe even tonight ;)). regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2001-03-19 12:58 UTC | newest] Thread overview: 36+ messages 2001-03-12 17:08 system call for process information? Guennadi Liakhovetski 2001-03-12 18:27 ` Alexander Viro 2001-03-12 21:21 ` Guennadi Liakhovetski 2001-03-13 2:56 ` Nathan Paul Simons 2001-03-13 3:20 ` Alexander Viro 2001-03-13 9:55 ` Guennadi Liakhovetski 2001-03-13 21:05 ` Albert D. Cahalan 2001-03-13 22:02 ` Nathan Paul Simons 2001-03-13 22:50 ` Albert D. Cahalan 2001-03-13 22:52 ` Rik van Riel 2001-03-14 1:53 ` Martin Dalecki 2001-03-14 2:28 ` Rik van Riel 2001-03-14 8:24 ` george anzinger 2001-03-14 19:19 ` Rik van Riel 2001-03-14 16:27 ` george anzinger 2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel 2001-03-16 9:49 ` Stephen C. Tweedie 2001-03-16 11:50 ` Rik van Riel 2001-03-16 12:53 ` Stephen C. Tweedie 2001-03-18 7:23 ` Rik van Riel 2001-03-18 9:56 ` Mike Galbraith 2001-03-18 10:46 ` Rik van Riel 2001-03-18 12:33 ` Mike Galbraith 2001-03-14 1:59 ` system call for process information? john slee 2001-03-14 19:53 ` Szabolcs Szakacsits 2001-03-14 19:55 ` Alexander Viro 2001-03-14 20:23 ` Szabolcs Szakacsits 2001-03-14 20:21 ` Alexander Viro 2001-03-18 9:34 changing mm->mmap_sem (was: Re: system call for process information?) Manfred Spraul 2001-03-18 10:56 ` Rik van Riel 2001-03-19 12:54 ` Stephen C. Tweedie [not found] <Pine.LNX.4.33.0103181407520.1426-100000@mikeg.weiden.de> 2001-03-18 14:43 ` Rik van Riel 2001-03-18 18:13 ` Linus Torvalds [not found] <200103181813.KAA22153@penguin.transmeta.com> 2001-03-18 20:59 ` Rik van Riel 2001-03-19 1:21 ` Linus Torvalds 2001-03-19 2:59 ` Rik van Riel