On Fri, Mar 26, 2021, 22:05 Maximilian Böther <maximilian.boether@student.hpi.de> wrote:

> Hello!
>
> I am investigating an application that writes random data in fixed-size
> chunks (e.g. 4k) to random locations in a large buffer file. I have
> several processes (not threads) doing that, each process has its own
> buffer file assigned.
>
> If I use mmap+msync to write and persist data to disk, I see a
> performance spike for 16 processes, and a performance drop for more
> threads (32 processes). The CPU has 32 logical cores in total, and we
> are not CPU bound.
>
> If I use open+write+fsync, I do not see such a spike, instead a
> performance plateau (and mmap is slower than open/write).
>
> I've read multiple times [1,2] that both mmap and msync can take locks.
> With vtune, I analyzed that we are indeed spinlocking, and spending the
> most time in clear_page_erms and xas_load functions.
>
> However, when reading the source code for msync [3], I cannot understand
> whether these locks are global or per-file. The paper [2] states that
> the locks are on radix-trees within the kernel that are per-file,
> however, as I do observe some spinlocks in the kernel, I believe that
> some locks may be global, as I have one file per process.
>
> Do you have an explanation on why we have such a spike at 16 processes
> for mmap and input on the locking behavior of msync?
>
> Thank you!
>
> Best,
> Maximilian Böther
>
> [1] https://kb.pmem.io/development/100000025-Why-msync-is-less-optimal-for-persistent-memory/
>     - I know it's about PMem, but the lock argument is important
>
> [2] Optimizing Memory-mapped I/O for Fast Storage Devices, Papagiannis
>     et al., ATC '20
>
> [3] https://elixir.bootlin.com/linux/latest/source/mm/msync.c
>
> _______________________________________________
> Kernelnewbies mailing list
> Kernelnewbies@kernelnewbies.org
> https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

Is it NUMA?
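
One way to check is to rerun the comparison with each process bound to a single NUMA node, for example under `numactl --cpunodebind=0 --membind=0`. Below is a minimal Python sketch of the two write paths from the quoted message. It is not the original benchmark, and several things in it are assumptions for illustration: the function names, the temporary buffer file, and `os.pwrite` standing in for open+write. Python's `mmap.flush` is used because it calls msync on Unix.

```python
import mmap
import os
import random
import tempfile
import time

CHUNK = 4096  # chunk size taken from the quoted message ("e.g. 4k")


def make_buffer_file(size):
    """Create a (sparse) temporary buffer file of `size` bytes. Hypothetical helper."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, size)
    os.close(fd)
    return path


def write_mmap_msync(path, n_writes, rng):
    """Write random chunks through a shared mapping and msync each one."""
    size = os.path.getsize(path)
    align = max(CHUNK, mmap.PAGESIZE)  # msync needs a page-aligned start address
    written = 0
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as m:
        for _ in range(n_writes):
            off = rng.randrange(size // align) * align
            m[off:off + CHUNK] = os.urandom(CHUNK)
            m.flush(off, CHUNK)  # msync on this range only
            written += CHUNK
    return written


def write_pwrite_fsync(path, n_writes, rng):
    """Write random chunks with pwrite and fsync the file after each one."""
    size = os.path.getsize(path)
    written = 0
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(n_writes):
            off = rng.randrange(size // CHUNK) * CHUNK
            written += os.pwrite(fd, os.urandom(CHUNK), off)
            os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    path = make_buffer_file(1024 * CHUNK)
    try:
        for fn in (write_mmap_msync, write_pwrite_fsync):
            start = time.perf_counter()
            fn(path, 256, random.Random(0))
            print(fn.__name__, f"{time.perf_counter() - start:.3f}s")
    finally:
        os.remove(path)
```

Starting one copy per process, each with its own buffer file, once unpinned and once pinned to a node, would show whether the 16-process spike follows the NUMA topology. The file size and write count here are placeholders, far smaller than a real run would use.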