> > But why? Are we going to get rid of cpumask_t (which is a fixed sized
> > struct so direct assignment is perfectly fine)?
> >
> > Also, do we want to convert cpus_allowed to cpumask_var_t? It would save
> > quite a lot of memory on distro configs that set NR_CPUS silly high.
> > Currently NR_CPUS=4096 configs allocate 512 bytes per task for this
> > bitmap, 511 of which will never be used on most machines (510 in the
> > near future).
> >
> > The cost is of course an extra memory dereference in scheduler hot
> > paths.. also not nice.

Probably measurement data speaks more clearly than my poor English...
I made a concept-proof patch today. The result is better than I expected.

before:
 Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

    1603777813  cache-references         #   56.987 M/sec  ( +- 1.824% )  (scaled from 25.36%)
      13780381  cache-misses             #    0.490 M/sec  ( +- 1.360% )  (scaled from 25.55%)
   24872032348  L1-dcache-loads          #  883.770 M/sec  ( +- 0.666% )  (scaled from 25.51%)
     640394580  L1-dcache-load-misses    #   22.755 M/sec  ( +- 0.796% )  (scaled from 25.47%)

    14.162411769  seconds time elapsed   ( +- 0.675% )

after:
 Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

    1416147603  cache-references         #   51.566 M/sec  ( +- 4.407% )  (scaled from 25.40%)
      10920284  cache-misses             #    0.398 M/sec  ( +- 5.454% )  (scaled from 25.56%)
   24666962632  L1-dcache-loads          #  898.196 M/sec  ( +- 1.747% )  (scaled from 25.54%)
     598640329  L1-dcache-load-misses    #   21.798 M/sec  ( +- 2.504% )  (scaled from 25.50%)

    13.812193312  seconds time elapsed   ( +- 1.696% )

* detailed data is in result.txt

The trick is:
 - Typical Linux userland applications don't use the mempolicy and/or
   cpusets APIs at all.
 - Then, 99.99% of threads' tsk->cpus_allowed contain cpu_all_mask.
 - In the cpu_all_mask case, every thread can share the same bitmap.
   It may help to reduce L1 cache misses in the scheduler.

What do you think?
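
To make the sharing idea concrete, here is a small self-contained userspace
sketch in C. It is not the actual concept-proof patch and does not use the
real kernel types; every name in it (struct cpumask_stub, struct task_stub,
set_affinity, all_cpus_mask, ...) is hypothetical and only mimics what a
pointer-based cpus_allowed could do when a task's affinity equals
cpu_all_mask:

/*
 * Sketch only: tasks whose affinity equals the "all CPUs" mask point at
 * one shared read-only bitmap instead of each carrying a private
 * 512-byte copy (NR_CPUS=4096).  All identifiers are made up for
 * illustration.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define NR_CPUS    4096
#define MASK_LONGS (NR_CPUS / (8 * sizeof(unsigned long)))

struct cpumask_stub { unsigned long bits[MASK_LONGS]; };

/* Single shared "all CPUs allowed" bitmap, analogous to cpu_all_mask. */
static struct cpumask_stub all_cpus_mask = {
	.bits = { [0 ... MASK_LONGS - 1] = ~0UL },
};

struct task_stub {
	/*
	 * A pointer instead of an embedded mask: in the common case it
	 * just points at all_cpus_mask, so every task dereferences the
	 * same cache lines in the scheduler hot path.
	 */
	struct cpumask_stub *cpus_allowed;
};

static bool mask_is_all(const struct cpumask_stub *m)
{
	return memcmp(m, &all_cpus_mask, sizeof(*m)) == 0;
}

/* Install a new affinity mask, sharing the global one when possible. */
static void set_affinity(struct task_stub *t, const struct cpumask_stub *newmask)
{
	if (t->cpus_allowed != &all_cpus_mask)
		free(t->cpus_allowed);

	if (mask_is_all(newmask)) {
		t->cpus_allowed = &all_cpus_mask;   /* shared, no allocation */
	} else {
		t->cpus_allowed = malloc(sizeof(*newmask));
		memcpy(t->cpus_allowed, newmask, sizeof(*newmask));
	}
}

int main(void)
{
	struct task_stub a = { .cpus_allowed = &all_cpus_mask };
	struct task_stub b = { .cpus_allowed = &all_cpus_mask };

	printf("a and b share the mask: %d\n", a.cpus_allowed == b.cpus_allowed);

	struct cpumask_stub restricted = { .bits = { 0x1 } };  /* CPU 0 only */
	set_affinity(&b, &restricted);
	printf("after restricting b, shared: %d\n", a.cpus_allowed == b.cpus_allowed);
	return 0;
}

The point of the indirection is that the fully-unrestricted case, which per
the numbers above covers nearly all threads, keeps dereferencing one shared,
effectively read-only bitmap, so scheduler hot paths keep hitting the same
hot cache lines instead of 512 cold per-task bytes; the price is the extra
pointer dereference mentioned in the quoted mail.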