* [PATCH 0/2] Extend size of the MCE Records pool @ 2024-02-07 22:56 Avadhut Naik 2024-02-07 22:56 ` [PATCH 1/2] x86/MCE: " Avadhut Naik 2024-02-07 22:56 ` [PATCH 2/2] x86/MCE: Add command line option to extend " Avadhut Naik 0 siblings, 2 replies; 68+ messages in thread From: Avadhut Naik @ 2024-02-07 22:56 UTC (permalink / raw) To: x86, linux-edac; +Cc: bp, tony.luck, linux-kernel, yazen.ghannam, avadnaik This patchset extends size of the existing MCE records pool to ensure that MCE records are not dropped, particularly on systems with higher CPU count. This is a followup of the below discussion: https://lore.kernel.org/linux-edac/20231011163320.79732-1-sironi@amazon.de/ The fist patch extends the size of MCE Records pool in accordance to the CPU count of the system. The second patch adds a new command line parameter to extend the size of MCE Records pool by a predetermined number of pages. Avadhut Naik (2): x86/MCE: Extend size of the MCE Records pool x86/MCE: Add command line option to extend MCE Records pool .../admin-guide/kernel-parameters.txt | 2 + arch/x86/kernel/cpu/mce/core.c | 3 ++ arch/x86/kernel/cpu/mce/genpool.c | 38 +++++++++++++++++++ arch/x86/kernel/cpu/mce/internal.h | 1 + 4 files changed, 44 insertions(+) base-commit: 93e1e1fe2f97859cb079078b6b750542ebbfdea8 -- 2.34.1 ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-07 22:56 [PATCH 0/2] Extend size of the MCE Records pool Avadhut Naik @ 2024-02-07 22:56 ` Avadhut Naik 2024-02-08 0:02 ` Luck, Tony 2024-02-08 21:09 ` Sohil Mehta 2024-02-07 22:56 ` [PATCH 2/2] x86/MCE: Add command line option to extend " Avadhut Naik 1 sibling, 2 replies; 68+ messages in thread From: Avadhut Naik @ 2024-02-07 22:56 UTC (permalink / raw) To: x86, linux-edac; +Cc: bp, tony.luck, linux-kernel, yazen.ghannam, avadnaik Currently, 2 pages are allocated for the MCE Records pool during system bootup. Records of MCEs (struct mce) occurring on the system are added to the pool through mce_gen_pool_add() in MC context. These records are then decoded later, in process context through notifier chains. However, on systems with high CPU count, the 2 pages allocated for the pool might not be sufficient in some instances. Successive MCEs received back-to-back, before they are decoded through mce_gen_pool_process(), will result in the pool getting exhausted. Consequently, some MCE records will be missed. The issue further intensifies since the amount of memory associated with a system typically increases with the CPU count, thereby, increasing the probability of MCEs being received. The limit of 2 pages for the MCE records pool was set more than 8 years ago and has not been revised till date. The CPU count and the amount of memory associated with a system however, have increased enormously since then. Additionally, the size of MCE Records (struct mce) too has increased from 88 bytes to 124 bytes. Extend the size of MCE Records pool to better serve modern systems. The increase in size depends on the CPU count of the system. Currently, since size of struct mce is 124 bytes, each logical CPU of the system will have space for at least 2 MCE records available in the pool. To get around the allocation woes during early boot time, the same is undertaken using late_initcall(). Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> --- arch/x86/kernel/cpu/mce/core.c | 3 +++ arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++++++++ arch/x86/kernel/cpu/mce/internal.h | 1 + 3 files changed, 26 insertions(+) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index b5cc557cfc37..5d6d7994d549 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void) if (mca_cfg.recovery) enable_copy_mc_fragile(); + if (mce_gen_pool_extend()) + pr_info("Couldn't extend MCE records pool!\n"); + mcheck_debugfs_init(); /* diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c index fbe8b61c3413..aed01612d342 100644 --- a/arch/x86/kernel/cpu/mce/genpool.c +++ b/arch/x86/kernel/cpu/mce/genpool.c @@ -20,6 +20,7 @@ * 2 pages to save MCE events for now (~80 MCE records at most). */ #define MCE_POOLSZ (2 * PAGE_SIZE) +#define CPU_GEN_MEMSZ 256 static struct gen_pool *mce_evt_pool; static LLIST_HEAD(mce_event_llist); @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce) return 0; } +int mce_gen_pool_extend(void) +{ + unsigned long addr, len; + int ret = -ENOMEM; + u32 num_threads; + + num_threads = num_present_cpus(); + len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ); + addr = (unsigned long)kzalloc(len, GFP_KERNEL); + + if (!addr) + goto out; + + ret = gen_pool_add(mce_evt_pool, addr, len, -1); + if (ret) + kfree((void *)addr); + +out: + return ret; +} + static int mce_gen_pool_create(void) { struct gen_pool *tmpp; diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h index 01f8f03969e6..81e35ec58ebc 100644 --- a/arch/x86/kernel/cpu/mce/internal.h +++ b/arch/x86/kernel/cpu/mce/internal.h @@ -33,6 +33,7 @@ void mce_gen_pool_process(struct work_struct *__unused); bool mce_gen_pool_empty(void); int mce_gen_pool_add(struct mce *mce); int mce_gen_pool_init(void); +int mce_gen_pool_extend(void); struct llist_node *mce_gen_pool_prepare_records(void); int mce_severity(struct mce *a, struct pt_regs *regs, char **msg, bool is_excp); -- 2.34.1 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* RE: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-07 22:56 ` [PATCH 1/2] x86/MCE: " Avadhut Naik @ 2024-02-08 0:02 ` Luck, Tony 2024-02-08 17:41 ` Naik, Avadhut 2024-02-08 21:09 ` Sohil Mehta 1 sibling, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-02-08 0:02 UTC (permalink / raw) To: Avadhut Naik, x86, linux-edac; +Cc: bp, linux-kernel, yazen.ghannam, avadnaik > +#define CPU_GEN_MEMSZ 256 What is this define? Why isn't this "sizeof(struct mce)"? Or 2* that if you are trying to reserve enough space for two records per CPU. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-08 0:02 ` Luck, Tony @ 2024-02-08 17:41 ` Naik, Avadhut 2024-02-08 17:47 ` Naik, Avadhut 2024-02-08 18:39 ` Luck, Tony 0 siblings, 2 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-08 17:41 UTC (permalink / raw) To: Luck, Tony, x86, linux-edac; +Cc: bp, linux-kernel, yazen.ghannam, Avadhut Naik Hi On 2/7/2024 18:02, Luck, Tony wrote: >> +#define CPU_GEN_MEMSZ 256 > > What is this define? > > Why isn't this "sizeof(struct mce)"? > > Or 2* that if you are trying to reserve enough space for two records per CPU. > That's the memory in bytes reserved for each logical CPU in the extended MCE Records pool. By current size of struct mce that equates to around 2 records. Will change it to (2 * sizeof(struct mce)) though. Feels more accurate. Thanks for the suggestion! Do you have any additional concerns/comments on this patchset? > -Tony -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-08 17:41 ` Naik, Avadhut @ 2024-02-08 17:47 ` Naik, Avadhut 2024-02-08 18:39 ` Luck, Tony 1 sibling, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-08 17:47 UTC (permalink / raw) To: Luck, Tony, x86, linux-edac; +Cc: bp, linux-kernel, yazen.ghannam, Avadhut Naik On 2/8/2024 11:41, Naik, Avadhut wrote: > Hi > > On 2/7/2024 18:02, Luck, Tony wrote: >>> +#define CPU_GEN_MEMSZ 256 >> >> What is this define? >> >> Why isn't this "sizeof(struct mce)"? >> >> Or 2* that if you are trying to reserve enough space for two records per CPU. >> > That's the memory in bytes reserved for each logical CPU in the > extended MCE Records pool. By current size of struct mce that > equates to around 2 records. This memory will be reserved for each logical CPU only when the command line parameter being introduced through the second patch "mce-genpool-extend" is not set. > > Will change it to (2 * sizeof(struct mce)) though. Feels more > accurate. Thanks for the suggestion! > > Do you have any additional concerns/comments on this patchset? > >> -Tony > -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-08 17:41 ` Naik, Avadhut 2024-02-08 17:47 ` Naik, Avadhut @ 2024-02-08 18:39 ` Luck, Tony 2024-02-09 19:47 ` Naik, Avadhut 1 sibling, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-02-08 18:39 UTC (permalink / raw) To: Naik, Avadhut, x86, linux-edac Cc: bp, linux-kernel, yazen.ghannam, Avadhut Naik > Will change it to (2 * sizeof(struct mce)) though. Feels more > accurate. Thanks for the suggestion! Thanks. > Do you have any additional concerns/comments on this patchset? Overall this is an excellent addition. Reserved space to log errors does need to scale up with the CPU count. I think part 1 (unconditional increase based on CPU count) is a "must have" enhancement. With the change to CPU_GEN_MEMSZ #define: Reviewed-by: Tony Luck <tony.luck@intel.com> I'm less enthusiastic about part 2 adding a command line option to override the code in part 1 with a bigger (or smaller?) amount. Can you describe some situation where a user would need to use this? -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-08 18:39 ` Luck, Tony @ 2024-02-09 19:47 ` Naik, Avadhut 0 siblings, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-09 19:47 UTC (permalink / raw) To: Luck, Tony, x86, linux-edac; +Cc: bp, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/8/2024 12:39, Luck, Tony wrote: >> Will change it to (2 * sizeof(struct mce)) though. Feels more >> accurate. Thanks for the suggestion! > > Thanks. > >> Do you have any additional concerns/comments on this patchset? > > Overall this is an excellent addition. Reserved space to log errors does need to scale > up with the CPU count. > > I think part 1 (unconditional increase based on CPU count) is a "must have" enhancement. > With the change to CPU_GEN_MEMSZ #define: > > Reviewed-by: Tony Luck <tony.luck@intel.com> > > > I'm less enthusiastic about part 2 adding a command line option to override the code in > part 1 with a bigger (or smaller?) amount. Can you describe some situation where a user > would need to use this? > I added the command-line option to ensure that we have covered all bases and are not enforcing this memory footprint on all users. A system with 512 logical CPUs, by the current proposed logic, will have 32 pages allocated for the pool ((512*256)/4096)). Some users may feel that this is not needed on their systems and they can do with just, maybe, 16 pages. The command line option gives them the flexibility to do so without having to change kernel code, rebuild and deploy. Conversely, some users wanting to err on the side of caution, might feel that the above 32 pages are not enough for the pool and may want to allocate more, maybe, 48 pages. The command line option again, provides them with the flexibility to do so. Sounds reasonable? -- Thanks, Avadhut Naik > -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-07 22:56 ` [PATCH 1/2] x86/MCE: " Avadhut Naik 2024-02-08 0:02 ` Luck, Tony @ 2024-02-08 21:09 ` Sohil Mehta 2024-02-09 19:52 ` Naik, Avadhut 1 sibling, 1 reply; 68+ messages in thread From: Sohil Mehta @ 2024-02-08 21:09 UTC (permalink / raw) To: Avadhut Naik, x86, linux-edac Cc: bp, tony.luck, linux-kernel, yazen.ghannam, avadnaik On 2/7/2024 2:56 PM, Avadhut Naik wrote: > Extend the size of MCE Records pool to better serve modern systems. The > increase in size depends on the CPU count of the system. Currently, since > size of struct mce is 124 bytes, each logical CPU of the system will have > space for at least 2 MCE records available in the pool. To get around the > allocation woes during early boot time, the same is undertaken using > late_initcall(). > I guess making this proportional to the number of CPUs is probably fine assuming CPUs and memory capacity *would* generally increase in sync. But, is there some logic to having 2 MCE records per logical cpu or it is just a heuristic approach? In practice, the pool is shared amongst all MCE sources and can be filled by anyone, right? > Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> > --- > arch/x86/kernel/cpu/mce/core.c | 3 +++ > arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++++++++ > arch/x86/kernel/cpu/mce/internal.h | 1 + > 3 files changed, 26 insertions(+) > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index b5cc557cfc37..5d6d7994d549 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void) > if (mca_cfg.recovery) > enable_copy_mc_fragile(); > > + if (mce_gen_pool_extend()) > + pr_info("Couldn't extend MCE records pool!\n"); > + Why do this unconditionally? For a vast majority of low core-count, low memory systems the default 2 pages would be good enough. Should there be a threshold beyond which the extension becomes active? Let's say, for example, a check for num_present_cpus() > 32 (Roughly based on 8Kb memory and 124b*2 estimate per logical CPU). Whatever you choose, a comment above the code would be helpful describing when the extension is expected to be useful. > mcheck_debugfs_init(); > > /* > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c > index fbe8b61c3413..aed01612d342 100644 > --- a/arch/x86/kernel/cpu/mce/genpool.c > +++ b/arch/x86/kernel/cpu/mce/genpool.c > @@ -20,6 +20,7 @@ > * 2 pages to save MCE events for now (~80 MCE records at most). > */ > #define MCE_POOLSZ (2 * PAGE_SIZE) > +#define CPU_GEN_MEMSZ 256 > The comment above MCE_POOLSZ probably needs a complete re-write. Right now, it reads as follows: * This memory pool is only to be used to save MCE records in MCE context. * MCE events are rare, so a fixed size memory pool should be enough. Use * 2 pages to save MCE events for now (~80 MCE records at most). Apart from the numbers being incorrect since sizeof(struct mce) has increased, this patch is based on the assumption that the current MCE memory pool is no longer enough in certain cases. > static struct gen_pool *mce_evt_pool; > static LLIST_HEAD(mce_event_llist); > @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce) > return 0; > } > > +int mce_gen_pool_extend(void) > +{ > + unsigned long addr, len; s/len/size/ > + int ret = -ENOMEM; > + u32 num_threads; > + > + num_threads = num_present_cpus(); > + len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ); Nit: Can the use of the num_threads variable be avoided? How about: size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ); > + addr = (unsigned long)kzalloc(len, GFP_KERNEL); Also, shouldn't the new allocation be incremental to the 2 pages already present? Let's say, for example, that you have a 40-cpu system and the calculated size in this case comes out to 40 * 2 * 128b = 9920bytes i.e. 3 pages. You only need to allocate 1 additional page to add to mce_evt_pool instead of the 3 pages that the current code does. Sohil > + > + if (!addr) > + goto out; > + > + ret = gen_pool_add(mce_evt_pool, addr, len, -1); > + if (ret) > + kfree((void *)addr); > + > +out: > + return ret; > +} > + > static int mce_gen_pool_create(void) > { > struct gen_pool *tmpp; ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool 2024-02-08 21:09 ` Sohil Mehta @ 2024-02-09 19:52 ` Naik, Avadhut 0 siblings, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-09 19:52 UTC (permalink / raw) To: Sohil Mehta, x86, linux-edac Cc: bp, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/8/2024 15:09, Sohil Mehta wrote: > On 2/7/2024 2:56 PM, Avadhut Naik wrote: > >> Extend the size of MCE Records pool to better serve modern systems. The >> increase in size depends on the CPU count of the system. Currently, since >> size of struct mce is 124 bytes, each logical CPU of the system will have >> space for at least 2 MCE records available in the pool. To get around the >> allocation woes during early boot time, the same is undertaken using >> late_initcall(). >> > > I guess making this proportional to the number of CPUs is probably fine > assuming CPUs and memory capacity *would* generally increase in sync. > > But, is there some logic to having 2 MCE records per logical cpu or it > is just a heuristic approach? In practice, the pool is shared amongst > all MCE sources and can be filled by anyone, right? > Yes, the pool is shared among all MCE sources but the logic for 256 is that the genpool was set to 2 pages i.e. 8192 bytes in 2015. Around that time, AFAIK, the max number of logical CPUs on a system was 32. So, in the maximum case, each CPU will have around 256 bytes (8192/32) in the pool. It equates to approximately 2 MCE records since sizeof(struct mce) back then was 88 bytes. >> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> >> --- >> arch/x86/kernel/cpu/mce/core.c | 3 +++ >> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++++++++ >> arch/x86/kernel/cpu/mce/internal.h | 1 + >> 3 files changed, 26 insertions(+) >> >> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c >> index b5cc557cfc37..5d6d7994d549 100644 >> --- a/arch/x86/kernel/cpu/mce/core.c >> +++ b/arch/x86/kernel/cpu/mce/core.c >> @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void) >> if (mca_cfg.recovery) >> enable_copy_mc_fragile(); >> >> + if (mce_gen_pool_extend()) >> + pr_info("Couldn't extend MCE records pool!\n"); >> + > > Why do this unconditionally? For a vast majority of low core-count, low > memory systems the default 2 pages would be good enough. > > Should there be a threshold beyond which the extension becomes active? > Let's say, for example, a check for num_present_cpus() > 32 (Roughly > based on 8Kb memory and 124b*2 estimate per logical CPU). > > Whatever you choose, a comment above the code would be helpful > describing when the extension is expected to be useful. > Put it in unconditionally because IMO the increase in memory even for low-core systems didn't seem to be substantial. Just an additional page for systems with less than 16 CPUs. But I do get your point. Will add a check in mcheck_late_init() for CPUs present. Something like below: @@ -2901,7 +2901,7 @@ static int __init mcheck_late_init(void) if (mca_cfg.recovery) enable_copy_mc_fragile(); - if (mce_gen_pool_extend()) + if ((num_present_cpus() > 32) && mce_gen_pool_extend()) pr_info("Couldn't extend MCE records pool!\n"); Does this look good? The genpool extension will then be undertaken only for systems with more than 32 CPUs. Will explain the same in a comment. >> mcheck_debugfs_init(); >> >> /* >> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c >> index fbe8b61c3413..aed01612d342 100644 >> --- a/arch/x86/kernel/cpu/mce/genpool.c >> +++ b/arch/x86/kernel/cpu/mce/genpool.c >> @@ -20,6 +20,7 @@ >> * 2 pages to save MCE events for now (~80 MCE records at most). >> */ >> #define MCE_POOLSZ (2 * PAGE_SIZE) >> +#define CPU_GEN_MEMSZ 256 >> > > The comment above MCE_POOLSZ probably needs a complete re-write. Right > now, it reads as follows: > > * This memory pool is only to be used to save MCE records in MCE context. > * MCE events are rare, so a fixed size memory pool should be enough. Use > * 2 pages to save MCE events for now (~80 MCE records at most). > > Apart from the numbers being incorrect since sizeof(struct mce) has > increased, this patch is based on the assumption that the current MCE > memory pool is no longer enough in certain cases. > Yes, will change the comment to something like below: * This memory pool is only to be used to save MCE records in MCE context. * Though MCE events are rare, their frequency typically depends on the * system's memory and CPU count. * Allocate 2 pages to the MCE Records pool during early boot with the * option to extend the pool, as needed, through command line, for systems * with CPU count of more than 32. * By default, each logical CPU can have around 2 MCE records in the pool * at the same time. Sounds good? >> static struct gen_pool *mce_evt_pool; >> static LLIST_HEAD(mce_event_llist); >> @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce) >> return 0; >> } >> >> +int mce_gen_pool_extend(void) >> +{ >> + unsigned long addr, len; > > s/len/size/ > Noted. >> + int ret = -ENOMEM; >> + u32 num_threads; >> + >> + num_threads = num_present_cpus(); >> + len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ); > > Nit: Can the use of the num_threads variable be avoided? > How about: > > size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ); > Will do. > > >> + addr = (unsigned long)kzalloc(len, GFP_KERNEL); > > Also, shouldn't the new allocation be incremental to the 2 pages already > present? > > Let's say, for example, that you have a 40-cpu system and the calculated > size in this case comes out to 40 * 2 * 128b = 9920bytes i.e. 3 pages. > You only need to allocate 1 additional page to add to mce_evt_pool > instead of the 3 pages that the current code does. > Will make it incremental when genpool extension is being undertaken through the default means. Something like below: @@ -129,6 +134,7 @@ int mce_gen_pool_extend(void) } else { num_threads = num_present_cpus(); len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ); + len -= MCE_POOLSZ; Does this sound good? -- Thanks, Avadhut Naik > Sohil > >> + >> + if (!addr) >> + goto out; >> + >> + ret = gen_pool_add(mce_evt_pool, addr, len, -1); >> + if (ret) >> + kfree((void *)addr); >> + >> +out: >> + return ret; >> +} >> + >> static int mce_gen_pool_create(void) >> { >> struct gen_pool *tmpp; > > ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-07 22:56 [PATCH 0/2] Extend size of the MCE Records pool Avadhut Naik 2024-02-07 22:56 ` [PATCH 1/2] x86/MCE: " Avadhut Naik @ 2024-02-07 22:56 ` Avadhut Naik 2024-02-09 1:36 ` Sohil Mehta 1 sibling, 1 reply; 68+ messages in thread From: Avadhut Naik @ 2024-02-07 22:56 UTC (permalink / raw) To: x86, linux-edac; +Cc: bp, tony.luck, linux-kernel, yazen.ghannam, avadnaik Extension of MCE Records pool, based on system's CPU count, was undertaken through the previous patch (x86/MCE: Extend size of the MCE Records pool). Add a new command line parameter "mce-genpool-extend" to set the size of MCE Records pool to a predetermined number of pages instead of system's CPU count. Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> --- .../admin-guide/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++--- 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3a1fa1f81d9d..62e7da4d9fda 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3135,6 +3135,8 @@ mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst + mce-genpool-extend= [X86-64] Number of pages to add to MCE Records pool. + md= [HW] RAID subsystems devices and level See Documentation/admin-guide/md.rst. diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c index aed01612d342..d6e04fa5ee07 100644 --- a/arch/x86/kernel/cpu/mce/genpool.c +++ b/arch/x86/kernel/cpu/mce/genpool.c @@ -22,6 +22,7 @@ #define MCE_POOLSZ (2 * PAGE_SIZE) #define CPU_GEN_MEMSZ 256 +static unsigned int mce_genpool_extend; static struct gen_pool *mce_evt_pool; static LLIST_HEAD(mce_event_llist); static char gen_pool_buf[MCE_POOLSZ]; @@ -123,10 +124,14 @@ int mce_gen_pool_extend(void) int ret = -ENOMEM; u32 num_threads; - num_threads = num_present_cpus(); - len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ); - addr = (unsigned long)kzalloc(len, GFP_KERNEL); + if (mce_genpool_extend) { + len = mce_genpool_extend * PAGE_SIZE; + } else { + num_threads = num_present_cpus(); + len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ); + } + addr = (unsigned long)kzalloc(len, GFP_KERNEL); if (!addr) goto out; @@ -159,6 +164,17 @@ static int mce_gen_pool_create(void) return ret; } +static int __init parse_mce_genpool_extend(char *str) +{ + if (*str == '=') + str++; + + get_option(&str, &mce_genpool_extend); + return 1; +} + +__setup("mce-genpool-extend", parse_mce_genpool_extend); + int mce_gen_pool_init(void) { /* Just init mce_gen_pool once. */ -- 2.34.1 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-07 22:56 ` [PATCH 2/2] x86/MCE: Add command line option to extend " Avadhut Naik @ 2024-02-09 1:36 ` Sohil Mehta 2024-02-09 20:02 ` Naik, Avadhut 0 siblings, 1 reply; 68+ messages in thread From: Sohil Mehta @ 2024-02-09 1:36 UTC (permalink / raw) To: Avadhut Naik, x86, linux-edac Cc: bp, tony.luck, linux-kernel, yazen.ghannam, avadnaik On 2/7/2024 2:56 PM, Avadhut Naik wrote: > Extension of MCE Records pool, based on system's CPU count, was undertaken > through the previous patch (x86/MCE: Extend size of the MCE Records pool). > This statement is unnecessary. > Add a new command line parameter "mce-genpool-extend" to set the size of > MCE Records pool to a predetermined number of pages instead of system's > CPU count. > Like Tony, I am unsure of when this would be useful. Also, why does it need to override the CPU count based extension mechanism? Would just adding more pages not work for them? If there really is a good reason to do this, how about changing mce-genpool-extend to mce-genpool-add-pages and keeping the description the same? mce-genpool-add-pages= [X86-64] Number of pages to add to MCE Records pool. Then you can simply add these many number of additional pages to the new CPU based mechanism. Sohil > Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> > --- > .../admin-guide/kernel-parameters.txt | 2 ++ > arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++--- > 2 files changed, 21 insertions(+), 3 deletions(-) > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 1:36 ` Sohil Mehta @ 2024-02-09 20:02 ` Naik, Avadhut 2024-02-09 20:09 ` Borislav Petkov 2024-02-09 20:16 ` Sohil Mehta 0 siblings, 2 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-09 20:02 UTC (permalink / raw) To: Sohil Mehta, x86, linux-edac Cc: bp, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/8/2024 19:36, Sohil Mehta wrote: > On 2/7/2024 2:56 PM, Avadhut Naik wrote: >> Extension of MCE Records pool, based on system's CPU count, was undertaken >> through the previous patch (x86/MCE: Extend size of the MCE Records pool). >> > > This statement is unnecessary. > Noted. >> Add a new command line parameter "mce-genpool-extend" to set the size of >> MCE Records pool to a predetermined number of pages instead of system's >> CPU count. >> > > Like Tony, I am unsure of when this would be useful. > > Also, why does it need to override the CPU count based extension > mechanism? Would just adding more pages not work for them? > > If there really is a good reason to do this, how about changing > mce-genpool-extend to mce-genpool-add-pages and keeping the description > the same? > > mce-genpool-add-pages= [X86-64] Number of pages to add to MCE Records pool. > > Then you can simply add these many number of additional pages to the new > CPU based mechanism. > Is it safe to assume that users will always want to increase the size of the pool and not decrease it? IMO, the command-line option provides flexibility for users to choose the size of MCE Records pool in case, they don't agree with the CPU count logic. Just added it to ensure that we are not enforcing this increased memory footprint across the board. Would you agree? -- Thanks, Avadhut Naik > Sohil > >> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> >> --- >> .../admin-guide/kernel-parameters.txt | 2 ++ >> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++--- >> 2 files changed, 21 insertions(+), 3 deletions(-) >> > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 20:02 ` Naik, Avadhut @ 2024-02-09 20:09 ` Borislav Petkov 2024-02-09 20:35 ` Naik, Avadhut 2024-02-09 20:16 ` Sohil Mehta 1 sibling, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-09 20:09 UTC (permalink / raw) To: Naik, Avadhut Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote: > Is it safe to assume that users will always want to increase the size > of the pool and not decrease it? Why don't you make the gen pool size a function of the number of CPUs on the system and have it all work automagically? Burdening the user with yet another cmdline switch is a bad idea. We have way too many as it is. This stuff should work out-of-the-box, without user intervention if possible. And it is possible in this case. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 20:09 ` Borislav Petkov @ 2024-02-09 20:35 ` Naik, Avadhut 2024-02-09 20:51 ` Borislav Petkov 0 siblings, 1 reply; 68+ messages in thread From: Naik, Avadhut @ 2024-02-09 20:35 UTC (permalink / raw) To: Borislav Petkov Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/9/2024 14:09, Borislav Petkov wrote: > On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote: >> Is it safe to assume that users will always want to increase the size >> of the pool and not decrease it? > > Why don't you make the gen pool size a function of the number of CPUs on > the system and have it all work automagically? > IIUC, this is exactly what the first patch in this series is trying to accomplish. Please correct me if I understood wrong. > Burdening the user with yet another cmdline switch is a bad idea. We > have way too many as it is. > > This stuff should work out-of-the-box, without user intervention if > possible. And it is possible in this case. > Okay. Will drop the command line argument. -- Thanks, Avadhut Naik > Thx. > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 20:35 ` Naik, Avadhut @ 2024-02-09 20:51 ` Borislav Petkov 2024-02-10 7:52 ` Borislav Petkov 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-09 20:51 UTC (permalink / raw) To: Naik, Avadhut Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote: > IIUC, this is exactly what the first patch in this series is trying to > accomplish. Please correct me if I understood wrong. Yes, you did. I don't mean to extend it - I mean to allocate it from the very beginning to min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE); There's a sane minimum and one page pro logical CPU should be fine on pretty much every configuration... Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 20:51 ` Borislav Petkov @ 2024-02-10 7:52 ` Borislav Petkov 2024-02-10 21:15 ` Naik, Avadhut 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-10 7:52 UTC (permalink / raw) To: Naik, Avadhut Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <bp@alien8.de> wrote: >On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote: >> IIUC, this is exactly what the first patch in this series is trying to >> accomplish. Please correct me if I understood wrong. > >Yes, you did. > >I don't mean to extend it - I mean to allocate it from the very >beginning to > > min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE); max() ofc. >There's a sane minimum and one page pro logical CPU should be fine on >pretty much every configuration... > >Thx. > -- Sent from a small device: formatting sucks and brevity is inevitable. ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-10 7:52 ` Borislav Petkov @ 2024-02-10 21:15 ` Naik, Avadhut 2024-02-11 11:14 ` Borislav Petkov 0 siblings, 1 reply; 68+ messages in thread From: Naik, Avadhut @ 2024-02-10 21:15 UTC (permalink / raw) To: Borislav Petkov Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/10/2024 01:52, Borislav Petkov wrote: > On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <bp@alien8.de> wrote: >> On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote: >>> IIUC, this is exactly what the first patch in this series is trying to >>> accomplish. Please correct me if I understood wrong. >> >> Yes, you did. >> >> I don't mean to extend it - I mean to allocate it from the very >> beginning to >> >> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE); > IIUC, you wouldn't want to extend the pool through late_initcall(). Instead, you would want for memory to be allocated (on the heap) and size of the pool to be set at the very beginning i.e. when the pool is created (~2 seconds, according to dmesg timestamps). Please correct me if I have understood wrong. I was actually doing something similar initially (setting size of the pool right after its creation) but then went with extending the pool since I wasn't sure if heap allocations should be undertaken that early during bootup. > max() ofc. > Thanks! This does clear a one part of my confusion. -- Thanks, Avadhut Naik >> There's a sane minimum and one page pro logical CPU should be fine on >> pretty much every configuration... >> >> Thx. >> > > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-10 21:15 ` Naik, Avadhut @ 2024-02-11 11:14 ` Borislav Petkov 2024-02-12 2:54 ` Naik, Avadhut 2024-02-12 18:47 ` Yazen Ghannam 0 siblings, 2 replies; 68+ messages in thread From: Borislav Petkov @ 2024-02-11 11:14 UTC (permalink / raw) To: Naik, Avadhut Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote: > IIUC, you wouldn't want to extend the pool through late_initcall(). > Instead, you would want for memory to be allocated (on the heap) and > size of the pool to be set at the very beginning i.e. when the pool > is created (~2 seconds, according to dmesg timestamps). > > Please correct me if I have understood wrong. Nah, you got it right. I went, looked and realized that we have to do this early dance because we have no allocator yet. And we can't move this gen_pool allocation to later, when we *do* have an allocator because MCA is up and logging already. But your extending approach doesn't fly in all cases either: gen_pool_add->gen_pool_add_virt->gen_pool_add_owner it grabs the pool->lock spinlock and adds to &pool->chunks while, at the exact same time, gen_pool_alloc(), in *NMI* context iterates over that same &pool->chunks in the case we're logging an MCE at exact that same time when you're extending the buffer. And Tony already said that in the thread you're quoting: https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/ So no, that doesn't work either. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-11 11:14 ` Borislav Petkov @ 2024-02-12 2:54 ` Naik, Avadhut 2024-02-12 8:58 ` Borislav Petkov 2024-02-12 18:47 ` Yazen Ghannam 1 sibling, 1 reply; 68+ messages in thread From: Naik, Avadhut @ 2024-02-12 2:54 UTC (permalink / raw) To: Borislav Petkov Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/11/2024 05:14, Borislav Petkov wrote: > On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote: >> IIUC, you wouldn't want to extend the pool through late_initcall(). >> Instead, you would want for memory to be allocated (on the heap) and >> size of the pool to be set at the very beginning i.e. when the pool >> is created (~2 seconds, according to dmesg timestamps). >> >> Please correct me if I have understood wrong. > > Nah, you got it right. I went, looked and realized that we have to do > this early dance because we have no allocator yet. And we can't move > this gen_pool allocation to later, when we *do* have an allocator > because MCA is up and logging already. > Okay. Will make changes to allocate memory and set size of the pool when it is created. Also, will remove the command line parameter and resubmit. > But your extending approach doesn't fly in all cases either: > > gen_pool_add->gen_pool_add_virt->gen_pool_add_owner > > it grabs the pool->lock spinlock and adds to &pool->chunks while, at the > exact same time, gen_pool_alloc(), in *NMI* context iterates over that > same &pool->chunks in the case we're logging an MCE at exact that same > time when you're extending the buffer. > > And Tony already said that in the thread you're quoting: > > https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/ > > So no, that doesn't work either. > > Thx. Thanks for this explanation! > -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 2:54 ` Naik, Avadhut @ 2024-02-12 8:58 ` Borislav Petkov 2024-02-12 9:32 ` Borislav Petkov 2024-02-15 20:14 ` Naik, Avadhut 0 siblings, 2 replies; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 8:58 UTC (permalink / raw) To: Naik, Avadhut Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote: > Okay. Will make changes to allocate memory and set size of the pool > when it is created. Also, will remove the command line parameter and > resubmit. Before you do, go read that original thread again but this time take your time to grok it. And then try answering those questions: * Why are *you* fixing this? I know what the AWS reason is, what is yours? * Can you think of a slick deduplication scheme instead of blindly raising the buffer size? * What's wrong with not logging some early errors, can we live with that too? If it were firmware-first, it cannot simply extend its buffer size because it has limited space. So what does firmware do in such cases? Think long and hard about the big picture, analyze the problem properly and from all angles before you go and do patches. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 8:58 ` Borislav Petkov @ 2024-02-12 9:32 ` Borislav Petkov 2024-02-12 17:29 ` Luck, Tony 2024-02-15 20:15 ` Naik, Avadhut 2024-02-15 20:14 ` Naik, Avadhut 1 sibling, 2 replies; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 9:32 UTC (permalink / raw) To: Naik, Avadhut Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote: > * Can you think of a slick deduplication scheme instead of blindly > raising the buffer size? And here's the simplest scheme: you don't extend the buffer. On overflow, you say "Buffer full, here's the MCE" and you dump the error long into dmesg. Problem solved. A slicker deduplication scheme would be even better, tho. Maybe struct mce.count which gets incremented instead of adding the error record to the buffer again. And so on... -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 9:32 ` Borislav Petkov @ 2024-02-12 17:29 ` Luck, Tony 2024-02-12 17:54 ` Borislav Petkov 2024-02-15 20:15 ` Naik, Avadhut 1 sibling, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-02-12 17:29 UTC (permalink / raw) To: Borislav Petkov, Naik, Avadhut Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik > And here's the simplest scheme: you don't extend the buffer. On > overflow, you say "Buffer full, here's the MCE" and you dump the error > long into dmesg. Problem solved. > > A slicker deduplication scheme would be even better, tho. Maybe struct > mce.count which gets incremented instead of adding the error record to > the buffer again. And so on... Walking the structures already allocated from the genpool in the #MC handler may be possible, but what is the criteria for "duplicates"? Do we avoid entering duplicates into the pool altogether? Or when the pool is full overwrite a duplicate? How about compile time allocation of extra space. Algorithm below for illustrative purposes only. May need some more thought about how to scale up. -Tony [Diff pasted into Outlook, chances that it will automatically apply = 0%] diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c index fbe8b61c3413..0fc2925c0839 100644 --- a/arch/x86/kernel/cpu/mce/genpool.c +++ b/arch/x86/kernel/cpu/mce/genpool.c @@ -16,10 +16,15 @@ * used to save error information organized in a lock-less list. * * This memory pool is only to be used to save MCE records in MCE context. - * MCE events are rare, so a fixed size memory pool should be enough. Use - * 2 pages to save MCE events for now (~80 MCE records at most). + * MCE events are rare, so a fixed size memory pool should be enough. + * Low CPU count systems allocate 2 pages (enough for ~64 "struct mce" + * records). Large systems scale up the allocation based on CPU count. */ +#if CONFIG_NR_CPUS < 128 #define MCE_POOLSZ (2 * PAGE_SIZE) +#else +#define MCE_POOLSZ (CONFIG_NR_CPUS / 64 * PAGE_SIZE) +#endif static struct gen_pool *mce_evt_pool; static LLIST_HEAD(mce_event_llist); [agluck@agluck-desk3 mywork]$ vi arch/x86/kernel/cpu/mce/genpool.c [agluck@agluck-desk3 mywork]$ git diff diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c index fbe8b61c3413..47bf677578ca 100644 --- a/arch/x86/kernel/cpu/mce/genpool.c +++ b/arch/x86/kernel/cpu/mce/genpool.c @@ -16,10 +16,15 @@ * used to save error information organized in a lock-less list. * * This memory pool is only to be used to save MCE records in MCE context. - * MCE events are rare, so a fixed size memory pool should be enough. Use - * 2 pages to save MCE events for now (~80 MCE records at most). + * MCE events are rare, but scale up with CPU count. Low CPU count + * systems allocate 2 pages (enough for ~64 "struct mce" records). Large + * systems scale up the allocation based on CPU count. */ +#if CONFIG_NR_CPUS < 128 #define MCE_POOLSZ (2 * PAGE_SIZE) +#else +#define MCE_POOLSZ (CONFIG_NR_CPUS / 64 * PAGE_SIZE) +#endif static struct gen_pool *mce_evt_pool; static LLIST_HEAD(mce_event_llist); ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 17:29 ` Luck, Tony @ 2024-02-12 17:54 ` Borislav Petkov 2024-02-12 18:45 ` Luck, Tony 2024-02-15 20:18 ` [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool Naik, Avadhut 0 siblings, 2 replies; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 17:54 UTC (permalink / raw) To: Luck, Tony Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote: > Walking the structures already allocated from the genpool in the #MC > handler may be possible, but what is the criteria for "duplicates"? for each i in pool: memcmp(mce[i], new_mce, sizeof(struct mce)); It'll probably need to mask out fields like ->time etc. > Do we avoid entering duplicates into the pool altogether? We could do mce[i].count++; in the same loop. Dunno yet if we even need to state how many duplicates are there... > Or when the pool is full overwrite a duplicate? Nah, not overwrite the duplicate but not add the new one. Cheaper. > How about compile time allocation of extra space. Algorithm below for > illustrative purposes only. May need some more thought about how > to scale up. Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing - needs to be based on the actual number of CPUs on the machine. BUT, we don't have an allocator yet. So we end up allocating it there on the heap. Unless we can do something with memblock... But then this still needs code audit whether num_possible_cpus() is final at that point. Because if it is, that would be the optimal thing to do a memblock_alloc() there... > [Diff pasted into Outlook, chances that it will automatically apply = 0%] How you even do patches with outschmook is mindboggling :-) At least use Thunderbird. That's what I do for the company mail but then again I don't even try to do patches over company mail... Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 17:54 ` Borislav Petkov @ 2024-02-12 18:45 ` Luck, Tony 2024-02-12 19:14 ` Borislav Petkov 2024-02-15 20:18 ` [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool Naik, Avadhut 1 sibling, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-02-12 18:45 UTC (permalink / raw) To: Borislav Petkov Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik > Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing > - needs to be based on the actual number of CPUs on the machine. > > BUT, we don't have an allocator yet. > > So we end up allocating it there on the heap. > > Unless we can do something with memblock... > > But then this still needs code audit whether num_possible_cpus() is > final at that point. Because if it is, that would be the optimal thing > to do a memblock_alloc() there... I threw a printk() into mce_gen_pool_init() and got: [ 2.948285] mce: mce_gen_pool_init: num_possible_cpus = 144 So it is currently true that number of CPUs has been computed at this point. So using memblock_alloc() based on number of CPUs may work > > [Diff pasted into Outlook, chances that it will automatically apply = 0%] > > How you even do patches with outschmook is mindboggling :-) > > At least use Thunderbird. That's what I do for the company mail but then > again I don't even try to do patches over company mail... I use "git send-email" to send out patches, and usually "b4" to get them to my desktop. I do have "mutt" on there, but IT have made it complex for me to use it to simply read & reply. It requires separate mutt config files to fetch mail and a different config to send mail because of the firewall rules on the lab where my desktop lives. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 18:45 ` Luck, Tony @ 2024-02-12 19:14 ` Borislav Petkov 2024-02-12 19:41 ` Luck, Tony 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 19:14 UTC (permalink / raw) To: Luck, Tony Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Mon, Feb 12, 2024 at 06:45:34PM +0000, Luck, Tony wrote: > So it is currently true that number of CPUs has been computed at this point. It needs a proper explanation why that's ok rather than an empirical test only. > I use "git send-email" to send out patches, and usually "b4" to get them > to my desktop. I do have "mutt" on there, but IT have made it complex > for me to use it to simply read & reply. IT departments all around the world make sure of that. :) > It requires separate mutt config files to fetch mail and a different > config to send mail because of the firewall rules on the lab where my > desktop lives. Yeah, that's why I'm working with !company mail account. For my own sanity. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 19:14 ` Borislav Petkov @ 2024-02-12 19:41 ` Luck, Tony 2024-02-12 21:37 ` Tony Luck 0 siblings, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-02-12 19:41 UTC (permalink / raw) To: Borislav Petkov Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik > It needs a proper explanation why that's ok rather than an empirical > test only. start_kernel() ... setup_arch() .... acpi stuff parses MADT and sets bits in possible map ... arch_cpu_finalize_init() ... calls mce_gen_pool_init() -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 19:41 ` Luck, Tony @ 2024-02-12 21:37 ` Tony Luck 2024-02-12 22:08 ` Borislav Petkov 0 siblings, 1 reply; 68+ messages in thread From: Tony Luck @ 2024-02-12 21:37 UTC (permalink / raw) To: Borislav Petkov Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Mon, Feb 12, 2024 at 07:41:03PM +0000, Luck, Tony wrote: > > It needs a proper explanation why that's ok rather than an empirical > > test only. > > start_kernel() > ... setup_arch() > .... acpi stuff parses MADT and sets bits in possible map > > ... arch_cpu_finalize_init() > ... calls mce_gen_pool_init() This made me question the "we don't have an allocator in mce_gen_pool_init()". Because if we got through all the ACPI stuff, we surely have an allocator. Below patch doesn't explode at runtime. -Tony diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c index fbe8b61c3413..81de877f2a51 100644 --- a/arch/x86/kernel/cpu/mce/genpool.c +++ b/arch/x86/kernel/cpu/mce/genpool.c @@ -16,14 +16,12 @@ * used to save error information organized in a lock-less list. * * This memory pool is only to be used to save MCE records in MCE context. - * MCE events are rare, so a fixed size memory pool should be enough. Use - * 2 pages to save MCE events for now (~80 MCE records at most). + * MCE events are rare, so a fixed size memory pool should be enough. + * Allocate on a sliding scale based on number of CPUs. */ -#define MCE_POOLSZ (2 * PAGE_SIZE) static struct gen_pool *mce_evt_pool; static LLIST_HEAD(mce_event_llist); -static char gen_pool_buf[MCE_POOLSZ]; /* * Compare the record "t" with each of the records on list "l" to see if @@ -118,14 +116,23 @@ int mce_gen_pool_add(struct mce *mce) static int mce_gen_pool_create(void) { + int mce_numrecords, mce_poolsz; struct gen_pool *tmpp; int ret = -ENOMEM; + void *mce_pool; tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); if (!tmpp) goto out; - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1); + mce_numrecords = max(80, num_possible_cpus() * 8); + mce_poolsz = mce_numrecords * ilog2(sizeof(struct mce_evt_llist)); + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL); + if (!mce_pool) { + gen_pool_destroy(tmpp); + goto out; + } + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1); if (ret) { gen_pool_destroy(tmpp); goto out; -- 2.43.0 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 21:37 ` Tony Luck @ 2024-02-12 22:08 ` Borislav Petkov 2024-02-12 22:19 ` Borislav Petkov 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 22:08 UTC (permalink / raw) To: Tony Luck Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Mon, Feb 12, 2024 at 01:37:09PM -0800, Tony Luck wrote: > This made me question the "we don't have an allocator in > mce_gen_pool_init()". Because if we got through all the > ACPI stuff, we surely have an allocator. > > Below patch doesn't explode at runtime. Yap, I see it. And I can't dig out why it didn't use to work and we had to do it this way. The earliest thing I found is: https://lore.kernel.org/all/1432150538-3120-2-git-send-email-gong.chen@linux.intel.com/T/#mf673ed669ee0d4c27b75bd48450c13c01cbb2cbf I'll have to dig into my archives tomorrow, on a clear head... Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 22:08 ` Borislav Petkov @ 2024-02-12 22:19 ` Borislav Petkov 2024-02-12 22:42 ` Borislav Petkov 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 22:19 UTC (permalink / raw) To: Tony Luck Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote: > I'll have to dig into my archives tomorrow, on a clear head... So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is 4.2-something. And even back then, mcheck_cpu_init() gets called *after* mm_init() which already initializes the allocators. So why did we allocate that buffer statically? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 22:19 ` Borislav Petkov @ 2024-02-12 22:42 ` Borislav Petkov 2024-02-28 23:14 ` [PATCH] x86/mce: Dynamically size space for machine check records Tony Luck 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 22:42 UTC (permalink / raw) To: Tony Luck Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Mon, Feb 12, 2024 at 11:19:13PM +0100, Borislav Petkov wrote: > On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote: > > I'll have to dig into my archives tomorrow, on a clear head... > > So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is > 4.2-something. > > And even back then, mcheck_cpu_init() gets called *after* mm_init() > which already initializes the allocators. So why did we allocate that > buffer statically? Found it in my archives. You should have it too: Date: Thu, 31 Jul 2014 02:51:25 -0400 From: "Chen, Gong" <gong.chen@linux.intel.com> To: tony.luck@intel.com, bp@alien8.de Subject: Re: [RFC PATCH untest v2 1/4] x86, MCE: Provide a lock-less memory pool to save error record Message-ID: <20140731065125.GA5999@gchen.bj.intel.com> and that's not on any ML that's why I can't find it on lore... There's this fragment from Chen: -------- > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c > index bb92f38..a1b6841 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce.c > +++ b/arch/x86/kernel/cpu/mcheck/mce.c > @@ -2023,6 +2023,9 @@ int __init mcheck_init(void) > { > mcheck_intel_therm_init(); > > + if (!mce_genpool_init()) > + mca_cfg.disabled = true; > + when setup_arch is called, memory subsystem hasn't been initialized, which means I can't use regular page allocation function. So I still need to put genpool init in mcheck_cpu_init. -------- And that is still the case - mcheck_init() gets called in setup_arch() and thus before before mm_init() which is called mm_core_init() now. And on that same thread we agree that we should allocate it statically but then the call to mce_gen_pool_init() ended up in mcheck_cpu_init() which happens *after* mm_init(). What a big fscking facepalm. :-\ -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-12 22:42 ` Borislav Petkov @ 2024-02-28 23:14 ` Tony Luck 2024-02-29 0:39 ` Sohil Mehta ` (3 more replies) 0 siblings, 4 replies; 68+ messages in thread From: Tony Luck @ 2024-02-28 23:14 UTC (permalink / raw) To: Borislav Petkov Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik Systems with a large number of CPUs may generate a large number of machine check records when things go seriously wrong. But Linux has a fixed buffer that can only capture a few dozen errors. Allocate space based on the number of CPUs (with a minimum value based on the historical fixed buffer that could store 80 records). Signed-off-by: Tony Luck <tony.luck@intel.com> --- Discussion earlier concluded with the realization that it is safe to dynamically allocate the mce_evt_pool at boot time. So here's a patch to do that. Scaling algorithm here is a simple linear "4 records per possible CPU" with a minimum of 80 to match the legacy behavior. I'm open to other suggestions. Note that I threw in a "+1" to the return from ilog2() when calling gen_pool_create(). From reading code, and running some tests, it appears that the min_alloc_order argument needs to be large enough to allocate one of the mce_evt_llist structures. Some other gen_pool users in Linux may also need this "+1". arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c index fbe8b61c3413..a1f0a8f29cf5 100644 --- a/arch/x86/kernel/cpu/mce/genpool.c +++ b/arch/x86/kernel/cpu/mce/genpool.c @@ -16,14 +16,13 @@ * used to save error information organized in a lock-less list. * * This memory pool is only to be used to save MCE records in MCE context. - * MCE events are rare, so a fixed size memory pool should be enough. Use - * 2 pages to save MCE events for now (~80 MCE records at most). + * MCE events are rare, so a fixed size memory pool should be enough. + * Allocate on a sliding scale based on number of CPUs. */ -#define MCE_POOLSZ (2 * PAGE_SIZE) +#define MCE_MIN_ENTRIES 80 static struct gen_pool *mce_evt_pool; static LLIST_HEAD(mce_event_llist); -static char gen_pool_buf[MCE_POOLSZ]; /* * Compare the record "t" with each of the records on list "l" to see if @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce) static int mce_gen_pool_create(void) { + int mce_numrecords, mce_poolsz; struct gen_pool *tmpp; int ret = -ENOMEM; + void *mce_pool; + int order; - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); + order = ilog2(sizeof(struct mce_evt_llist)) + 1; + tmpp = gen_pool_create(order, -1); if (!tmpp) goto out; - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1); + mce_numrecords = max(80, num_possible_cpus() * 4); + mce_poolsz = mce_numrecords * (1 << order); + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL); + if (!mce_pool) { + gen_pool_destroy(tmpp); + goto out; + } + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1); if (ret) { gen_pool_destroy(tmpp); goto out; -- 2.43.0 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-28 23:14 ` [PATCH] x86/mce: Dynamically size space for machine check records Tony Luck @ 2024-02-29 0:39 ` Sohil Mehta 2024-02-29 0:44 ` Luck, Tony 2024-02-29 1:56 ` Sohil Mehta ` (2 subsequent siblings) 3 siblings, 1 reply; 68+ messages in thread From: Sohil Mehta @ 2024-02-29 0:39 UTC (permalink / raw) To: Tony Luck, Borislav Petkov Cc: Naik, Avadhut, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On 2/28/2024 3:14 PM, Tony Luck wrote: > static int mce_gen_pool_create(void) > { > + int mce_numrecords, mce_poolsz; > struct gen_pool *tmpp; > int ret = -ENOMEM; > + void *mce_pool; > + int order; > > - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); > + order = ilog2(sizeof(struct mce_evt_llist)) + 1; > + tmpp = gen_pool_create(order, -1); > if (!tmpp) > goto out; > > - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1); > + mce_numrecords = max(80, num_possible_cpus() * 4); > + mce_poolsz = mce_numrecords * (1 << order); > + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL); > + if (!mce_pool) { > + gen_pool_destroy(tmpp); > + goto out; > + } > + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1); > if (ret) { > gen_pool_destroy(tmpp); Is this missing a kfree(mce_pool) here? > goto out; ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 0:39 ` Sohil Mehta @ 2024-02-29 0:44 ` Luck, Tony 0 siblings, 0 replies; 68+ messages in thread From: Luck, Tony @ 2024-02-29 0:44 UTC (permalink / raw) To: Mehta, Sohil, Borislav Petkov Cc: Naik, Avadhut, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik >> + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1); >> if (ret) { >> gen_pool_destroy(tmpp); > > Is this missing a kfree(mce_pool) here? > >> goto out; Yes. Will add. Thanks -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-28 23:14 ` [PATCH] x86/mce: Dynamically size space for machine check records Tony Luck 2024-02-29 0:39 ` Sohil Mehta @ 2024-02-29 1:56 ` Sohil Mehta 2024-02-29 15:49 ` Yazen Ghannam 2024-02-29 17:21 ` Tony Luck 2024-02-29 6:42 ` Naik, Avadhut 2024-03-06 21:52 ` Naik, Avadhut 3 siblings, 2 replies; 68+ messages in thread From: Sohil Mehta @ 2024-02-29 1:56 UTC (permalink / raw) To: Tony Luck, Borislav Petkov Cc: Naik, Avadhut, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik A few other nits. On 2/28/2024 3:14 PM, Tony Luck wrote: > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c > index fbe8b61c3413..a1f0a8f29cf5 100644 > --- a/arch/x86/kernel/cpu/mce/genpool.c > +++ b/arch/x86/kernel/cpu/mce/genpool.c > @@ -16,14 +16,13 @@ > * used to save error information organized in a lock-less list. > * > * This memory pool is only to be used to save MCE records in MCE context. > - * MCE events are rare, so a fixed size memory pool should be enough. Use > - * 2 pages to save MCE events for now (~80 MCE records at most). > + * MCE events are rare, so a fixed size memory pool should be enough. > + * Allocate on a sliding scale based on number of CPUs. > */ > -#define MCE_POOLSZ (2 * PAGE_SIZE) > +#define MCE_MIN_ENTRIES 80 > > static struct gen_pool *mce_evt_pool; > static LLIST_HEAD(mce_event_llist); > -static char gen_pool_buf[MCE_POOLSZ]; > > /* > * Compare the record "t" with each of the records on list "l" to see if > @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce) > > static int mce_gen_pool_create(void) > { > + int mce_numrecords, mce_poolsz; Should order be also declared in this line? That way we can have all the uninitialized 'int's together. > struct gen_pool *tmpp; > int ret = -ENOMEM; > + void *mce_pool; > + int order; > > - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); > + order = ilog2(sizeof(struct mce_evt_llist)) + 1; I didn't exactly understand why a +1 is needed here. Do you have a pointer to somewhere to help understand this? Also, I think, a comment on top might be useful since this isn't obvious. > + tmpp = gen_pool_create(order, -1); > if (!tmpp) > goto out; > > - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1); > + mce_numrecords = max(80, num_possible_cpus() * 4); How about using MCE_MIN_ENTRIES here? > + mce_poolsz = mce_numrecords * (1 << order); > + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL); > + if (!mce_pool) { > + gen_pool_destroy(tmpp); > + goto out; > + } > + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1); > if (ret) { > gen_pool_destroy(tmpp); > goto out; ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 1:56 ` Sohil Mehta @ 2024-02-29 15:49 ` Yazen Ghannam 2024-02-29 17:22 ` Tony Luck 2024-02-29 17:21 ` Tony Luck 1 sibling, 1 reply; 68+ messages in thread From: Yazen Ghannam @ 2024-02-29 15:49 UTC (permalink / raw) To: Sohil Mehta, Tony Luck, Borislav Petkov Cc: yazen.ghannam, Naik, Avadhut, x86, linux-edac, linux-kernel, Avadhut Naik On 2/28/2024 8:56 PM, Sohil Mehta wrote: > A few other nits. > > On 2/28/2024 3:14 PM, Tony Luck wrote: >> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c >> index fbe8b61c3413..a1f0a8f29cf5 100644 >> --- a/arch/x86/kernel/cpu/mce/genpool.c >> +++ b/arch/x86/kernel/cpu/mce/genpool.c >> @@ -16,14 +16,13 @@ >> * used to save error information organized in a lock-less list. >> * >> * This memory pool is only to be used to save MCE records in MCE context. >> - * MCE events are rare, so a fixed size memory pool should be enough. Use >> - * 2 pages to save MCE events for now (~80 MCE records at most). >> + * MCE events are rare, so a fixed size memory pool should be enough. >> + * Allocate on a sliding scale based on number of CPUs. >> */ >> -#define MCE_POOLSZ (2 * PAGE_SIZE) >> +#define MCE_MIN_ENTRIES 80 >> >> static struct gen_pool *mce_evt_pool; >> static LLIST_HEAD(mce_event_llist); >> -static char gen_pool_buf[MCE_POOLSZ]; >> >> /* >> * Compare the record "t" with each of the records on list "l" to see if >> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce) >> >> static int mce_gen_pool_create(void) >> { >> + int mce_numrecords, mce_poolsz; > > Should order be also declared in this line? That way we can have all the > uninitialized 'int's together. > >> struct gen_pool *tmpp; >> int ret = -ENOMEM; >> + void *mce_pool; >> + int order; >> >> - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); >> + order = ilog2(sizeof(struct mce_evt_llist)) + 1; > > I didn't exactly understand why a +1 is needed here. Do you have a > pointer to somewhere to help understand this? > > Also, I think, a comment on top might be useful since this isn't obvious. > Would order_base_2() work here? It automatically rounds up to the next power. Thanks, Yazen ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 15:49 ` Yazen Ghannam @ 2024-02-29 17:22 ` Tony Luck 0 siblings, 0 replies; 68+ messages in thread From: Tony Luck @ 2024-02-29 17:22 UTC (permalink / raw) To: Yazen Ghannam Cc: Sohil Mehta, Borislav Petkov, Naik, Avadhut, x86, linux-edac, linux-kernel, Avadhut Naik On Thu, Feb 29, 2024 at 10:49:18AM -0500, Yazen Ghannam wrote: > On 2/28/2024 8:56 PM, Sohil Mehta wrote: > > > + order = ilog2(sizeof(struct mce_evt_llist)) + 1; > > > > I didn't exactly understand why a +1 is needed here. Do you have a > > pointer to somewhere to help understand this? > > > > Also, I think, a comment on top might be useful since this isn't obvious. > > > > Would order_base_2() work here? It automatically rounds up to the next power. Yes! That's exactly what is needed here. Thanks! -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 1:56 ` Sohil Mehta 2024-02-29 15:49 ` Yazen Ghannam @ 2024-02-29 17:21 ` Tony Luck 2024-02-29 23:56 ` Sohil Mehta 1 sibling, 1 reply; 68+ messages in thread From: Tony Luck @ 2024-02-29 17:21 UTC (permalink / raw) To: Sohil Mehta Cc: Borislav Petkov, Naik, Avadhut, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Wed, Feb 28, 2024 at 05:56:26PM -0800, Sohil Mehta wrote: > A few other nits. > > On 2/28/2024 3:14 PM, Tony Luck wrote: > > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c > > index fbe8b61c3413..a1f0a8f29cf5 100644 > > --- a/arch/x86/kernel/cpu/mce/genpool.c > > +++ b/arch/x86/kernel/cpu/mce/genpool.c > > @@ -16,14 +16,13 @@ > > * used to save error information organized in a lock-less list. > > * > > * This memory pool is only to be used to save MCE records in MCE context. > > - * MCE events are rare, so a fixed size memory pool should be enough. Use > > - * 2 pages to save MCE events for now (~80 MCE records at most). > > + * MCE events are rare, so a fixed size memory pool should be enough. > > + * Allocate on a sliding scale based on number of CPUs. > > */ > > -#define MCE_POOLSZ (2 * PAGE_SIZE) > > +#define MCE_MIN_ENTRIES 80 > > > > static struct gen_pool *mce_evt_pool; > > static LLIST_HEAD(mce_event_llist); > > -static char gen_pool_buf[MCE_POOLSZ]; > > > > /* > > * Compare the record "t" with each of the records on list "l" to see if > > @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce) > > > > static int mce_gen_pool_create(void) > > { > > + int mce_numrecords, mce_poolsz; > > Should order be also declared in this line? That way we can have all the > uninitialized 'int's together. Sure. Even with the addition of "order" the line is still short enough. > > struct gen_pool *tmpp; > > int ret = -ENOMEM; > > + void *mce_pool; > > + int order; > > > > - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); > > + order = ilog2(sizeof(struct mce_evt_llist)) + 1; > > I didn't exactly understand why a +1 is needed here. Do you have a > pointer to somewhere to help understand this? > > Also, I think, a comment on top might be useful since this isn't obvious. gen_pool works in power-of-two blocks. The user must pick a specific size with the gen_pool_create() call. Internally the gen_pool code uses a bitmap to track which blocks in the pool are allocated/free. Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So the original version of this code picks order 7 to allocate in 128 byte units. But this means that every allocation of a mce_evt_llist will take two 128-byte blocks. Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c that two pages are enough for ~80 records was wrong when written. At that point struct mce_evt_llist was below 128, so order was 6, and each allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64) results in 64 possible allocations. But over time Intel and AMD added to the structure. So the current math comes out at just 32 allocations before the pool is out of space. Yazen provided the right answer for this. Change to use order_base_2() > > + tmpp = gen_pool_create(order, -1); > > if (!tmpp) > > goto out; > > > > - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1); > > + mce_numrecords = max(80, num_possible_cpus() * 4); > > How about using MCE_MIN_ENTRIES here? Oops. I meant to do that when I added the #define! I've also added a "#define MCE_PER_CPU 4" instead of a raw "4" in that expression. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 17:21 ` Tony Luck @ 2024-02-29 23:56 ` Sohil Mehta 0 siblings, 0 replies; 68+ messages in thread From: Sohil Mehta @ 2024-02-29 23:56 UTC (permalink / raw) To: Tony Luck Cc: Borislav Petkov, Naik, Avadhut, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On 2/29/2024 9:21 AM, Tony Luck wrote: > Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So > the original version of this code picks order 7 to allocate in 128 byte > units. But this means that every allocation of a mce_evt_llist will take > two 128-byte blocks. > > Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c > that two pages are enough for ~80 records was wrong when written. At > that point struct mce_evt_llist was below 128, so order was 6, and each > allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64) > results in 64 possible allocations. > Thanks for the explanation. The part that got me is that I somehow expected ilog2() to round-up and not round-down. > But over time Intel and AMD added to the structure. So the current math > comes out at just 32 allocations before the pool is out of space. > > Yazen provided the right answer for this. Change to use order_base_2() > Yes, I agree. order_base_2() is better than doing ilog2(struct size) + 1. In the rare scenario of the size exactly being a power of 2 we don't need to add the +1. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-28 23:14 ` [PATCH] x86/mce: Dynamically size space for machine check records Tony Luck 2024-02-29 0:39 ` Sohil Mehta 2024-02-29 1:56 ` Sohil Mehta @ 2024-02-29 6:42 ` Naik, Avadhut 2024-02-29 8:39 ` Borislav Petkov 2024-02-29 17:26 ` Tony Luck 2024-03-06 21:52 ` Naik, Avadhut 3 siblings, 2 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-29 6:42 UTC (permalink / raw) To: Tony Luck, Borislav Petkov Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/28/2024 17:14, Tony Luck wrote: > Systems with a large number of CPUs may generate a large > number of machine check records when things go seriously > wrong. But Linux has a fixed buffer that can only capture > a few dozen errors. > > Allocate space based on the number of CPUs (with a minimum > value based on the historical fixed buffer that could store > 80 records). > > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > > Discussion earlier concluded with the realization that it is > safe to dynamically allocate the mce_evt_pool at boot time. > So here's a patch to do that. Scaling algorithm here is a > simple linear "4 records per possible CPU" with a minimum > of 80 to match the legacy behavior. I'm open to other > suggestions. > > Note that I threw in a "+1" to the return from ilog2() when > calling gen_pool_create(). From reading code, and running > some tests, it appears that the min_alloc_order argument > needs to be large enough to allocate one of the mce_evt_llist > structures. > > Some other gen_pool users in Linux may also need this "+1". > Somewhat confused here. Weren't we also exploring ways to avoid duplicate records from being added to the genpool? Has something changed in that regard? > arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------ > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c > index fbe8b61c3413..a1f0a8f29cf5 100644 > --- a/arch/x86/kernel/cpu/mce/genpool.c > +++ b/arch/x86/kernel/cpu/mce/genpool.c > @@ -16,14 +16,13 @@ > * used to save error information organized in a lock-less list. > * > * This memory pool is only to be used to save MCE records in MCE context. > - * MCE events are rare, so a fixed size memory pool should be enough. Use > - * 2 pages to save MCE events for now (~80 MCE records at most). > + * MCE events are rare, so a fixed size memory pool should be enough. > + * Allocate on a sliding scale based on number of CPUs. > */ > -#define MCE_POOLSZ (2 * PAGE_SIZE) > +#define MCE_MIN_ENTRIES 80 > > static struct gen_pool *mce_evt_pool; > static LLIST_HEAD(mce_event_llist); > -static char gen_pool_buf[MCE_POOLSZ]; > > /* > * Compare the record "t" with each of the records on list "l" to see if > @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce) > > static int mce_gen_pool_create(void) > { > + int mce_numrecords, mce_poolsz; > struct gen_pool *tmpp; > int ret = -ENOMEM; > + void *mce_pool; > + int order; > > - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); > + order = ilog2(sizeof(struct mce_evt_llist)) + 1; > + tmpp = gen_pool_create(order, -1); > if (!tmpp) > goto out; > > - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1); > + mce_numrecords = max(80, num_possible_cpus() * 4); > + mce_poolsz = mce_numrecords * (1 << order); > + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL); To err on the side of caution, wouldn't kzalloc() be a safer choice here? -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 6:42 ` Naik, Avadhut @ 2024-02-29 8:39 ` Borislav Petkov 2024-02-29 17:47 ` Tony Luck 2024-02-29 17:26 ` Tony Luck 1 sibling, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-29 8:39 UTC (permalink / raw) To: Naik, Avadhut Cc: Tony Luck, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote: > Somewhat confused here. Weren't we also exploring ways to avoid > duplicate records from being added to the genpool? Has something > changed in that regard? You can always send patches proposing how *you* think this duplicate elimination should look like and we can talk. :) I don't think anyone would mind it if done properly but first you'd need a real-life use case. As in, do we log sooo many duplicates such that we'd want to dedup? Hmmm. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 8:39 ` Borislav Petkov @ 2024-02-29 17:47 ` Tony Luck 2024-02-29 18:28 ` Naik, Avadhut 0 siblings, 1 reply; 68+ messages in thread From: Tony Luck @ 2024-02-29 17:47 UTC (permalink / raw) To: Borislav Petkov Cc: Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote: > On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote: > > Somewhat confused here. Weren't we also exploring ways to avoid > > duplicate records from being added to the genpool? Has something > > changed in that regard? > > You can always send patches proposing how *you* think this duplicate > elimination should look like and we can talk. :) > > I don't think anyone would mind it if done properly but first you'd need > a real-life use case. As in, do we log sooo many duplicates such that > we'd want to dedup? There are definitly cases where dedup will not help. If a row fails in a DIMM there will be a flood of correctable errors with different addresses (depending on number of channels in the interleave schema for a system this may be dozens or hundreds of distinct addresses). Same for other failures in structures like column and rank. As to "real-life" use cases. A search on Lore for "MCE records pool full!" only finds threads about modifications to this code. So the general population of Linux developers isn't seeing this. But a search in my internal e-mail box kicks up a dozen or so distinct hits from internal validation teams in just the past year. But those folks are super-dedicated to finding corner cases. Just this morning I got a triumphant e-mail from someone who reproduced an issue "after 1.6 million error injections". I'd bet that large cloud providers with system numbered in the hundreds of thousands see the MCE pool exhausted from time to time. Open question is "How many error records do you need to diagnose the cause of various floods of errors?" I'm going to say that the current 32 (see earlier e-mail today to Sohil https://lore.kernel.org/all/ZeC8_jzdFnkpPVPf@agluck-desk3/ ) isn't enough for big systems. It may be hard to distinguish between the various bulk fail modes with just that many error logs. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 17:47 ` Tony Luck @ 2024-02-29 18:28 ` Naik, Avadhut 2024-02-29 18:38 ` Luck, Tony 0 siblings, 1 reply; 68+ messages in thread From: Naik, Avadhut @ 2024-02-29 18:28 UTC (permalink / raw) To: Tony Luck, Borislav Petkov Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On 2/29/2024 11:47, Tony Luck wrote: > On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote: >> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote: >>> Somewhat confused here. Weren't we also exploring ways to avoid >>> duplicate records from being added to the genpool? Has something >>> changed in that regard? >> >> You can always send patches proposing how *you* think this duplicate >> elimination should look like and we can talk. :) >> >> I don't think anyone would mind it if done properly but first you'd need >> a real-life use case. As in, do we log sooo many duplicates such that >> we'd want to dedup? > > There are definitly cases where dedup will not help. If a row fails in a > DIMM there will be a flood of correctable errors with different addresses > (depending on number of channels in the interleave schema for a system > this may be dozens or hundreds of distinct addresses). > > Same for other failures in structures like column and rank. > Wouldn't having dedup actually increase the time we spend #MC context? Comparing the new MCE record against each existing record in the genpool. AFAIK, MCEs cannot be nested. Correct me if I am wrong here. In a flood situation, like the one described above, that is exactly what may happen: An MCE coming in while the dedup mechanism is underway (in #MC context). -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 18:28 ` Naik, Avadhut @ 2024-02-29 18:38 ` Luck, Tony 0 siblings, 0 replies; 68+ messages in thread From: Luck, Tony @ 2024-02-29 18:38 UTC (permalink / raw) To: Naik, Avadhut, Borislav Petkov Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik > Wouldn't having dedup actually increase the time we spend #MC context? > Comparing the new MCE record against each existing record in the > genpool. Yes, dedup would take extra time (increasing linearly with the number of pending errors that were not filtered out by the dedup process). > AFAIK, MCEs cannot be nested. Correct me if I am wrong here. Can't be nested on the same CPU. But multiple CPUs may take a local machine check simultaneously. Local machine check is opt-in on Intel, I believe it is default on AMD. Errors can also be signaled with CMCI. > In a flood situation, like the one described above, that is exactly > what may happen: An MCE coming in while the dedup mechanism is > underway (in #MC context). In a flood of errors it would be complicated to synchronize dedup filtering on multiple CPUs. The trade-off between trying to get that code right, and just allocating a few extra Kbytes of memory would seem to favor allocating more memory. -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-29 6:42 ` Naik, Avadhut 2024-02-29 8:39 ` Borislav Petkov @ 2024-02-29 17:26 ` Tony Luck 1 sibling, 0 replies; 68+ messages in thread From: Tony Luck @ 2024-02-29 17:26 UTC (permalink / raw) To: Naik, Avadhut Cc: Borislav Petkov, Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote: > Hi, > > On 2/28/2024 17:14, Tony Luck wrote: > > Systems with a large number of CPUs may generate a large > > number of machine check records when things go seriously > > wrong. But Linux has a fixed buffer that can only capture > > a few dozen errors. > > > > Allocate space based on the number of CPUs (with a minimum > > value based on the historical fixed buffer that could store > > 80 records). > > > > Signed-off-by: Tony Luck <tony.luck@intel.com> > > --- > > > > Discussion earlier concluded with the realization that it is > > safe to dynamically allocate the mce_evt_pool at boot time. > > So here's a patch to do that. Scaling algorithm here is a > > simple linear "4 records per possible CPU" with a minimum > > of 80 to match the legacy behavior. I'm open to other > > suggestions. > > > > Note that I threw in a "+1" to the return from ilog2() when > > calling gen_pool_create(). From reading code, and running > > some tests, it appears that the min_alloc_order argument > > needs to be large enough to allocate one of the mce_evt_llist > > structures. > > > > Some other gen_pool users in Linux may also need this "+1". > > > > Somewhat confused here. Weren't we also exploring ways to avoid > duplicate records from being added to the genpool? Has something > changed in that regard? I'm going to cover this in the reply to Boris. > > + mce_numrecords = max(80, num_possible_cpus() * 4); > > + mce_poolsz = mce_numrecords * (1 << order); > > + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL); > > To err on the side of caution, wouldn't kzalloc() be a safer choice here? Seems like too much caution. When mce_gen_pool_add() allocates an entry from the pool it does: memcpy(&node->mce, mce, sizeof(*mce)); llist_add(&node->llnode, &mce_event_llist); between those two lines, every field in the struct mce_evt_llist is written (including any holes in the struct mce part of the structure). -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH] x86/mce: Dynamically size space for machine check records 2024-02-28 23:14 ` [PATCH] x86/mce: Dynamically size space for machine check records Tony Luck ` (2 preceding siblings ...) 2024-02-29 6:42 ` Naik, Avadhut @ 2024-03-06 21:52 ` Naik, Avadhut 2024-03-06 22:07 ` Luck, Tony 3 siblings, 1 reply; 68+ messages in thread From: Naik, Avadhut @ 2024-03-06 21:52 UTC (permalink / raw) To: Tony Luck, Borislav Petkov Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/28/2024 17:14, Tony Luck wrote: > Systems with a large number of CPUs may generate a large > number of machine check records when things go seriously > wrong. But Linux has a fixed buffer that can only capture > a few dozen errors. > > Allocate space based on the number of CPUs (with a minimum > value based on the historical fixed buffer that could store > 80 records). > > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > > Discussion earlier concluded with the realization that it is > safe to dynamically allocate the mce_evt_pool at boot time. > So here's a patch to do that. Scaling algorithm here is a > simple linear "4 records per possible CPU" with a minimum > of 80 to match the legacy behavior. I'm open to other > suggestions. > > Note that I threw in a "+1" to the return from ilog2() when > calling gen_pool_create(). From reading code, and running > some tests, it appears that the min_alloc_order argument > needs to be large enough to allocate one of the mce_evt_llist > structures. > > Some other gen_pool users in Linux may also need this "+1". > > arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------ > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c > index fbe8b61c3413..a1f0a8f29cf5 100644 > --- a/arch/x86/kernel/cpu/mce/genpool.c > +++ b/arch/x86/kernel/cpu/mce/genpool.c > @@ -16,14 +16,13 @@ > * used to save error information organized in a lock-less list. > * > * This memory pool is only to be used to save MCE records in MCE context. > - * MCE events are rare, so a fixed size memory pool should be enough. Use > - * 2 pages to save MCE events for now (~80 MCE records at most). > + * MCE events are rare, so a fixed size memory pool should be enough. > + * Allocate on a sliding scale based on number of CPUs. > */ > -#define MCE_POOLSZ (2 * PAGE_SIZE) > +#define MCE_MIN_ENTRIES 80 > > static struct gen_pool *mce_evt_pool; > static LLIST_HEAD(mce_event_llist); > -static char gen_pool_buf[MCE_POOLSZ]; > > /* > * Compare the record "t" with each of the records on list "l" to see if > @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce) > > static int mce_gen_pool_create(void) > { > + int mce_numrecords, mce_poolsz; > struct gen_pool *tmpp; > int ret = -ENOMEM; > + void *mce_pool; > + int order; > > - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1); > + order = ilog2(sizeof(struct mce_evt_llist)) + 1; > + tmpp = gen_pool_create(order, -1); > if (!tmpp) > goto out; > > - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1); > + mce_numrecords = max(80, num_possible_cpus() * 4); Per Boris's below suggestion, shouldn't this be: mce_numrecords = max(80, num_possible_cpus() * 16); >> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE); > > max() ofc. > >> There's a sane minimum and one page pro logical CPU should be fine on >> pretty much every configuration... 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic behavior you explained in the other subthread. Apart from this, tested the patch on a couple of AMD systems. Didn't observe any issues. > + mce_poolsz = mce_numrecords * (1 << order); > + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL); > + if (!mce_pool) { > + gen_pool_destroy(tmpp); > + goto out; > + } > + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1); > if (ret) { > gen_pool_destroy(tmpp); > goto out; -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH] x86/mce: Dynamically size space for machine check records 2024-03-06 21:52 ` Naik, Avadhut @ 2024-03-06 22:07 ` Luck, Tony 2024-03-06 23:21 ` Naik, Avadhut 0 siblings, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-03-06 22:07 UTC (permalink / raw) To: Naik, Avadhut, Borislav Petkov Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik > > + mce_numrecords = max(80, num_possible_cpus() * 4); > > Per Boris's below suggestion, shouldn't this be: > mce_numrecords = max(80, num_possible_cpus() * 16); > > >> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE); > > > > max() ofc. > > > >> There's a sane minimum and one page pro logical CPU should be fine on > >> pretty much every configuration... > > 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic > behavior you explained in the other subthread. Picking a good number of records-per-core may be more art than science. Boris is right that a page per CPU shouldn't cause any significant issue to systems with many CPUs, because they should have copious amounts of memory to make a balanced configuration. But 16 records per CPU feels way too high to me. The theoretical limit in a single scan of machine check banks on Intel is 32 (since Intel never has more than 32 banks). But those banks cover diverse h/w devices and it seems improbable that all, or even most, of them would log errors at the same time, with all CPUs on all sockets doing the same. After I posted the version with num_possible_cpus() * 4 I began to wonder whether "2" would be enough. > Apart from this, tested the patch on a couple of AMD systems. Didn't observe any > issues. Thanks very much for testing. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH] x86/mce: Dynamically size space for machine check records 2024-03-06 22:07 ` Luck, Tony @ 2024-03-06 23:21 ` Naik, Avadhut 0 siblings, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-03-06 23:21 UTC (permalink / raw) To: Luck, Tony, Borislav Petkov Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik On 3/6/2024 16:07, Luck, Tony wrote: >>> + mce_numrecords = max(80, num_possible_cpus() * 4); >> >> Per Boris's below suggestion, shouldn't this be: >> mce_numrecords = max(80, num_possible_cpus() * 16); >> >>>> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE); >>> >>> max() ofc. >>> >>>> There's a sane minimum and one page pro logical CPU should be fine on >>>> pretty much every configuration... >> >> 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic >> behavior you explained in the other subthread. > > Picking a good number of records-per-core may be more art than science. Boris > is right that a page per CPU shouldn't cause any significant issue to systems with > many CPUs, because they should have copious amounts of memory to make a > balanced configuration. But 16 records per CPU feels way too high to me. The > theoretical limit in a single scan of machine check banks on Intel is 32 (since > Intel never has more than 32 banks). But those banks cover diverse h/w devices > and it seems improbable that all, or even most, of them would log errors at the > same time, with all CPUs on all sockets doing the same. > > After I posted the version with num_possible_cpus() * 4 I began to wonder whether > "2" would be enough. > Was thinking along the same lines that 16 MCE records per thread might be too high. But since Boris made the suggestion, I thought there might be a use case that I am unaware of. Perhaps, some issue that had been debugged in the past. Hence, my earlier question if it should be 16 instead of 4. I think 2 records should also be good. IIRC, the patch that I submitted reserved space of 2 records per logical CPU in the genpool. >> Apart from this, tested the patch on a couple of AMD systems. Didn't observe any >> issues. > > Thanks very much for testing. > > -Tony -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 17:54 ` Borislav Petkov 2024-02-12 18:45 ` Luck, Tony @ 2024-02-15 20:18 ` Naik, Avadhut 1 sibling, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-15 20:18 UTC (permalink / raw) To: Borislav Petkov, Luck, Tony Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/12/2024 11:54, Borislav Petkov wrote: > On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote: >> Walking the structures already allocated from the genpool in the #MC >> handler may be possible, but what is the criteria for "duplicates"? > > for each i in pool: > memcmp(mce[i], new_mce, sizeof(struct mce)); > > It'll probably need to mask out fields like ->time etc. > Also, some fields like cpuvendor, ppin, microcode will remain same for all MCEs received on a system. Can we avoid comparing them? We already seem to have a function mce_cmp(), introduced back in 2016, which accomplishes something similar for fatal errors. But it only checks for bank, status, addr and misc registers. Should we just modify this function to compare MCEs? It should work for fatal errors too. For my own understanding: The general motto for #MC or interrupt contexts is *keep it short and sweet*. Though memcmp() is fairly optimized, we would still be running a *for* loop in MC context. In case successive back-to-back MCEs are being received and if the pool already has a fair number of records, wouldn't this comparison significantly extend our stay in #MC context? Had discussed this with Yazen, IIUC, nested MCEs are not supported on x86. Please correct me if I am wrong in this. -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 9:32 ` Borislav Petkov 2024-02-12 17:29 ` Luck, Tony @ 2024-02-15 20:15 ` Naik, Avadhut 1 sibling, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-15 20:15 UTC (permalink / raw) To: Borislav Petkov Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On 2/12/2024 03:32, Borislav Petkov wrote: > On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote: >> * Can you think of a slick deduplication scheme instead of blindly >> raising the buffer size? > > And here's the simplest scheme: you don't extend the buffer. On > overflow, you say "Buffer full, here's the MCE" and you dump the error > long into dmesg. Problem solved. > Wouldn't this require logging in the #MC context? We would know that the genpool is full in mce_gen_pool_add() which, I think is executed in #MC context. > A slicker deduplication scheme would be even better, tho. Maybe struct > mce.count which gets incremented instead of adding the error record to > the buffer again. And so on... > -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 8:58 ` Borislav Petkov 2024-02-12 9:32 ` Borislav Petkov @ 2024-02-15 20:14 ` Naik, Avadhut 1 sibling, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-15 20:14 UTC (permalink / raw) To: Borislav Petkov Cc: Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik Hi, On 2/12/2024 02:58, Borislav Petkov wrote: > On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote: >> Okay. Will make changes to allocate memory and set size of the pool >> when it is created. Also, will remove the command line parameter and >> resubmit. > > Before you do, go read that original thread again but this time take > your time to grok it. > > And then try answering those questions: > > * Why are *you* fixing this? I know what the AWS reason is, what is > yours? > I think this issue of genpool getting full with MCE records can occur on AMD system too since the pool doesn't scale up with the number of CPUs and memory in the system. The probability of issue occurrence only increases as CPU count and memory increases. Feel that the genpool size should be proportional to, at least, the CPU count of the system. > * Can you think of a slick deduplication scheme instead of blindly > raising the buffer size? > > * What's wrong with not logging some early errors, can we live with that > too? If it were firmware-first, it cannot simply extend its buffer size > because it has limited space. So what does firmware do in such cases? > Think that we can live with not logging some early errors, as long as they are correctable. Not very sure about what you mean by Firmware First. Do you mean handling of MCEs through HEST and GHES? Or something else? > Think long and hard about the big picture, analyze the problem properly > and from all angles before you go and do patches. > > Thx. > -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-11 11:14 ` Borislav Petkov 2024-02-12 2:54 ` Naik, Avadhut @ 2024-02-12 18:47 ` Yazen Ghannam 2024-02-12 18:58 ` Luck, Tony 1 sibling, 1 reply; 68+ messages in thread From: Yazen Ghannam @ 2024-02-12 18:47 UTC (permalink / raw) To: Borislav Petkov, Naik, Avadhut Cc: yazen.ghannam, Sohil Mehta, x86, linux-edac, tony.luck, linux-kernel, Avadhut Naik On 2/11/2024 6:14 AM, Borislav Petkov wrote: > On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote: >> IIUC, you wouldn't want to extend the pool through late_initcall(). >> Instead, you would want for memory to be allocated (on the heap) and >> size of the pool to be set at the very beginning i.e. when the pool >> is created (~2 seconds, according to dmesg timestamps). >> >> Please correct me if I have understood wrong. > > Nah, you got it right. I went, looked and realized that we have to do > this early dance because we have no allocator yet. And we can't move > this gen_pool allocation to later, when we *do* have an allocator > because MCA is up and logging already. > > But your extending approach doesn't fly in all cases either: > > gen_pool_add->gen_pool_add_virt->gen_pool_add_owner > > it grabs the pool->lock spinlock and adds to &pool->chunks while, at the > exact same time, gen_pool_alloc(), in *NMI* context iterates over that > same &pool->chunks in the case we're logging an MCE at exact that same > time when you're extending the buffer. > > And Tony already said that in the thread you're quoting: > > https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/ > > So no, that doesn't work either. > I'm confused why it won't work. X86 has ARCH_HAVE_NMI_SAFE_CMPXCHG. I expect atomics/caches will still work in interrupt or #MC context. If not, then we'd have a fatal error that causes a hardware reset or a kernel panic before we get to logging, I think. Or is the issue when running on the same CPU? In this case, either &pool->chunks was updated before taking the #MC, so the extra memory is there and can be used. Or it wasn't updated, so the extra memory is not available during the #MC which is the same behavior as now. I need to look more at the genpool code, but I thought I'd ask too. Thanks, Yazen ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 18:47 ` Yazen Ghannam @ 2024-02-12 18:58 ` Luck, Tony 2024-02-12 19:40 ` Naik, Avadhut 2024-02-12 19:43 ` Yazen Ghannam 0 siblings, 2 replies; 68+ messages in thread From: Luck, Tony @ 2024-02-12 18:58 UTC (permalink / raw) To: Yazen Ghannam, Borislav Petkov, Naik, Avadhut Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik > I need to look more at the genpool code, but I thought I'd ask too. Yazen, gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool. This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of the list_add_rcu() spin_lock(&pool->lock); list_add_rcu(&chunk->next_chunk, &pool->chunks); spin_unlock(&pool->lock); -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 18:58 ` Luck, Tony @ 2024-02-12 19:40 ` Naik, Avadhut 2024-02-12 20:18 ` Borislav Petkov 2024-02-12 19:43 ` Yazen Ghannam 1 sibling, 1 reply; 68+ messages in thread From: Naik, Avadhut @ 2024-02-12 19:40 UTC (permalink / raw) To: Luck, Tony, Yazen Ghannam, Borislav Petkov Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik Hi, On 2/12/2024 12:58, Luck, Tony wrote: >> I need to look more at the genpool code, but I thought I'd ask too. > > Yazen, > > gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool. > > This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of > the list_add_rcu() > > spin_lock(&pool->lock); > list_add_rcu(&chunk->next_chunk, &pool->chunks); > spin_unlock(&pool->lock); > Even I am somewhat confused by this. The spinlock is mostly held to prevent other primitives from modifying chunks within the genpool. In #MC context however, we are not modifying the existing chunks, per se. While in the MC context, records will be added to the genpool through gen_pool_alloc() which eventually drops down into gen_pool_alloc_algo_owner(). gen_pool_alloc_algo_owner() iterates over the existing chunks within the genpool through list_for_each_entry_rcu(), within an RCU read-side critical section (rcu_read_lock()). Now, the below description of list_for_each_entry_rcu(): * list_for_each_entry_rcu - iterate over rcu list of given type * @pos: the type * to use as a loop cursor. * @head: the head for your list. * @member: the name of the list_head within the struct. * @cond: optional lockdep expression if called from non-RCU protection. * * This list-traversal primitive may safely run concurrently with * the _rcu list-mutation primitives such as list_add_rcu() * as long as the traversal is guarded by rcu_read_lock(). Makes me wonder if the genpool can be extended and traversed concurrently. OFC, not sure if gen_pool_alloc_algo_owner() being in #MC context makes a difference here or if I am missing something. -- Thanks, Avadhut Naik > -Tony > > > > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 19:40 ` Naik, Avadhut @ 2024-02-12 20:18 ` Borislav Petkov 2024-02-12 20:51 ` Naik, Avadhut 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 20:18 UTC (permalink / raw) To: Naik, Avadhut Cc: Luck, Tony, Yazen Ghannam, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote: > The spinlock is mostly held to prevent other primitives > from modifying chunks within the genpool. > > In #MC context however, we are not modifying the existing > chunks, per se. What if we decide to do mce[i]count++; in #MC context? That's modifying the existing chunks. TBH, I'm not sure what that spinlock is for. See my reply to that same subthread. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 20:18 ` Borislav Petkov @ 2024-02-12 20:51 ` Naik, Avadhut 0 siblings, 0 replies; 68+ messages in thread From: Naik, Avadhut @ 2024-02-12 20:51 UTC (permalink / raw) To: Borislav Petkov Cc: Luck, Tony, Yazen Ghannam, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik Hi, On 2/12/2024 14:18, Borislav Petkov wrote: > On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote: >> The spinlock is mostly held to prevent other primitives >> from modifying chunks within the genpool. >> >> In #MC context however, we are not modifying the existing >> chunks, per se. > > What if we decide to do > > mce[i]count++; > > in #MC context? > > That's modifying the existing chunks. > > TBH, I'm not sure what that spinlock is for. See my reply to that same > subthread. > Yes, noticed your reply. Clears most of my confusion. -- Thanks, Avadhut Naik ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 18:58 ` Luck, Tony 2024-02-12 19:40 ` Naik, Avadhut @ 2024-02-12 19:43 ` Yazen Ghannam 2024-02-12 19:49 ` Luck, Tony 1 sibling, 1 reply; 68+ messages in thread From: Yazen Ghannam @ 2024-02-12 19:43 UTC (permalink / raw) To: Luck, Tony, Borislav Petkov, Naik, Avadhut Cc: yazen.ghannam, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On 2/12/2024 1:58 PM, Luck, Tony wrote: >> I need to look more at the genpool code, but I thought I'd ask too. > > Yazen, > > gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool. > > This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of > the list_add_rcu() > > spin_lock(&pool->lock); > list_add_rcu(&chunk->next_chunk, &pool->chunks); > spin_unlock(&pool->lock); > Thanks Tony. So the concern is not about traversal, but rather that the #MC can break the list_add_rcu(). Is this correct? Thanks, Yazen ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 19:43 ` Yazen Ghannam @ 2024-02-12 19:49 ` Luck, Tony 2024-02-12 20:10 ` Borislav Petkov 0 siblings, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-02-12 19:49 UTC (permalink / raw) To: Yazen Ghannam, Borislav Petkov, Naik, Avadhut, Paul E . McKenney Cc: Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik I said: > spin_lock(&pool->lock); > list_add_rcu(&chunk->next_chunk, &pool->chunks); > spin_unlock(&pool->lock); Avadhut said: > gen_pool_alloc_algo_owner() iterates over the existing chunks > within the genpool through list_for_each_entry_rcu(), within > an RCU read-side critical section (rcu_read_lock()). > So the concern is not about traversal, but rather that the #MC can break the > list_add_rcu(). Is this correct? Yazen, Yes. The question is whether a #MC that come in the middle of list_rcu_add() can safely do list_for_each_entry_rcu() on the same list. RCU is black magic ... maybe it can do this? Adding Paul. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 19:49 ` Luck, Tony @ 2024-02-12 20:10 ` Borislav Petkov 2024-02-12 20:44 ` Paul E. McKenney 0 siblings, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 20:10 UTC (permalink / raw) To: Luck, Tony Cc: Yazen Ghannam, Naik, Avadhut, Paul E . McKenney, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote: > Yes. The question is whether a #MC that come in the middle of list_rcu_add() > can safely do list_for_each_entry_rcu() on the same list. > > RCU is black magic ... maybe it can do this? Adding Paul. Yeah, the list traversal might be ok as this is what that list_add does - you can't encounter an inconsistent list - but we still take a spinlock on addition and the commit which added it: 7f184275aa30 ("lib, Make gen_pool memory allocator lockless") says The lockless operation only works if there is enough memory available. If new memory is added to the pool a lock has to be still taken. So any user relying on locklessness has to ensure that sufficient memory is preallocated. and this is exactly what we're doing - adding new memory. So, until we're absolutely sure that it is ok to interrupt a context holding a spinlock with a #MC which is non-maskable, I don't think we wanna do this. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 20:10 ` Borislav Petkov @ 2024-02-12 20:44 ` Paul E. McKenney 2024-02-12 21:18 ` Luck, Tony 2024-02-12 21:27 ` Borislav Petkov 0 siblings, 2 replies; 68+ messages in thread From: Paul E. McKenney @ 2024-02-12 20:44 UTC (permalink / raw) To: Borislav Petkov Cc: Luck, Tony, Yazen Ghannam, Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On Mon, Feb 12, 2024 at 09:10:38PM +0100, Borislav Petkov wrote: > On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote: > > Yes. The question is whether a #MC that come in the middle of list_rcu_add() > > can safely do list_for_each_entry_rcu() on the same list. > > > > RCU is black magic ... maybe it can do this? Adding Paul. > > Yeah, the list traversal might be ok as this is what that list_add does > - you can't encounter an inconsistent list - but we still take > a spinlock on addition and the commit which added it: > > 7f184275aa30 ("lib, Make gen_pool memory allocator lockless") > > says > > The lockless operation only works if there is enough memory available. > If new memory is added to the pool a lock has to be still taken. So > any user relying on locklessness has to ensure that sufficient memory > is preallocated. > > and this is exactly what we're doing - adding new memory. Is the #MC adding new memory, or is the interrupted context adding new memory? > So, until we're absolutely sure that it is ok to interrupt a context > holding a spinlock with a #MC which is non-maskable, I don't think we > wanna do this. If it is the #MC adding new memory, agreed. If the #MC is simply traversing the list, and the interrupted context was in the midst of adding a new element, this should be no worse than some other CPU traversing the list while this CPU is in the midst of adding a new element. Or am I missing a turn in here somewhere? Thanx, Paul ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 20:44 ` Paul E. McKenney @ 2024-02-12 21:18 ` Luck, Tony 2024-02-12 21:27 ` Borislav Petkov 1 sibling, 0 replies; 68+ messages in thread From: Luck, Tony @ 2024-02-12 21:18 UTC (permalink / raw) To: paulmck, Borislav Petkov Cc: Yazen Ghannam, Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik >> and this is exactly what we're doing - adding new memory. > > Is the #MC adding new memory, or is the interrupted context adding new > memory? The interrupted context is adding the memory. >> So, until we're absolutely sure that it is ok to interrupt a context >> holding a spinlock with a #MC which is non-maskable, I don't think we >> wanna do this. > > If it is the #MC adding new memory, agreed. Not what is happening. > If the #MC is simply traversing the list, and the interrupted context > was in the midst of adding a new element, this should be no worse than > some other CPU traversing the list while this CPU is in the midst of > adding a new element. > > Or am I missing a turn in here somewhere? Not missing anything. I believe you've answered the question. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 20:44 ` Paul E. McKenney 2024-02-12 21:18 ` Luck, Tony @ 2024-02-12 21:27 ` Borislav Petkov 2024-02-12 22:46 ` Paul E. McKenney 1 sibling, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 21:27 UTC (permalink / raw) To: Paul E. McKenney Cc: Luck, Tony, Yazen Ghannam, Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote: > If it is the #MC adding new memory, agreed. > > If the #MC is simply traversing the list, and the interrupted context > was in the midst of adding a new element, this should be no worse than > some other CPU traversing the list while this CPU is in the midst of > adding a new element. Right, Tony answered which context is doing what. What I'm still scratching my head over is, why grab a spinlock around list_add_rcu(&chunk->next_chunk, &pool->chunks); ? That's the part that looks really weird. And that's the interrupted context, yap. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 21:27 ` Borislav Petkov @ 2024-02-12 22:46 ` Paul E. McKenney 2024-02-12 22:53 ` Luck, Tony 2024-02-12 23:10 ` Borislav Petkov 0 siblings, 2 replies; 68+ messages in thread From: Paul E. McKenney @ 2024-02-12 22:46 UTC (permalink / raw) To: Borislav Petkov Cc: Luck, Tony, Yazen Ghannam, Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On Mon, Feb 12, 2024 at 10:27:41PM +0100, Borislav Petkov wrote: > On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote: > > If it is the #MC adding new memory, agreed. > > > > If the #MC is simply traversing the list, and the interrupted context > > was in the midst of adding a new element, this should be no worse than > > some other CPU traversing the list while this CPU is in the midst of > > adding a new element. > > Right, Tony answered which context is doing what. > > What I'm still scratching my head over is, why grab a spinlock around > > list_add_rcu(&chunk->next_chunk, &pool->chunks); > > ? > > That's the part that looks really weird. > > And that's the interrupted context, yap. The usual reason is to exclude other CPUs also doing list_add_rcu() on the same list. Or is there other synchronization that is preventing concurrent updates? Thanx, Paul ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 22:46 ` Paul E. McKenney @ 2024-02-12 22:53 ` Luck, Tony 2024-02-12 23:10 ` Borislav Petkov 1 sibling, 0 replies; 68+ messages in thread From: Luck, Tony @ 2024-02-12 22:53 UTC (permalink / raw) To: paulmck, Borislav Petkov Cc: Yazen Ghannam, Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik > The usual reason is to exclude other CPUs also doing list_add_rcu() > on the same list. Or is there other synchronization that is preventing > concurrent updates? In this case the patch is trying to do a one-time bump of the pool size. So no races here. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 22:46 ` Paul E. McKenney 2024-02-12 22:53 ` Luck, Tony @ 2024-02-12 23:10 ` Borislav Petkov 2024-02-13 1:07 ` Paul E. McKenney 1 sibling, 1 reply; 68+ messages in thread From: Borislav Petkov @ 2024-02-12 23:10 UTC (permalink / raw) To: Paul E. McKenney Cc: Luck, Tony, Yazen Ghannam, Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote: > The usual reason is to exclude other CPUs also doing list_add_rcu() > on the same list. Doh, it even says so in the comment above list_add_rcu(). And the traversal which is happening in NMI-like context is fine. So phew, I think we should be fine here. Thanks! And as it turns out, we're not going to need any of that after all as it looks like we can allocate the proper size from the very beginning... Lovely. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-12 23:10 ` Borislav Petkov @ 2024-02-13 1:07 ` Paul E. McKenney 0 siblings, 0 replies; 68+ messages in thread From: Paul E. McKenney @ 2024-02-13 1:07 UTC (permalink / raw) To: Borislav Petkov Cc: Luck, Tony, Yazen Ghannam, Naik, Avadhut, Mehta, Sohil, x86, linux-edac, linux-kernel, Avadhut Naik On Tue, Feb 13, 2024 at 12:10:09AM +0100, Borislav Petkov wrote: > On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote: > > The usual reason is to exclude other CPUs also doing list_add_rcu() > > on the same list. > > Doh, it even says so in the comment above list_add_rcu(). > > And the traversal which is happening in NMI-like context is fine. > > So phew, I think we should be fine here. Thanks! > > And as it turns out, we're not going to need any of that after all as > it looks like we can allocate the proper size from the very beginning... Sounds even better! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 20:02 ` Naik, Avadhut 2024-02-09 20:09 ` Borislav Petkov @ 2024-02-09 20:16 ` Sohil Mehta 2024-02-09 20:28 ` Luck, Tony 1 sibling, 1 reply; 68+ messages in thread From: Sohil Mehta @ 2024-02-09 20:16 UTC (permalink / raw) To: Naik, Avadhut, x86, linux-edac Cc: bp, tony.luck, linux-kernel, yazen.ghannam, Avadhut Naik On 2/9/2024 12:02 PM, Naik, Avadhut wrote: > Is it safe to assume that users will always want to increase the size of the > pool and not decrease it? > > IMO, the command-line option provides flexibility for users to choose the size of > MCE Records pool in case, they don't agree with the CPU count logic. Just added it > to ensure that we are not enforcing this increased memory footprint across the board. > > Would you agree? > Not really. Providing this level of configuration seems excessive and unnecessary. To me, it seems that we are over-compensating with the calculations in the previous patch and then providing a mechanism to correct it here and putting this burden on the user. How about being more conservative with the allocations in the previous patch so that we don't need to introduce this additional mechanism right now? Later, if there is really a need for some specific usage, the patch can be re-submitted then with the supporting data. Sohil ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 20:16 ` Sohil Mehta @ 2024-02-09 20:28 ` Luck, Tony 2024-02-09 21:02 ` Sohil Mehta 0 siblings, 1 reply; 68+ messages in thread From: Luck, Tony @ 2024-02-09 20:28 UTC (permalink / raw) To: Mehta, Sohil, Naik, Avadhut, x86, linux-edac Cc: bp, linux-kernel, yazen.ghannam, Avadhut Naik > How about being more conservative with the allocations in the previous > patch so that we don't need to introduce this additional mechanism right > now? Later, if there is really a need for some specific usage, the patch > can be re-submitted then with the supporting data. There used to be a rule-of-thumb when configuring systems to have at least one GByte of memory per CPU. Anyone following that rule shouldn't be worried about sub-kilobyte allocations per CPU. -Tony ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool 2024-02-09 20:28 ` Luck, Tony @ 2024-02-09 21:02 ` Sohil Mehta 0 siblings, 0 replies; 68+ messages in thread From: Sohil Mehta @ 2024-02-09 21:02 UTC (permalink / raw) To: Luck, Tony, Naik, Avadhut, x86, linux-edac Cc: bp, linux-kernel, yazen.ghannam, Avadhut Naik On 2/9/2024 12:28 PM, Luck, Tony wrote: >> How about being more conservative with the allocations in the previous >> patch so that we don't need to introduce this additional mechanism right >> now? Later, if there is really a need for some specific usage, the patch >> can be re-submitted then with the supporting data. > > There used to be a rule-of-thumb when configuring systems to have at least > one GByte of memory per CPU. Anyone following that rule shouldn't be > worried about sub-kilobyte allocations per CPU. > I meant, to avoid the need for this second patch we can always start lower and increase it later. 256 bytes per cpu seems fine to me as done in the previous patch. But, if that seems too high as described by Avadhut below then maybe we can start with 200 bytes or any other number. It's just heuristic IIUC. https://lore.kernel.org/lkml/8d2d0dac-b188-4826-a43a-bb5fc0528f0d@amd.com/ Sohil ^ permalink raw reply [flat|nested] 68+ messages in thread
end of thread, other threads:[~2024-03-06 23:21 UTC | newest] Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-02-07 22:56 [PATCH 0/2] Extend size of the MCE Records pool Avadhut Naik 2024-02-07 22:56 ` [PATCH 1/2] x86/MCE: " Avadhut Naik 2024-02-08 0:02 ` Luck, Tony 2024-02-08 17:41 ` Naik, Avadhut 2024-02-08 17:47 ` Naik, Avadhut 2024-02-08 18:39 ` Luck, Tony 2024-02-09 19:47 ` Naik, Avadhut 2024-02-08 21:09 ` Sohil Mehta 2024-02-09 19:52 ` Naik, Avadhut 2024-02-07 22:56 ` [PATCH 2/2] x86/MCE: Add command line option to extend " Avadhut Naik 2024-02-09 1:36 ` Sohil Mehta 2024-02-09 20:02 ` Naik, Avadhut 2024-02-09 20:09 ` Borislav Petkov 2024-02-09 20:35 ` Naik, Avadhut 2024-02-09 20:51 ` Borislav Petkov 2024-02-10 7:52 ` Borislav Petkov 2024-02-10 21:15 ` Naik, Avadhut 2024-02-11 11:14 ` Borislav Petkov 2024-02-12 2:54 ` Naik, Avadhut 2024-02-12 8:58 ` Borislav Petkov 2024-02-12 9:32 ` Borislav Petkov 2024-02-12 17:29 ` Luck, Tony 2024-02-12 17:54 ` Borislav Petkov 2024-02-12 18:45 ` Luck, Tony 2024-02-12 19:14 ` Borislav Petkov 2024-02-12 19:41 ` Luck, Tony 2024-02-12 21:37 ` Tony Luck 2024-02-12 22:08 ` Borislav Petkov 2024-02-12 22:19 ` Borislav Petkov 2024-02-12 22:42 ` Borislav Petkov 2024-02-28 23:14 ` [PATCH] x86/mce: Dynamically size space for machine check records Tony Luck 2024-02-29 0:39 ` Sohil Mehta 2024-02-29 0:44 ` Luck, Tony 2024-02-29 1:56 ` Sohil Mehta 2024-02-29 15:49 ` Yazen Ghannam 2024-02-29 17:22 ` Tony Luck 2024-02-29 17:21 ` Tony Luck 2024-02-29 23:56 ` Sohil Mehta 2024-02-29 6:42 ` Naik, Avadhut 2024-02-29 8:39 ` Borislav Petkov 2024-02-29 17:47 ` Tony Luck 2024-02-29 18:28 ` Naik, Avadhut 2024-02-29 18:38 ` Luck, Tony 2024-02-29 17:26 ` Tony Luck 2024-03-06 21:52 ` Naik, Avadhut 2024-03-06 22:07 ` Luck, Tony 2024-03-06 23:21 ` Naik, Avadhut 2024-02-15 20:18 ` [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool Naik, Avadhut 2024-02-15 20:15 ` Naik, Avadhut 2024-02-15 20:14 ` Naik, Avadhut 2024-02-12 18:47 ` Yazen Ghannam 2024-02-12 18:58 ` Luck, Tony 2024-02-12 19:40 ` Naik, Avadhut 2024-02-12 20:18 ` Borislav Petkov 2024-02-12 20:51 ` Naik, Avadhut 2024-02-12 19:43 ` Yazen Ghannam 2024-02-12 19:49 ` Luck, Tony 2024-02-12 20:10 ` Borislav Petkov 2024-02-12 20:44 ` Paul E. McKenney 2024-02-12 21:18 ` Luck, Tony 2024-02-12 21:27 ` Borislav Petkov 2024-02-12 22:46 ` Paul E. McKenney 2024-02-12 22:53 ` Luck, Tony 2024-02-12 23:10 ` Borislav Petkov 2024-02-13 1:07 ` Paul E. McKenney 2024-02-09 20:16 ` Sohil Mehta 2024-02-09 20:28 ` Luck, Tony 2024-02-09 21:02 ` Sohil Mehta
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).