* OpenBMC on RCS platforms
@ 2021-04-23 14:30 Timothy Pearson
  2021-04-23 17:11 ` Patrick Williams
  2021-04-23 17:23 ` Ed Tanous
  0 siblings, 2 replies; 11+ messages in thread
From: Timothy Pearson @ 2021-04-23 14:30 UTC (permalink / raw)
  To: openbmc

All,

I'm reaching out after some internal discussion on how we can better integrate our platforms with the OpenBMC project.  As many of you may know, we have been using OpenBMC in our lineup of OpenPOWER-based server and desktop products, with a number of custom patches on top to better serve our target markets.

While we have had fairly good success with OpenBMC in the server / datacenter space, reception has been lukewarm at best in the desktop space.  This is not too surprising, given OpenBMC's historical focus on datacenter applications, but it is also becoming an expensive technical and PR pain point for us as the years go by.  To make matters worse, we're still shielding our desktop / workstation customer base to some degree from certain design decisions that persist in upstream OpenBMC, and we'd like to open discussion on all of these topics to see if a resolution can be found with minimal wasted effort from all sides.

Roughly speaking, we see issues in OpenBMC in 5 main areas:


== Fan control ==

Out of all of the various pain points we've dealt with over the years, this has proven the most costly and is responsible on its own for the lack of RCS platforms upstream in OpenBMC.

To be perfectly frank, OpenBMC's current fan control subsystem is a technical embarrassment, and not up to the high quality seen elsewhere in the project.  Worse, this multi-daemon DBUS-interconnected Rube Goldberg contraption has somehow managed to persist over the past 4+ years, likely because it reached a complexity level where it is both tightly integrated with the rest of the OpenBMC system and extremely difficult to understand, therefore it is equally difficult to replace.  Furthering the lack of progress is the fact that it is mostly "working" for datacenter applications, so there may be a "don't touch what isn't broken" mentality in play.  From a technical perspective, it is indirected to a sufficient level as to be nearly incomprehensible to most people, with the source spread across multiple different projects and repositories, yet somehow it remains rigid / fragile enough to not support basic features like runtime (or even post-compile) fan configuration for a given server.

What we need is a much simpler, more robust fan control daemon.  Ideally this would be one self-contained process, not multiple interconnected processes where a single failure causes the entire system to go into safe mode.

Our requirements:
1.) True PID control with tunable constants.  Trying to do things with PWM/temp maps alone may have made sense in the resource-constrained environments common in the 1970s, but it makes no sense on modern, powerful BMC silicon with hard floating point instructions.  Even the stock fan daemon implements a sort of bespoke integrator-by-another-name, but without the P and D components it does a terrible job outside of a constant-temperature datacenter environment.
2.) Tunable PID constants, tunable temperature thresholds, tunable min/max fan speeds, and arbitrary linkage between temp inputs (zones) and fan outputs (also zoned).
3.) Configurable zones -- both temperature and PWMs, as well as installed / not installed fans and temperature sensors.
4.) Configurable failure behavior.  A single failed or uninstalled chassis fan should NOT cause the entire platform to go into failsafe mode!
5.) A decent GUI to configure all of this, and the ability to export / import the settings.

To be fair, we've only been able to implement 1, 2, 3, and 4 above at compile time -- we don't have the runtime configuration options due to the way the fan systems work in OpenBMC right now, and the sheer amount of work needed to overhaul the GUI in the out-of-tree situation we remain stuck in.  With all that said, however, we point out that our competition, especially on x86 platforms, has all of these features and more, all neatly contained in a nice user-friendly point+click GUI.  OpenBMC should be able to easily match or exceed that functionality, but for some reason it seems stuck in datacenter-only mode with archaic hardcoded tables and constants.
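
To make requirement 1 concrete, this is roughly the shape of the per-zone loop we have in mind (purely illustrative Python, not code from any OpenBMC repository; the constants, setpoints, and clamping behavior are placeholders):

    # Illustrative per-zone PID loop; every tunable here is just a value that
    # could be loaded from a config file or changed over dbus/Redfish at runtime.
    class ZonePid:
        def __init__(self, kp, ki, kd, setpoint, pwm_min, pwm_max):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint
            self.pwm_min, self.pwm_max = pwm_min, pwm_max
            self.integral = 0.0
            self.prev_error = 0.0

        def step(self, temp_c, dt):
            # Positive error means we are above the setpoint and need more airflow.
            error = temp_c - self.setpoint
            self.integral += error * dt
            derivative = (error - self.prev_error) / dt
            self.prev_error = error
            out = self.kp * error + self.ki * self.integral + self.kd * derivative
            # Clamp to the zone's fan range and back off the integral on saturation.
            clamped = max(self.pwm_min, min(self.pwm_max, out))
            if clamped != out:
                self.integral -= error * dt
            return clamped

    # One zone: CPU temperature sensors driving the rear chassis fans.
    zone = ZonePid(kp=4.0, ki=0.2, kd=0.5, setpoint=65.0, pwm_min=20, pwm_max=100)
    pwm_percent = zone.step(temp_c=72.0, dt=1.0)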

== Local firmware updates ==

This is right behind fan control in terms of cost and PR damage for us vs. competing platforms.  While OpenBMC's firmware update support is very well tuned for datacenter operations (we use a simple SSH + pflash method on our large clusters, for example) it's absolutely terrible for desktop and workstation applications where a second PC is not guaranteed to be available, and where wired Ethernet even exists DHCP is either non-existent or provided by a consumer cable box.  Some method of flashing -- and recovering -- the BMC and host firmware right from the local machine is badly needed, especially for the WiFi-only environments we're starting to see more of in the wild.  Ideally this would be a command line tool / library such that we can integrate it with our bootloader or a GUI as desired.

== BMC boot time ==

This is self explanatory.  Other vendors' solutions allow the host to be powered on within seconds of power application from the wall, and even our own Kestrel soft BMC allows the host to begin booting less than 10 seconds after power is applied.  Several *minutes* for OpenBMC to reach a point where it can even start to boot the host is a major issue outside of datacenter applications.

== Host boot status indications ==

Any ODM that makes server products has had to deal with the psychological "dead server effect", where lack of visible progress during boot causes spurious callouts / RMAs.  It's even worse on desktop, especially if server-type hardware is used inside the machine.  We've worked around this a few times with our "IPL observer" services, and really do need this functionality in OpenBMC.  The current version we have is both front panel lights and a progress bar on the BMC boot monitor (VGA/HDMI), and this is something we're willing to contribute upstream.

== IPMI / BMC permissions ==

An item that's come up recently is that, at least on our older OpenBMC versions, there's a complete disconnect between the BMC's shell user database and the IPMI user database.  Resetting the BMC root password isn't possible from IPMI on the host, and setting up IPMI doesn't seem possible from the BMC shell.  If IPMI support is something OpenBMC provides alongside Redfish, it needs to be better integrated -- we're dealing with multiple locked-out BMC issues at the moment at various customer sites, and the recovery method is painful at best when it should be as simple as an ipmitool command from the host terminal.


If there is interest, I'd suggest we all work on getting some semblance of a modern fan control system and the boot status indication framework into upstream OpenBMC.  This would allow Raptor to start upstreaming base support for RCS product lines without risking severe regressions in user pain points like noisy fans -- perceived high noise levels are always a great way to kill sales of office products, and as a result the fan control functionality is something we're quite sensitive about.  The main problem is that with the existing fan control system's tentacles snaking everywhere including the UI, this will need to be a concerted effort by multiple organizations including the maintainers of the UI and the other ODMs currently using the existing fan control functionality.  We're willing to make yet another attempt *if* there's enough buy-in from the various stakeholders to ensure a prompt merge and update of the other components.

Finally, some of you may also be aware of our Kestrel project [1], which eschews the typical BMC ASICs, Linux, and OpenBMC itself.  I'd like to point out that this is not a direct competitor to OpenBMC, it is designed specifically for certain target applications with unique requirements surrounding overall size, functionality, speed, auditability, transparency, etc.  Why we have gone to those lengths will become apparent later this year, but suffice it to say we're considering Kestrel to be used in spaces where OpenBMC is not practical and vice versa.  In fact, we'd like to see OpenBMC run on the Kestrel SoCs (or a derivative thereof) at some point in the future, once the performance concerns above are sufficiently mitigated to make that practical.

[1] https://gitlab.raptorengineering.com/kestrel-collaboration/kestrel-litex/litex-boards/-/blob/master/README.md


* Re: OpenBMC on RCS platforms
  2021-04-23 14:30 OpenBMC on RCS platforms Timothy Pearson
@ 2021-04-23 17:11 ` Patrick Williams
  2021-04-23 18:46   ` Timothy Pearson
  2021-04-23 17:23 ` Ed Tanous
  1 sibling, 1 reply; 11+ messages in thread
From: Patrick Williams @ 2021-04-23 17:11 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: openbmc


On Fri, Apr 23, 2021 at 09:30:00AM -0500, Timothy Pearson wrote:
> All,
> 
> I'm reaching out after some internal discussion on how we can better integrate our platforms with the OpenBMC project.  As many of you may know, we have been using OpenBMC in our lineup of OpenPOWER-based server and desktop products, with a number of custom patches on top to better serve our target markets.

Hi Timothy,

Good to hear from your team again and hope there are some ways we can
work together on solving some of these issues.

> Roughly speaking, we see issues in OpenBMC in 5 main areas:

We might want to fork this into 5 different discussion threads and/or
design documents, but let's see how this goes...

> == Fan control ==
> 
> To be perfectly frank, OpenBMC's current fan control subsystem is a technical 
> embarrassment, and not up to the high quality seen elsewhere in the project.  
> Worse, this multi-daemon DBUS-interconnected Rube Goldberg contraption has 
> somehow managed to persist over the past 4+ years, likely because it reached
> a complexity level where it is both tightly integrated with the rest of the 
> OpenBMC system and extremely difficult to understand, therefore it is 
> equally difficult to replace.

This is, to me, a pretty unfair assessment of the situation, but I hear
you that the code is likely not very usable outside of datacenter use-cases.
Certainly there is some work that can be done to improve that and I
think we'd be receptive to having partners on it.  The vast majority of
developers on the project *are* working on datacenter use-cases though,
so I don't know if there is anyone actively taking up the mantle on
this.  This could be a good area of expertise and contribution from your
team (I personally don't really know where to start on making
a desktop-friendly fan control algorithm).

I'm not sure what you mean by "this multi-daemon DBUS-interconnected Rube
Goldberg contraption" though.  There are really 3 dbus interfaces around
Fan control:
    - xyz.openbmc_project.Sensor.Value
    - xyz.openbmc_project.Control.FanPwm
    - xyz.openbmc_project.Control.FanSpeed
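
As a concrete illustration (with hypothetical service names and object paths, since those are platform-specific, and property names that should be double-checked against phosphor-dbus-interfaces), reading a temperature and setting a fan PWM through these interfaces is just a dbus property get/set:

    import dbus

    bus = dbus.SystemBus()

    # Hypothetical hwmon service name and sensor path, for illustration only.
    sensor = bus.get_object('xyz.openbmc_project.Hwmon-1234.Hwmon1',
                            '/xyz/openbmc_project/sensors/temperature/cpu0_temp')
    props = dbus.Interface(sensor, 'org.freedesktop.DBus.Properties')
    temp = props.Get('xyz.openbmc_project.Sensor.Value', 'Value')

    # Likewise for the PWM output; Target is (to my recollection) a uint64.
    fan = bus.get_object('xyz.openbmc_project.Hwmon-5678.Hwmon2',
                         '/xyz/openbmc_project/control/fanpwm/fan0')
    fan_props = dbus.Interface(fan, 'org.freedesktop.DBus.Properties')
    fan_props.Set('xyz.openbmc_project.Control.FanPwm', 'Target', dbus.UInt64(128))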

I don't like that we ended up with FanPwm and FanSpeed, but the fact is
that there are two different hardware models for controlling fans and
the people working with PWM-based fans didn't want to put in the effort
to control them with target speeds.  (I think there was an argument that
*their* fan control algorithm experts liked %-of-PWM as a calibration,
and weren't able to come to consensus otherwise)

I don't know what you mean by Rube Goldberg here or how to make the
situation any better.  All sensors are read by sensor daemons using
common APIs like Linux HWMon and there is a similar "set the fan speed in
hardware" daemon.  Perhaps you could eliminate the Control.Fan*
interfaces (and merge them into a fan control daemon directly) but there
were some people who wanted to be able to manually control fan speeds in
some scenarios anyway.  In any case, I'm not really seeing a lot of
simplification that could even be done, and certainly no undue
mechanisation that would qualify as "Rube Goldberg".

There is a xyz.openbmc_project.Control.FanRedundancy interface, but I
suspect that is used outside the use cases you intend anyhow and really
it is optional to fan control.  Similarly, anything under Inventory is
just that... Inventory; it is not critical to fan control.

> Furthering the lack of progress is the fact that it is mostly "working" for
> datacenter applications, so there may be a "don't touch what isn't broken"
> mentality in play.  

As I hinted at above, I think it is a lack of necessity and not a fear
of breaking.  In general, as a community we should not be afraid of
change.  We have plenty of test cases to qualify code that is changing
and if there aren't test cases for a functional area then it is fair game
to change without worry, in my opinion.

> From a technical perspective, it is indirected to a sufficient level as to
> be nearly incomprehensible to most people, with the source spread across
> multiple different projects and repositories, yet somehow it remains 
> rigid / fragile enough to not support basic features like runtime (or even
> post-compile) fan configuration for a given server.

There are two different fan control implementations presently:
    - phosphor-pid-control (swampd)
    - phosphor-fan-presence (phosphor-fan-control)

Which of these are you having issue with?  They are intended to serve
drastically different purposes.

I don't think anyone outside of IBM uses phosphor-fan-control.  It seems
to be explicitly designed for their systems with their own requirements.
Unless they speak up, I don't know how we intend anyone else to use this
code and it probably should be renamed 'ibm-fan-control'.

> What we need is a much simpler, more robust fan control daemon.  Ideally this would be one self-contained process, not multiple interconnected processes where a single failure causes the entire system to go into safe mode.
> 
> Our requirements:
> 1.) True PID control with tunable constants.  Trying to do things with PWM/temp maps alone may have made sense in the resource-constrained environments common in the 1970s, but it makes no sense on modern, powerful BMC silicon with hard floating point instructions.  Even the stock fan daemon implements a sort of bespoke integrator-by-another-name, but without the P and D components it does a terrible job outside of a constant-temperature datacenter environment.

Isn't phosphor-pid-control this already?

> 2.) Tunable PID constants, tunable temperature thresholds, tunable min/max fan speeds, and arbitrary linkage between temp inputs (zones) and fan outputs (also zoned).
> 3.) Configurable zones -- both temperature and PWMs, as well as installed / not installed fans and temperature sensors.

I think these two features are the ones that are more interesting to
non-datacenter use cases, and so nobody has put effort into them.  As much
as you seem to dislike dbus mechanization, this sounds like we would
need a few interfaces defined for these so that Redfish has something to
poke at.  

I do know some BIOS vendors provide this for desktops already.  Is there
anything at an IPMI level that could facilitate the hand-off of this?

> 4.) Configurable failure behavior.  A single failed or uninstalled chassis fan should NOT cause the entire platform to go into failsafe mode!

phosphor-fan-presence does provide some of this, but again, I feel like
it is tuned to IBM's needs.  It appears that phosphor-pid-control has
some amount of implementation of Control.FanRedundancy that I mentioned
earlier.  Are you sure that phosphor-pid-control causes the system to go
into fail-safe mode from a single fan failure though?  I've not heard
this.

> 5.) A decent GUI to configure all of this, and the ability to export / import the settings.

Sure...

> To be fair, we've only been able to implement 1, 2, 3, and 4 above at
> compile time -- we don't have the runtime configuration options due to the
> way the fan systems work in OpenBMC right now, and the sheer amount of 
> work needed to overhaul the GUI in the out-of-tree situation we remain
> stuck in.  With all that said, however, we point out that our competition,
> especially on x86 platforms, has all of these features and more, all neatly
> contained in a nice user-friendly point+click GUI.  OpenBMC should be able 
> to easily match or exceed that functionality, but for some reason it seems
> stuck in datacenter-only mode with archaic hardcoded tables and constants.

So, if you've done 1-4, are there any commits in Gerrit?  Which fan
control daemon were they done against?

There is a certain air to what you wrote that rubs me the wrong way.
We're not a product that you've paid for here to do what you, the
customer, are asking of us.  This is an open source community and one
that most of us are paid to work on by our employer.  We don't do work
to make you happy, but do work because our bosses are asking for certain
features out of us.  As I said above, almost everyone here is working on
"datacenter-only systems", so why would anyone else invest in this use
case?

This is *your* business model.  We'd certainly love to have contributions
from your team and most of us would even spend some of our time to help 
you in your efforts.  But, if you want this, as they often say:
                    "Patches Welcome!"

I do see some code from early 2020 from you against phosphor-hwmon and
phosphor-fan-presence.  All of the phosphor-hwmon commits failed CI testing,
so nobody ever looked at them.  All of the phosphor-fan-presence commits
received timely feedback, to which you never responded, and seemed to be
missing an updated CCLA from Raptor?  Is there something we should have
done as a community to keep this work going?

> == Local firmware updates ==
> 
> This is right behind fan control in terms of cost and PR damage for us vs. competing platforms.  While OpenBMC's firmware update support is very well tuned for datacenter operations (we use a simple SSH + pflash method on our large clusters, for example) it's absolutely terrible for desktop and workstation applications where a second PC is not guaranteed to be available, and where wired Ethernet even exists DHCP is either non-existent or provided by a consumer cable box.  Some method of flashing -- and recovering -- the BMC and host firmware right from the local machine is badly needed, especially for the WiFi-only environments we're starting to see more of in the wild.  Ideally this would be a command line tool / library such that we can integrate it with our bootloader or a GUI as desired.

This sounds to me pretty easily obtainable and what I have in mind is
actually a valid data center use case for many of us.  When all else
fails, you should be able to use a USB key to update the system
(assuming the image you're updating with is trusted for whatever your
system determines is trust-worthy).  I'm pretty sure our OCP systems can
be updated with a magic combination of a USB-key and an OCP debug
card(*).  I don't think that is currently implemented on openbmc/openbmc,
but it is on our list of pending features.

For your specific users, the OCP debug card is probably not a good
requirement, but you could likely automate the update whenever a USB-key
plus text file is added?  (I'm just brainstorming how you'd know to kick
it off).  The current software update code probably isn't too far off
from being able to facilitate this for you.

https://www.opencompute.org/documents/facebook-ocp-debug-card-with-lcd-spec_v1p0
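
A rough sketch of the kind of trigger I'm brainstorming (the mount point, marker-file name, and image drop directory below are all assumptions for illustration; the real hand-off would be whatever the existing software update code expects):

    import os
    import shutil
    import time

    # Hypothetical mount point and marker file; a udev rule or systemd
    # mount/path unit would be the real trigger in practice.
    USB_MOUNT = '/run/media/usb0'
    MARKER = 'apply-bmc-update.txt'
    IMAGE = 'obmc-phosphor-image.tar'
    # Directory the existing image manager is assumed to watch for new images.
    UPDATE_DROP_DIR = '/tmp/images'

    def maybe_start_update():
        marker = os.path.join(USB_MOUNT, MARKER)
        image = os.path.join(USB_MOUNT, IMAGE)
        if os.path.exists(marker) and os.path.exists(image):
            # Hand the image to the normal update path; signature verification
            # still happens there, the USB key is only a transport.
            shutil.copy(image, UPDATE_DROP_DIR)
            return True
        return False

    while not maybe_start_update():
        time.sleep(5)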

> == BMC boot time ==
> 
> This is self explanatory.  Other vendors' solutions allow the host to be powered on within seconds of power application from the wall, and even our own Kestrel soft BMC allows the host to begin booting less than 10 seconds after power is applied.  Several *minutes* for OpenBMC to reach a point where it can even start to boot the host is a major issue outside of datacenter applications.

Some of this is, to me, an artifact of the Power architecture and not an
artifact of OpenBMC explicitly.  On x86 systems we have a little code in
u-boot that wiggles a GPIO and gets the Host power sequence going while
the BMC is booting up.  This overlaps quite a bit of the memory testing
of the Host with the BMC boot time.  The "well-known proprietary BMC"
also does this same trick.

Power requires the BMC to be up in order to serve out the virtual PNOR,
from my recollection.  It seems like this could be solved in other ways,
such as a SPI-mux on a physical SPI-NOR so that the BMC can take the NOR
at specific times during update but otherwise it is given to the host
CPUs.  This is exactly what we do on x86 systems.

Having said all of that, there are certainly some performance
improvements that can be done, but nobody has taken up the torch on it.
A big low-hanging fruit in my mind is the file system compression: xz
and gzip are very computationally intensive.  I did some work, with
Nick Terrell, to switch to zstd on our systems for both the kernel
initramfs and UBI and saw significant boot time improvements.  The
upstream enablement for this appears to have landed as of v5.9 so we
could certainly start enabling it here now.

https://lore.kernel.org/linux-kbuild/20200730190841.2071656-7-nickrterrell@gmail.com/

> == Host boot status indications ==
> 
> Any ODM that makes server products has had to deal with the psychological "dead server effect", where lack of visible progress during boot causes spurious callouts / RMAs.  It's even worse on desktop, especially if server-type hardware is used inside the machine.  We've worked around this a few times with our "IPL observer" services, and really do need this functionality in OpenBMC.  The current version we have is both front panel lights and a progress bar on the BMC boot monitor (VGA/HDMI), and this is something we're willing to contribute upstream.

Great!  Let's get that merged!

I do think some others have support for a 7-seg display with the
postcodes going to it already.  I think this is along those same lines.
It might just be another back-end for our existing post code daemon to
replicate them to the VGA and/or blink morse code on an LED.

> == IPMI / BMC permissions ==
> 
> An item that's come up recently is that, at least on our older OpenBMC versions, there's a complete disconnect between the BMC's shell user database and the IPMI user database.  Resetting the BMC root password isn't possible from IPMI on the host, and setting up IPMI doesn't seem possible from the BMC shell.  If IPMI support is something OpenBMC provides alongside Redfish, it needs to be better integrated -- we're dealing with multiple locked-out BMC issues at the moment at various customer sites, and the recovery method is painful at best when it should be as simple as an ipmitool command from the host terminal.

I suspect most of this is a matter of IPMI command support and/or enabling
those commands to the host IPMI path.  Most of us are fairly untrusting
of IPMI (and the Host itself), so there hasn't been work to do anything
here.  As long as whatever you're proposing can be disabled for models
where we distrust the Host, it seems like these would be accepted as
well.

> If there is interest, I'd suggest we all work on getting some semblance of a modern fan control system and the boot status indication framework into upstream OpenBMC.  This would allow Raptor to start upstreaming base support for RCS product lines without risking severe regressions in user pain points like noisy fans -- perceived high noise levels are always a great way to kill sales of office products, and as a result the fan control functionality is something we're quite sensitive about.  The main problem is that with the existing fan control system's tentacles snaking everywhere including the UI, this will need to be a concerted effort by multiple organizations including the maintainers of the UI and the other ODMs currently using the existing fan control functionality.  We're willing to make yet another attempt *if* there's enough buy-in from the various stakeholders to ensure a prompt merge and update of the other components.

This would be great.  Hopefully nothing I wrote here was too harsh or
turned you off.  One piece of advice though...

Even if you find that some of the changes you propose are met with some
resistance, it would be good to get your base system support upstreamed
and continue to hold your extra sauce off on the side.  I know there
have been complaints from some owners of Raptor hardware that they cannot
use upstream code improvements on their own hardware because of the
forked nature of your code base.  The way forward, to me, is to get
your hardware configuration upstreamed first and work on these extra
features separately.  If one of your customers wants to use upstream,
with the caveat that they lose out on a few super awesome features, they
can make that decision, but the important thing is that your machine
doesn't get "left behind".

-- 
Patrick Williams



* Re: OpenBMC on RCS platforms
  2021-04-23 14:30 OpenBMC on RCS platforms Timothy Pearson
  2021-04-23 17:11 ` Patrick Williams
@ 2021-04-23 17:23 ` Ed Tanous
  2021-04-23 19:00   ` Timothy Pearson
  1 sibling, 1 reply; 11+ messages in thread
From: Ed Tanous @ 2021-04-23 17:23 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: openbmc

On Fri, Apr 23, 2021 at 7:36 AM Timothy Pearson
<tpearson@raptorengineering.com> wrote:
>

First off, this is great feedback, and despite some of my comments
below, I do really appreciate you putting it out there.

> All,
>
> I'm reaching out after some internal discussion on how we can better integrate our platforms with the OpenBMC project.  As many of you may know, we have been using OpenBMC in our lineup of OpenPOWER-based server and desktop products, with a number of custom patches on top to better serve our target markets.
>
> While we have had fairly good success with OpenBMC in the server / datacenter space, reception has been lukewarm at best in the desktop space.  This is not too surprising, given OpenBMC's historical focus on datacenter applications, but it is also becoming an expensive technical and PR pain point for us as the years go by.  To make matters worse, we're still shielding our desktop / workstation customer base to some degree from certain design decisions that persist in upstream OpenBMC, and we'd like to open discussion on all of these topics to see if a resolution can be found with minimal wasted effort from all sides.
>
> Roughly speaking, we see issues in OpenBMC in 5 main areas:
>
>
> == Fan control ==
>
> Out of all of the various pain points we've dealt with over the years, this has proven the most costly and is responsible on its own for the lack of RCS platforms upstream in OpenBMC.
>
> To be perfectly frank, OpenBMC's current fan control subsystem is a technical embarrassment, and not up to the high quality seen elsewhere in the project.

Which fan control subsystem are you referring to?  Phosphor-fans or
phosphor-pid-control?

>  Worse, this multi-daemon DBUS-interconnected Rube Goldberg contraption has somehow managed to persist over the past 4+ years, likely because it reached a complexity level where it is both tightly integrated with the rest of the OpenBMC system and extremely difficult to understand, therefore it is equally difficult to replace.  Furthering the lack of progress is the fact that it is mostly "working" for datacenter applications, so there may be a "don't touch what isn't broken" mentality in play.

I'm not really sure I agree with that.  If someone came with a design
for "We should replace dbus with X", had good technical foundations
for why X was better, and was putting forward the monumental effort to
do the work, I know that I personally wouldn't be opposed.  For the
record, I agree with you about the complexity here, but most of the
ideas I've heard to make it better were "Throw everything out and
start over", which, if that's what you want to do, by all means do,
but I don't think the community is willing to redo all of the untold
hours of engineering effort spent over the years the project has
existed.

FWIW, u-bmc was a project that took the existing kernel, threw out all
the userspace and started over.  From my view outside the project,
they seem to have failed to gain traction, and only support a couple
of platforms.

>  From a technical perspective, it is indirected to a sufficient level as to be nearly incomprehensible to most people, with the source spread across multiple different projects and repositories, yet somehow it remains rigid / fragile enough to not support basic features like runtime (or even post-compile) fan configuration for a given server.

With respect, this statement is incorrect.  On an entity-manager
enabled system + phosphor-pid-control, all of the fan control
parameters are fully modifiable at runtime either from within the
system (through dbus) or through Redfish out of band through the
OEMManager API.  If you haven't ported your systems to entity-manager
yet, there are quite a few people doing it at the moment and discussing
this stuff on discord basically every day who I'm sure would be able to
give you some direction on where to start getting your systems moved
over.
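
For a flavor of what that looks like out of band, here is an approximate sketch using Python requests -- the BMC address and credentials are placeholders, and the exact OEM property names and PATCH shape should be checked against the bmcweb schema on a current build rather than taken from this example:

    import requests

    BMC = 'https://bmc.example.com'   # hypothetical BMC address
    AUTH = ('admin', 'password')      # hypothetical credentials

    # Read the manager resource; on entity-manager systems the fan/PID
    # configuration is exposed under an OEM section of this resource.
    mgr = requests.get(f'{BMC}/redfish/v1/Managers/bmc', auth=AUTH, verify=False)
    fan_cfg = mgr.json().get('Oem', {}).get('OpenBmc', {}).get('Fan', {})

    # PATCH an updated proportional coefficient for one controller
    # ("CPU_Fan" and "PCoefficient" are assumed names for illustration).
    patch = {'Oem': {'OpenBmc': {'Fan': {
        'PidControllers': {'CPU_Fan': {'PCoefficient': 4.0}}}}}}
    requests.patch(f'{BMC}/redfish/v1/Managers/bmc', json=patch,
                   auth=AUTH, verify=False)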

>
> What we need is a much simpler, more robust fan control daemon.  Ideally this would be one self-contained process, not multiple interconnected processes where a single failure causes the entire system to go into safe mode.

In phosphor-pid-control, the failure modes are configurable per zone,
and include things like N failures to failsafe, or an adjusted fan floor
on failsafe.  If what's there doesn't meet your needs, I'm sure we can
discuss adding something else (I know there's at least one feature in
review in this area that you might check out on gerrit.)

>
> Our requirements:
> 1.) True PID control with tunable constants.  Trying to do things with PWM/temp maps alone may have made sense in the resource-constrained environments common in the 1970s, but it makes no sense on modern, powerful BMC silicon with hard floating point instructions.  Even the stock fan daemon implements a sort of bespoke integrator-by-another-name, but without the P and D components it does a terrible job outside of a constant-temperature datacenter environment.

phosphor-pid-control implements PI based fan control.  If you really
wanted to add D, it would be an easy addition, but in practice, most
server control loops have enough noise, and a low enough loop
bandwidth that a D component isn't useful, so it was omitted from the
initial version.

> 2.) Tunable PID constants, tunable temperature thresholds, tunable min/max fan speeds, and arbitrary linkage between temp inputs (zones) and fan outputs (also zoned).

All of this exists in phosphor-pid-control.  Example:
https://github.com/openbmc/entity-manager/blob/a5a716dadfbf97b601577276cc699af8f662beeb/configurations/WFT%20Baseboard.json#L1100

> 3.) Configurable zones -- both temperature and PWMs, as well as installed / not installed fans and temperature sensors.

Also exists in phosphor-pid-control.  Example:
https://github.com/openbmc/entity-manager/blob/ec98491a00c5dcffae6be362e483380c807f234c/configurations/R2000%20Chassis.json#L411

> 4.) Configurable failure behavior.  A single failed or uninstalled chassis fan should NOT cause the entire platform to go into failsafe mode!

Also exists in phosphor-pid-control.  Example of allowing single rotor
failures to not cause the system to hit failsafe:
https://github.com/openbmc/entity-manager/blob/ec98491a00c5dcffae6be362e483380c807f234c/configurations/R1000%20Chassis.json#L303

> 5.) A decent GUI to configure all of this, and the ability to export / import the settings.

Doesn't exist, but considering we already have the Redfish API for
this, it should be relatively easy to execute within webui-vue.  With
that said, I've had this on my "Great idea for an intern project" list
for some time now.  If you have engineers to spare (or you're
interested in implementing this yourself) feel free to hop on discord
and I can help get you ramped on getting this started and how those
interfaces work.

>
> To be fair, we've only been able to implement 1, 2, 3, and 4 above at compile time -- we don't have the runtime configuration options due to the way the fan systems work in OpenBMC right now, and the sheer amount of work needed to overhaul the GUI in the out-of-tree situation we remain stuck in.  With all that said, however, we point out that our competition, especially on x86 platforms, has all of these features and more, all neatly contained in a nice user-friendly point+click GUI.  OpenBMC should be able to easily match or exceed that functionality, but for some reason it seems stuck in datacenter-only mode with archaic hardcoded tables and constants.
>
> == Local firmware updates ==
>
> This is right behind fan control in terms of cost and PR damage for us vs. competing platforms.  While OpenBMC's firmware update support is very well tuned for datacenter operations (we use a simple SSH + pflash method on our large clusters, for example) it's absolutely terrible for desktop and workstation applications where a second PC is not guaranteed to be available, and where wired Ethernet even exists DHCP is either non-existent or provided by a consumer cable box.  Some method of flashing -- and recovering -- the BMC and host firmware right from the local machine is badly needed, especially for the WiFi-only environments we're starting to see more of in the wild.  Ideally this would be a command line tool / library such that we can integrate it with our bootloader or a GUI as desired.

You might check Intel's OpenBMC fork; I believe they had u-boot
patches to do this that you might consider upstreaming, or working
with them to upstream them.

>
> == BMC boot time ==
>
> This is self explanatory.  Other vendors' solutions allow the host to be powered on within seconds of power application from the wall, and even our own Kestrel soft BMC allows the host to begin booting less than 10 seconds after power is applied.  Several *minutes* for OpenBMC to reach a point where it can even start to boot the host is a major issue outside of datacenter applications.

While this is great information to have, it's a little disingenuous given
that we've significantly reduced the boot time in the last
few years with things like dropping python, and porting the mapper to
a compiled language.  We can always do better, but unless you have
concrete ideas on how we can continue reducing this, there's very
little OpenBMC can do.

>
> == Host boot status indications ==
>
> Any ODM that makes server products has had to deal with the psychological "dead server effect", where lack of visible progress during boot causes spurious callouts / RMAs.  It's even worse on desktop, especially if server-type hardware is used inside the machine.  We've worked around this a few times with our "IPL observer" services, and really do need this functionality in OpenBMC.  The current version we have is both front panel lights and a progress bar on the BMC boot monitor (VGA/HDMI), and this is something we're willing to contribute upstream.

For some reason I thought we already had code to allow the BMC to post
a splash screen ahead of processor boot, but I'm not recalling what it
was called, as I've never had this requirement myself.

>
> == IPMI / BMC permissions ==
>
> An item that's come up recently is that, at least on our older OpenBMC versions, there's a complete disconnect between the BMC's shell user database and the IPMI user database.  Resetting the BMC root password isn't possible from IPMI on the host, and setting up IPMI doesn't seem possible from the BMC shell.  If IPMI support is something OpenBMC provides alongside Redfish, it needs to be better integrated -- we're dealing with multiple locked-out BMC issues at the moment at various customer sites, and the recovery method is painful at best when it should be as simple as an ipmitool command from the host terminal.

I thought this was fixed long ago.  User passwords and user accounts
are common between redfish, ipmi, and ssh.  Do you think you could try
a more recent build and see if this is still an issue for you?

>
>
> If there is interest, I'd suggest we all work on getting some semblance of a modern fan control system and the boot status indication framework into upstream OpenBMC.  This would allow Raptor to start upstreaming base support for RCS product lines without risking severe regressions in user pain points like noisy fans -- perceived high noise levels are always a great way to kill sales of office products, and as a result the fan control functionality is something we're quite sensitive about.  The main problem is that with the existing fan control system's tentacles snaking everywhere including the UI, this will need to be a concerted effort by multiple organizations including the maintainers of the UI and the other ODMs currently using the existing fan control functionality.  We're willing to make yet another attempt *if* there's enough buy-in from the various stakeholders to ensure a prompt merge and update of the other components.

I'd really prefer you look at what already exists.  I think most of
your concerns are covered in phosphor-pid-control today, and if they
aren't, I suspect we can add new parts to the control loop where
needed.

>
> Finally, some of you may also be aware of our Kestrel project [1], which eschews the typical BMC ASICs, Linux, and OpenBMC itself.  I'd like to point out that this is not a direct competitor to OpenBMC, it is designed specifically for certain target applications with unique requirements surrounding overall size, functionality, speed, auditability, transparency, etc.  Why we have gone to those lengths will become apparent later this year, but suffice it to say we're considering Kestrel to be used in spaces where OpenBMC is not practical and vice versa.  In fact, we'd like to see OpenBMC run on the Kestrel SoCs (or a derivative thereof) at some point in the future, once the performance concerns above are sufficiently mitigated to make that practical.
>
> [1] https://gitlab.raptorengineering.com/kestrel-collaboration/kestrel-litex/litex-boards/-/blob/master/README.md


* Re: OpenBMC on RCS platforms
  2021-04-23 17:11 ` Patrick Williams
@ 2021-04-23 18:46   ` Timothy Pearson
  2021-04-26 21:42     ` Milton Miller II
  0 siblings, 1 reply; 11+ messages in thread
From: Timothy Pearson @ 2021-04-23 18:46 UTC (permalink / raw)
  To: Patrick Williams; +Cc: openbmc



----- Original Message -----
> From: "Patrick Williams" <patrick@stwcx.xyz>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "openbmc" <openbmc@lists.ozlabs.org>
> Sent: Friday, April 23, 2021 12:11:26 PM
> Subject: Re: OpenBMC on RCS platforms

> On Fri, Apr 23, 2021 at 09:30:00AM -0500, Timothy Pearson wrote:
>> All,
>> 
>> I'm reaching out after some internal discussion on how we can better integrate
>> our platforms with the OpenBMC project.  As many of you may know, we have been
>> using OpenBMC in our lineup of OpenPOWER-based server and desktop products,
>> with a number of custom patches on top to better serve our target markets.
> 
> Hi Timothy,
> 
> Good to hear from your team again and hope there are some ways we can
> work together on solving some of these issues.
> 
>> Roughly speaking, we see issues in OpenBMC in 5 main areas:
> 
> We might want to fork this into 5 different discussion threads and/or
> design documents, but let's see how this goes...
> 
>> == Fan control ==
>> 
>> To be perfectly frank, OpenBMC's current fan control subsystem is a technical
>> embarrassment, and not up to the high quality seen elsewhere in the project.
>> Worse, this multi-daemon DBUS-interconnected Rube Goldberg contraption has
>> somehow managed to persist over the past 4+ years, likely because it reached
>> a complexity level where it is both tightly integrated with the rest of the
>> OpenBMC system and extremely difficult to understand, therefore it is
>> equally difficult to replace.
> 
> This is, to me, a pretty unfair assessment of the situation, but I hear
> you that the code is likely not very usable outside of datacenter use-cases.
> Certainly there is some work that can be done to improve that and I
> think we'd be receptive to having partners on it.  The vast majority of
> developers on the project *are* working on datacenter use-cases though,
> so I don't know if there is anyone actively taking up the mantle on
> this.  This could be a good area of expertise and contribution from your
> team (I personally don't really know where to start on making
> a desktop-friendly fan control algorithm).
> 
> I'm not sure what you mean by "this multi-daemon DBUS-interconnected Rube
> Goldberg contraption" though.  There are really 3 dbus interfaces around
> Fan control:
>    - xyz.openbmc_project.Sensor.Value
>    - xyz.openbmc_project.Control.FanPwm
>    - xyz.openbmc_project.Control.FanSpeed
> 
> I don't like that we ended up with FanPwm and FanSpeed, but the fact is
> that there are two different hardware models for controlling fans and
> the people working with PWM-based fans didn't want to put in the effort
> to control them with target speeds.  (I think there was an argument that
> *their* fan control algorithm experts liked %-of-PWM as a calibration,
> and weren't able to come to consensus otherwise)
> 
> I don't know what you mean by Rube Goldberg here or how to make the
> situation any better.  All sensors are read by sensor daemons using
> common APIs like Linux HWMon and there is a similar "set the fan speed in
> hardware" daemon.  Perhaps you could eliminate the Control.Fan*
> interfaces (and merge them into a fan control daemon directly) but there
> were some people who wanted to be able to manually control fan speeds in
> some scenarios anyway.  In any case, I'm not really seeing a lot of
> simplification that could even be done, and certainly no undue
> mechanisation that would qualify as "Rube Goldberg".
> 
> There is a xyz.openbmc_project.Control.FanRedundancy interface, but I
> suspect that is used outside the use cases you intend anyhow and really
> it is optional to fan control.  Similarly, anything under Inventory is
> just that... Inventory; it is not critical to fan control.

I admit I was a bit harsh here, but in this particular case I think I may be justified.  Hear me out.. :)

Looking at one of our systems, the following daemons and scripts all have to run just to provide basic fan control:

fan_control.exe
phosphor-fan-presence-tach
phosphor-fan-monitor
phosphor-fan-control
phosphor-hwmon-readd (multiple instances)
openpower-occ-control
occ-active.sh

We don't have anything against DBUS per se, but what I do see here is that DBUS has been used as a crutch to easily glue together four (!) different services that are so closely linked that they really should all be integrated into one dedicated service.  Also on display here is a bit of the haphazard design that afflicts the fan control system -- the separate daemons providing raw sensor values may well make sense as a sort of HAL, but there's no equivalent for interfacing with the raw PWM settings.

When I next look at how these seven services are configured, I see an overly complex configuration scheme involving both build and run time files.  The run time files are generated at compile time from yet other input files, most of them YAML, some of them conf files and a few straight up shell snippets:

fans/phosphor-fan-control-events-config-native/events.yaml
fans/phosphor-fan-control-fan-config-native/fans.yaml
fans/phosphor-fan-control-zone-conditions-config-native/zone_conditions.yaml
fans/phosphor-fan-control-zone-config-native/zones.yaml
fans/phosphor-fan-monitor-config-native/monitor.yaml
fans/phosphor-fan-presence-config-native/config.yaml
fans/talos-thermal-policy/thermal-policy.yaml
fans/talos-fan-policy/air-cooled.yaml
fans/talos-fan-policy/fan-errors.yaml
fans/talos-fan-policy/water-cooled.yaml
fans/phosphor-fan/obmc/phosphor-fan/phosphor-cooling-type-0.conf
fans/talos-fan-watchdog/obmc/talos-fan-watchdog/fan-watchdog.conf
fans/talos-fan-watchdog/obmc/talos-fan-watchdog/reset-fan-watchdog.conf
sensors/phosphor-hwmon/obmc/hwmon/ahb/apb/bus@1e78a000/i2c-bus@440/max31785@52.conf
sensors/phosphor-hwmon/obmc/hwmon/ahb/apb/bus@1e78a000/i2c-bus@440/w83773g@4c.conf
sensors/phosphor-hwmon/obmc/hwmon/devices/platform/gpio-fsi/fsi0/slave@00--00/00--00--00--06/sbefifo1-dev0/occ-hwmon.1.conf
sensors/phosphor-hwmon/obmc/hwmon/devices/platform/gpio-fsi/fsi0/slave@00--00/00--00--00--0a/fsi1/slave@01--00/01--01--00--06/sbefifo2-dev0/occ-hwmon.2.conf

I'm sure there are more, but I'm not motivated to find them at the moment.  All of that configuration mess is required for a single platform with the simple design of six fans, six tachs, and five temperature sources, grouped into three zones.  None of it is runtime configurable; all of those YAML files go through a bunch of preprocessing and eventually end up as source code that is hard-compiled into the fan daemons.  If the user wants to alter so much as a single PID constant, the entire stack has to be recompiled with the new settings and reflashed.
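
What we would like instead is for the control daemon to simply read a plain configuration file at startup and re-read it on request, something like the sketch below (file path and keys are made up for illustration, not a proposal for a specific schema):

    import json
    import signal

    CONFIG_PATH = '/etc/fan-control/zones.json'  # hypothetical path

    config = {}

    def load_config(*_args):
        # Re-read tunables (PID constants, setpoints, zone membership) without
        # recompiling or reflashing anything.
        global config
        with open(CONFIG_PATH) as f:
            config = json.load(f)

    signal.signal(signal.SIGHUP, load_config)  # e.g. reload on SIGHUP
    load_config()

    kp = config['zones']['cpu']['pid']['kp']   # example lookup of one tunable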

>> Furthering the lack of progress is the fact that it is mostly "working" for
>> datacenter applications, so there may be a "don't touch what isn't broken"
>> mentality in play.
> 
> As I hinted at above, I think it is a lack of necessity and not a fear
> of breaking.  In general, as a community we should not be afraid of
> change.  We have plenty of test cases to qualify code that is changing
> and if there aren't test cases for a functional area then it is fair game
> to change without worry, in my opinion.
> 
>> From a technical perspective, it is indirected to a sufficient level as to
>> be nearly incomprehensible to most people, with the source spread across
>> multiple different projects and repositories, yet somehow it remains
>> rigid / fragile enough to not support basic features like runtime (or even
>> post-compile) fan configuration for a given server.
> 
> There are two different fan control implementations presently:
>    - phosphor-pid-control (swampd)
>    - phosphor-fan-presence (phosphor-fan-control)
> 
> Which of these are you having issue with?  They are intended to serve
> drastically different purposes.
> 
> I don't think anyone outside of IBM uses phosphor-fan-control.  It seems
> to be explicitly designed for their systems with their own requirements.
> Unless they speak up, I don't know how we intend anyone else to use this
> code and it probably should be renamed 'ibm-fan-control'.
> 
>> What we need is a much simpler, more robust fan control daemon.  Ideally this
>> would be one self-contained process, not multiple interconnected processes
>> where a single failure causes the entire system to go into safe mode.
>> 
>> Our requirements:
>> 1.) True PID control with tunable constants.  Trying to do things with PWM/temp
>> maps alone may have made sense in the resource-constrained environments common
>> in the 1970s, but it makes no sense on modern, powerful BMC silicon with hard
>> floating point instructions.  Even the stock fan daemon implements a sort of
>> bespoke integrator-by-another-name, but without the P and D components it does
>> a terrible job outside of a constant-temperature datacenter environment.
> 
> Isn't phosphor-pid-control this already?

No.  It suffers from the exact same hardcoded YAML mess as above, with no tunability at runtime.

>> 2.) Tunable PID constants, tunable temperature thresholds, tunable min/max fan
>> speeds, and arbitrary linkage between temp inputs (zones) and fan outputs (also
>> zoned).
>> 3.) Configurable zones -- both temperature and PWMs, as well as installed / not
>> installed fans and temperature sensors.
> 
> I think these two features are the ones that are more interesting to
> non-datacenter use cases, and so nobody has put effort into them.  As much
> as you seem to dislike dbus mechanization, this sounds like we would
> need a few interfaces defined for these so that Redfish has something to
> poke at.
> 
> I do know some BIOS vendors provide this for desktops already.  Is there
> anything at an IPMI level that could facilitate the hand-off of this?
> 
>> 4.) Configurable failure behavior.  A single failed or uninstalled chassis fan
>> should NOT cause the entire platform to go into failsafe mode!
> 
> phosphor-fan-presence does provide some of this, but again, I feel like
> it is tuned to IBM's needs.  It appears that phosphor-pid-control has
> some amount of implementation of Control.FanRedundancy that I mentioned
> earlier.  Are you sure that phosphor-pid-control causes the system to go
> into fail-safe mode from a single fan failure though?  I've not heard
> this.

It's not pid-control, it's a separate monitor daemon and some shell scripts.  The "control" part of the fan control daemon stack is put into a failure mode IIRC if a problem is found, and that's a pretty coarse switch that has no ability to take into account the location of the problem or whether the other fans are able to take over with just a moderate speed increase.

>> 5.) A decent GUI to configure all of this, and the ability to export / import
>> the settings.
> 
> Sure...
> 
>> To be fair, we've only been able to implement 1, 2, 3, and 4 above at
>> compile time -- we don't have the runtime configuration options due to the
>> way the fan systems work in OpenBMC right now, and the sheer amount of
>> work needed to overhaul the GUI in the out-of-tree situation we remain
>> stuck in.  With all that said, however, we point out that our competition,
>> especially on x86 platforms, has all of these features and more, all neatly
>> contained in a nice user-friendly point+click GUI.  OpenBMC should be able
>> to easily match or exceed that functionality, but for some reason it seems
>> stuck in datacenter-only mode with archaic hardcoded tables and constants.
> 
> So, if you've done 1-4, are there any commits in Gerrit?  Which fan
> control daemon were they done against?

This is where the history of the projects starts to come into play.  We've re-implemented the same functionality several times as OpenBMC continues to churn; I think the last version required less work than before, but phosphor-fan-presence and phosphor-hwmon still needed our patches to enable the raw-mode PID loops.

> There is a certain air to what you wrote that rubs me the wrong way.
> We're not a product that you've paid for here to do what you, the
> customer, are asking of us.  This is an open source community and one
> that most of us are paid to work on by our employer.  We don't do work
> to make you happy, but do work because our bosses are asking for certain
> features out of us.  As I said above, almost everyone here is working on
> "datacenter-only systems", so why would anyone else invest in this use
> case?
> 
> This is *your* business model.  We'd certainly love to have contributions
> from your team and most of us would even spend some of our time to help
> you in your efforts.  But, if you want this, as they often say:
>                    "Patches Welcome!"

Don't mean to rub anyone here the wrong way.  The main reason I'm here now is that Raptor was recently called out for creating a more limited, contained system that better matches what we need in a BMC, and chastised for not fixing the problems in OpenBMC instead.  We have not yet decided how we are officially going to proceed with our future product lines, and are evaluating options.  Part of that evaluation involves me seeing what degree of acceptance or resistance there would be in OpenBMC to merging and maintaining patches that have no real use for other ODMs in the server-only solution space.  I think the OpenBMC stack is impressive in many areas, and would prefer to use it where practical, but at the same time when I see low-level design decisions like running 14 different processes all linked over DBUS just to read sensors and set corresponding fan speeds, I also realize it's simply not going to be practical to use it as-is on low-end BMC silicon.

At the end of the day, we're backed into a bit of a corner here for exactly the reasons you mention above.  I have to justify development costs in order for any large-scale (non-hobby / "gratis") development project to be approved, and that means cost in vs. results out is the primary factor.  If, for the same investment, we can create a smaller BMC solution for our desktop products that actually does what our customers need, while not providing 90% of the server features said desktop users won't use in the first place, that's the direction I'll be told to go vs. attempting to extend OpenBMC in a way that no one else needs or wants.

> I do see some code from early 2020 from you against phosphor-hwmon and
> phosphor-fan-presence.  All of the phosphor-hwmon commits failed CI testing,
> so nobody ever looked at them.  All of the phosphor-fan-presence commits
> received timely feedback, to which you never responded, and seemed to be
> missing an updated CCLA from Raptor?  Is there something we should have
> done as a community to keep this work going?

Good question!  As always, there's a bit of backstory on that as well, but really it came down to a bad combination of OpenBMC churn and reviewer delay causing a need for significant refactoring, and the needed developer resources being reassigned to more pressing projects on Raptor's side.  If the OpenBMC churn today is anywhere near what it was at that point, then the best advice I'd have is to merge as fast as possible and clean up any small issues in later commits -- this is where the fact that changing one functional item (fan control) requires coincident changes in multiple different repositories really hurts vs. a single source tree / single daemon that provides that specific function.

>> == Local firmware updates ==
>> 
>> This is right behind fan control in terms of cost and PR damage for us vs.
>> competing platforms.  While OpenBMC's firmware update support is very well
>> tuned for datacenter operations (we use a simple SSH + pflash method on our
>> large clusters, for example) it's absolutely terrible for desktop and
>> workstation applications where a second PC is not guaranteed to be available,
>> and where wired Ethernet even exists DHCP is either non-existent or provided by
>> a consumer cable box.  Some method of flashing -- and recovering -- the BMC and
>> host firmware right from the local machine is badly needed, especially for the
>> WiFi-only environments we're starting to see more of in the wild.  Ideally this
>> would be a command line tool / library such that we can integrate it with our
>> bootloader or a GUI as desired.
> 
> This sounds to me pretty easily obtainable and what I have in mind is
> actually a valid data center use case for many of us.  When all else
> fails, you should be able to use a USB key to update the system
> (assuming the image you're updating with is trusted for whatever your
> system determines is trust-worthy).  I'm pretty sure our OCP systems can
> be updated with a magic combination of a USB-key and an OCP debug
> card(*).  I don't think that is currently implemented on openbmc/openbmc,
> but it is on our list of pending features.
> 
> For your specific users, the OCP debug card is probably not a good
> requirement, but you could likely automate the update whenever a USB-key
> plus text file is added?  (I'm just brainstorming how you'd know to kick
> it off).  The current software update code probably isn't too far off
> from being able to facilitate this for you.
> 
> https://www.opencompute.org/documents/facebook-ocp-debug-card-with-lcd-spec_v1p0

At first glance, that's another overly complex solution for a simple problem that would cause a degraded user experience vs. other platforms.

We have an 800 MHz Linux-based computer with 512 MB of RAM, serial, and video out support already integrated into every one of our products.  It can receive data via PCIe and via USB from an active host.  Why isn't there a mechanism to send a signed container to it over one of these existing channels for self-update?

A potential user story looks like this:

=====

I want to update the firmware on my Blackbird desktop to fix a problem I'm having with a new control widget I've plugged in.  To make things more interesting, I'm on an oil rig in the Gulf, and the desktop only connects via intermittent WiFi.  Spare parts are weeks away, and I have next to no electronic diagnostic equipment available to me.  There's one or two USB ports I can normally use because I have administrative privileges, but I was able to grab the upgrade file over WiFi instead, saving myself some time cleaning accumulated gunk out of the ports.

I can update my <large vendor> standard PC firmware just by running a tool on Windows, but the Blackbird was selected because it controls a critical process that needed to be malware-resistant.

Fortunately, OpenBMC implemented a quality firmware update process.  I just need to launch a GUI tool with host administrative privileges, select the upgrade file, and queue an upgrade to happen when I reboot the machine.  I queue the update, start the reboot, and stick around to see the upgrade progress on the screen while it's booting back up.  Because I can see the status on the screen, I know what is happening and don't pull the power plug due to only seeing a black screen and power LED for 10 minutes.  Finally, the machine loads the OS and I verify the new control widget is working properly.

=====

Is there a technical / architectural reason this can't be done, or some other reason it's a bad idea?

>> == BMC boot time ==
>> 
>> This is self explanatory.  Other vendors' solutions allow the host to be powered
>> on within seconds of power application from the wall, and even our own Kestrel
>> soft BMC allows the host to begin booting less than 10 seconds after power is
>> applied.  Several *minutes* for OpenBMC to reach a point where it can even
>> start to boot the host is a major issue outside of datacenter applications.
> 
> Some of this is, to me, an artifact of the Power architecture and not an
> artifact of OpenBMC explicitly.  On x86 systems we have a little code in
> u-boot that wiggles a GPIO and gets the Host power sequence going while
> the BMC is booting up.  This overlaps quite a bit of the memory testing
> of the Host with the BMC boot time.  The "well-known proprietary BMC"
> also does this same trick.

I think we're talking about two different well-known proprietary BMCs, but that's not important for this discussion other than no, the one I have in mind doesn't resort to such tricks.  What it does do is start up its core services rapidly enough that this isn't a problem, and let the rest of the BMC stack start up at its own pace later on.
 
> Power requires the BMC to be up in order to serve out the virtual PNOR,
> from my recollection.  It seems like this could be solved in other ways,
> such as a SPI-mux on a physical SPI-NOR so that the BMC can take the NOR
> at specific times during update but otherwise it is given to the host
> CPUs.  This is exactly what we do on x86 systems.

Ouch.  So on x86 boxen you might actually have two "BMCs" -- the proprietary one inside the CPU that starts in seconds and provides base services like SPI Flash mapping to CPU address space, and the external OpenBMC one that can run in parallel without interfering with host start.  Adding a mux is then a hack needed on top, since you can't really communicate with the proprietary stack in the required manner.

For systems like POWER that lack the proprietary internal "BMC", I guess there are a few ways we could address the problem:

1.) Speed up OpenBMC load -- this sounds like it would end up being completely supported by one or two vendors alone, and subject to breakage from the other vendors that simply don't have any concerns around OpenBMC start time since their platforms aren't visibly affected by it.  It's also unlikely to come into the desired sub-10s range.

2.) Split the BMC into "essential" and "nice to have" services, much like the other platforms.  Painful, as it now requires even more parts on the mainboard.

3.) Keep the single BMC device, but split it into two software stacks: one that can load nearly instantly and start providing essential services, and another that can load more slowly.  This would effectively require two separate CPUs inside the BMC, which we actually do have in the AST2500.  I haven't done any digging, though, to see if the second CPU is powerful enough to implement the HIOMAP protocol at speed.

> Having said all of that, there is certainly some performance
> improvements that can be done, but nobody has taken up the torch on it.
> A big low-hanging fruit in my mind is the file system compression being
> xz or gzip is very computationally intensive.  I did some work, with
> Nick Terrell, to switch to zstd on our systems for both the kernel
> initramfs and UBI and saw significant boot time improvements.  The
> upstream enablement for this appears to have landed as of v5.9 so we
> could certainly start enabling it here now.
> 
> https://lore.kernel.org/linux-kbuild/20200730190841.2071656-7-nickrterrell@gmail.com/
> 
>> == Host boot status indications ==
>> 
>> Any ODM that makes server products has had to deal with the psychological "dead
>> server effect", where lack of visible progress during boot causes spurious
>> callouts / RMAs.  It's even worse on desktop, especially if server-type
>> hardware is used inside the machine.  We've worked around this a few times with
>> our "IPL observer" services, and really do need this functionality in OpenBMC.
>> The current version we have is both front panel lights and a progress bar on
>> the BMC boot monitor (VGA/HDMI), and this is something we're willing to
>> contribute upstream.
> 
> Great!  Let's get that merged!

Sounds good!  The files aren't too complex:

https://git.raptorcs.com/git/blackbird-skeleton/tree/pyiplobserver
https://git.raptorcs.com/git/blackbird-skeleton/tree/pyiplledmonitor

Is the skeleton repository the best place for a merge request?

> I do think some others have support for a 7-seg display with the
> postcodes going to it already.  I think this is along those same lines.
> It might just be another back-end for our existing post code daemon to
> replicate them to the VGA and/or blink morse code on an LED.

OK, so this is what we ran into before.  Where is this support in-tree, and do we need to reimplement our system to match what already exists (by extension, extending the other vendor code since our observer is more detailed in terms of status etc.), or would we be allowed to provide a competing solution to this other support, letting ODMs pick which one they wanted?

>> == IPMI / BMC permissions ==
>> 
>> An item that's come up recently is that, at least on our older OpenBMC versions,
>> there's a complete disconnect between the BMC's shell user database and the
>> IPMI user database.  Resetting the BMC root password isn't possible from IPMI
>> on the host, and setting up IPMI doesn't seem possible from the BMC shell.  If
>> IPMI support is something OpenBMC provides alongside Redfish, it needs to be
>> better integrated -- we're dealing with multiple locked-out BMC issues at the
>> moment at various customer sites, and the recovery method is painful at best
>> when it should be as simple as an ipmitool command from the host terminal.
> 
> I suspect most of this is a matter of IPMI command support and/or enabling
> those commands to the host IPMI path.  Most of us are fairly untrusting
> of IPMI (and the Host itself), so there hasn't been work to do anything
> here.  As long as whatever you're proposing can be disabled for models
> where we distrust the Host, it seems like these would be accepted as
> well.
> 
>> If there is interest, I'd suggest we all work on getting some semblance of a
>> modern fan control system and the boot status indication framework into
>> upstream OpenBMC.  This would allow Raptor to start upstreaming base support
>> for RCS product lines without risking severe regressions in user pain points
>> like noisy fans -- perceived high noise levels are always a great way to kill
>> sales of office products, and as a result the fan control functionality is
>> something we're quite sensitive about.  The main problem is that with the
>> existing fan control system's tentacles snaking everywhere including the UI,
>> this will need to be a concerted effort by multiple organizations including the
>> maintainers of the UI and the other ODMs currently using the existing fan
>> control functionality.  We're willing to make yet another attempt *if* there's
>> enough buy-in from the various stakeholders to ensure a prompt merge and update
>> of the other components.
> 
> This would be great.  Hopefully nothing I wrote here was too harsh or
> turned you off.  One piece of advice though...
> 
> Even if you find that some of the changes you propose are met with some
> resistance, it would be good to get your base system support upstreamed
> and continue to hold your extra sauce off on the side.  I know there
> has been complaints by some owners of Raptor hardware that they cannot
> use upstream code improvements on their own hardware because of the
> forked nature of your code base.  The way forward, to me, is to get
> your hardware configuration upstreamed first and work on these extra
> features separately.  If one of your customers wants to use upstream,
> with the caveat that they lose out on a few super awesome features, they
> can make that decision, but the important thing is that your machine
> doesn't get "left behind".

So this is where we run into an interesting intersection of perceptual issues surrounding POWER, marketing, and resistance to fixing the fan controls in particular.

At the end of the day, we *need* reliable fan control in-tree before we'll upstream the platform support.  POWER is still, rightly or wrongly, perceived as power hungry, inefficient, hot, and noisy.  Even a small percentage of users that load upstream for various fixes, only to have the system fans scream at them all the time, will severely damage our brand image.  I'm aware that an older OpenBMC tree is also causing some issues, but the perception is the fan control issue is more important given the specific headwinds surrounding POWER.  I simply don't have access to the resources required to break the deadlock; I've tried to make the case to the key decision makers but so far I've been met with stiff resistance.  This is in no small part due to the lack of results from previous attempts to get a workable fan control solution merged; the cost/benefit just keeps coming up as not something we want to throw funding at right now.

I hope this makes some sense, and thank you for the response!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: OpenBMC on RCS platforms
  2021-04-23 17:23 ` Ed Tanous
@ 2021-04-23 19:00   ` Timothy Pearson
  2021-04-23 19:23     ` Ed Tanous
  0 siblings, 1 reply; 11+ messages in thread
From: Timothy Pearson @ 2021-04-23 19:00 UTC (permalink / raw)
  To: Ed Tanous; +Cc: openbmc



----- Original Message -----
> From: "Ed Tanous" <ed@tanous.net>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "openbmc" <openbmc@lists.ozlabs.org>
> Sent: Friday, April 23, 2021 12:23:23 PM
> Subject: Re: OpenBMC on RCS platforms

> On Fri, Apr 23, 2021 at 7:36 AM Timothy Pearson
> <tpearson@raptorengineering.com> wrote:
>>
> 
> First off, this is great feedback, and despite some of my comments
> below, I do really appreciate you putting it out there.
> 
>> All,
>>
>> I'm reaching out after some internal discussion on how we can better integrate
>> our platforms with the OpenBMC project.  As many of you may know, we have been
>> using OpenBMC in our lineup of OpenPOWER-based server and desktop products,
>> with a number of custom patches on top to better serve our target markets.
>>
>> While we have had fairly good success with OpenBMC in the server / datacenter
>> space, reception has been lukewarm at best in the desktop space.  This is not
>> too surprising, given OpenBMC's historical focus on datacenter applications,
>> but it is also becoming an expensive technical and PR pain point for us as the
>> years go by.  To make matters worse, we're still shielding our desktop /
>> workstation customer base to some degree from certain design decisions that
>> persist in upstream OpenBMC, and we'd like to open discussion on all of these
>> topics to see if a resolution can be found with minimal wasted effort from all
>> sides.
>>
>> Roughly speaking, we see issues in OpenBMC in 5 main areas:
>>
>>
>> == Fan control ==
>>
>> Out of all of the various pain points we've dealt with over the years, this has
>> proven the most costly and is responsible on its own for the lack of RCS
>> platforms upstream in OpenBMC.
>>
>> To be perfectly frank, OpenBMC's current fan control subsystem is a technical
>> embarrassment, and not up to the high quality seen elsewhere in the project.
> 
> Which fan control subsystem are you referring to?  Phosphor-fans or
> phosphor-pid-control?
> 
>>  Worse, this multi-daemon DBUS-interconnected Rube Goldberg contraption has
>>  somehow managed to persist over the past 4+ years, likely because it reached a
>>  complexity level where it is both tightly integrated with the rest of the
>>  OpenBMC system and extremely difficult to understand, therefore it is equally
>>  difficult to replace.  Furthering the lack of progress is the fact that it is
>>  mostly "working" for datacenter applications, so there may be a "don't touch
>>  what isn't broken" mentality in play.
> 
> I'm not really sure I agree with that.  If someone came with a design
> for "We should replace dbus with X", had good technical foundations
> for why X was better, and was putting forward the monumental effort to
> do the work, I know that I personally wouldn't be opposed.  For the
> record, I agree with you about the complexity here, but most of the
> ideas I've heard to make it better were "Throw everything out and
> start over", which, if that's what you want to do, by all means do,
> but I don't think the community is willing to redo all of the untold
> hours of engineering effort spent over the years the project has
> existed.
> 
> FWIW, u-bmc was a project that took the existing kernel, threw out all
> the userspace and started over.  From my view outside the project,
> they seem to have failed to gain traction, and only support a couple
> of platforms.
> 
>>  From a technical perspective, it is indirected to a sufficient level as to be
>>  nearly incomprehensible to most people, with the source spread across multiple
>>  different projects and repositories, yet somehow it remains rigid / fragile
>>  enough to not support basic features like runtime (or even post-compile) fan
>>  configuration for a given server.
> 
> With respect, this statement is incorrect.  On an entity-manager
> enabled system + phosphor-pid-control, all of the fan control
> parameters are fully modifiable at runtime either from within the
> system (through dbus) or through Redfish out of band through the
> OEMManager API.  If you haven't ported your systems to entity-manager
> yet, there's quite a bit of people doing it at the moment and are
> discussing this stuff on discord basically every day that I'm sure
> would be able to give you some direction on where to start getting
> your systems moved over.

<snip>

Interesting.  I assume entity-manager is pretty new still?  A year ago there was zero solution to the problem of runtime configuration, and when I checked several weeks ago the bug report on it [1] had no meaningful progress.  Looks like that's finally changing.

Is the entity manager fairly stable API-wise at this point?  That might be enough of a game changer for me to go back and get approval for what will effectively be our fourth port of the Talos II systems to OpenBMC.

[1] https://github.com/openbmc/openbmc/issues/3595

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: OpenBMC on RCS platforms
  2021-04-23 19:00   ` Timothy Pearson
@ 2021-04-23 19:23     ` Ed Tanous
  0 siblings, 0 replies; 11+ messages in thread
From: Ed Tanous @ 2021-04-23 19:23 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: openbmc, Ed Tanous

On Fri, Apr 23, 2021 at 12:00 PM Timothy Pearson
<tpearson@raptorengineering.com> wrote:
>
>
>
> ----- Original Message -----
> > From: "Ed Tanous" <ed@tanous.net>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "openbmc" <openbmc@lists.ozlabs.org>
> > Sent: Friday, April 23, 2021 12:23:23 PM
> > Subject: Re: OpenBMC on RCS platforms
>
> > On Fri, Apr 23, 2021 at 7:36 AM Timothy Pearson
> > <tpearson@raptorengineering.com> wrote:
> >>
> >
> > First off, this is great feedback, and despite some of my comments
> > below, I do really appreciate you putting it out there.
> >
> >> All,
> >>
> >> I'm reaching out after some internal discussion on how we can better integrate
> >> our platforms with the OpenBMC project.  As many of you may know, we have been
> >> using OpenBMC in our lineup of OpenPOWER-based server and desktop products,
> >> with a number of custom patches on top to better serve our target markets.
> >>
> >> While we have had fairly good success with OpenBMC in the server / datacenter
> >> space, reception has been lukewarm at best in the desktop space.  This is not
> >> too surprising, given OpenBMC's historical focus on datacenter applications,
> >> but it is also becoming an expensive technical and PR pain point for us as the
> >> years go by.  To make matters worse, we're still shielding our desktop /
> >> workstation customer base to some degree from certain design decisions that
> >> persist in upstream OpenBMC, and we'd like to open discussion on all of these
> >> topics to see if a resolution can be found with minimal wasted effort from all
> >> sides.
> >>
> >> Roughly speaking, we see issues in OpenBMC in 5 main areas:
> >>
> >>
> >> == Fan control ==
> >>
> >> Out of all of the various pain points we've dealt with over the years, this has
> >> proven the most costly and is responsible on its own for the lack of RCS
> >> platforms upstream in OpenBMC.
> >>
> >> To be perfectly frank, OpenBMC's current fan control subsystem is a technical
> >> embarrassment, and not up to the high quality seen elsewhere in the project.
> >
> > Which fan control subsystem are you referring to?  Phosphor-fans or
> > phosphor-pid-control?
> >
> >>  Worse, this multi-daemon DBUS-interconnected Rube Goldberg contraption has
> >>  somehow managed to persist over the past 4+ years, likely because it reached a
> >>  complexity level where it is both tightly integrated with the rest of the
> >>  OpenBMC system and extremely difficult to understand, therefore it is equally
> >>  difficult to replace.  Furthering the lack of progress is the fact that it is
> >>  mostly "working" for datacenter applications, so there may be a "don't touch
> >>  what isn't broken" mentality in play.
> >
> > I'm not really sure I agree with that.  If someone came with a design
> > for "We should replace dbus with X", had good technical foundations
> > for why X was better, and was putting forward the monumental effort to
> > do the work, I know that I personally wouldn't be opposed.  For the
> > record, I agree with you about the complexity here, but most of the
> > ideas I've heard to make it better were "Throw everything out and
> > start over", which, if that's what you want to do, by all means do,
> > but I don't think the community is willing to redo all of the untold
> > hours of engineering effort spent over the years the project has
> > existed.
> >
> > FWIW, u-bmc was a project that took the existing kernel, threw out all
> > the userspace and started over.  From my view outside the project,
> > they seem to have failed to gain traction, and only support a couple
> > of platforms.
> >
> >>  From a technical perspective, it is indirected to a sufficient level as to be
> >>  nearly incomprehensible to most people, with the source spread across multiple
> >>  different projects and repositories, yet somehow it remains rigid / fragile
> >>  enough to not support basic features like runtime (or even post-compile) fan
> >>  configuration for a given server.
> >
> > With respect, this statement is incorrect.  On an entity-manager
> > enabled system + phosphor-pid-control, all of the fan control
> > parameters are fully modifiable at runtime either from within the
> > system (through dbus) or through Redfish out of band through the
> > OEMManager API.  If you haven't ported your systems to entity-manager
> > yet, there's quite a bit of people doing it at the moment and are
> > discussing this stuff on discord basically every day that I'm sure
> > would be able to give you some direction on where to start getting
> > your systems moved over.
>
> <snip>
>
> Interesting.  I assume entity-manager is pretty new still?

It's a couple years old at this point (I think first commit was in
2018?).  It has certainly gotten more momentum over time though.

>  A year ago there was zero solution to the problem of runtime configuration, and when I checked several weeks ago the bug report on it [1] had no meaningful progress.

Bug reports aren't generally the best way to get answers in my
experience, especially if it's not a "bug" but an enhancement you want
to make to the overall architecture. The mailing list or discord tends
to get better responses (as you've seen here in this thread).

For what it's worth, Redfish configurable PID loops were checked in
back in October of 2018, so about 2 and a half years old now.
https://github.com/openbmc/bmcweb/commit/af996fe4d12668d1a096e36e791c49690e54c9bb
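For reference, those knobs live under the OpenBmc OEM object on the
Manager resource.  A rough sketch of poking them with curl (the
controller and property names here are illustrative, from memory --
check the OemManager schema in bmcweb for the exact shape):

    # read the current fan/PID configuration
    curl -k -u admin:password https://bmc.example.com/redfish/v1/Managers/bmc

    # patch a single controller's proportional gain at runtime
    curl -k -u admin:password -X PATCH \
      -H "Content-Type: application/json" \
      -d '{"Oem":{"OpenBmc":{"Fan":{"FanControllers":{"Cpu_Fan":{"PCoefficient":-2.0}}}}}}' \
      https://bmc.example.com/redfish/v1/Managers/bmc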

>  Looks like that's finally changing.
>
> Is the entity manager fairly stable API-wise at this point?

While we do our best to not make backward incompatible configuration
changes (I can't think of any we've done yet) we don't guarantee it,
and certainly can't make any stability guarantees about code we can't
see.  The best way to keep your systems stable is to get them
upstreamed, so when we need to make "might break things" type changes,
we'll have a good idea if anyone is actually using the features in
question, and which systems we should ask maintainers to test changes
against.

More details under "Intent" heading item #3 here:
https://github.com/openbmc/entity-manager/blob/21608383661285e63e97c0457f55817f6e1d6b92/CONFIG_FORMAT.md

>  That might be enough of a game changer for me to go back and get approval for what will effectively be our fourth port of the Talos II systems to OpenBMC.

Glad to see you're interested.

>
> [1] https://github.com/openbmc/openbmc/issues/3595

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: OpenBMC on RCS platforms
  2021-04-23 18:46   ` Timothy Pearson
@ 2021-04-26 21:42     ` Milton Miller II
  2021-04-28 20:21       ` Timothy Pearson
  2021-04-29  7:54       ` OpenBMC on RCS platforms Milton Miller II
  0 siblings, 2 replies; 11+ messages in thread
From: Milton Miller II @ 2021-04-26 21:42 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: openbmc

Timothy Pearson <tpearson@raptorengineering.com> wrote:
>----- Original Message -----
>> From: "Patrick Williams" <patrick@stwcx.xyz>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "openbmc" <openbmc@lists.ozlabs.org>
>> Sent: Friday, April 23, 2021 12:11:26 PM
>> Subject: Re: OpenBMC on RCS platforms
>
>> On Fri, Apr 23, 2021 at 09:30:00AM -0500, Timothy Pearson wrote:
>>> All,
>>> 
>>> I'm reaching out after some internal discussion on how we can
>>> better integrate
>>> our platforms with the OpenBMC project.  As many of you may know,
>>> we have been
>>> using OpenBMC in our lineup of OpenPOWER-based server and desktop
>>> products,
>>> with a number of custom patches on top to better serve our target
>>> markets.
>> 
>> Hi Timothy,
>> 
>> Good to hear from your team again and hope there is some ways we
>> can
>> work together on solving some of these issues.
>> 
>>> Roughly speaking, we see issues in OpenBMC in 5 main areas:
>> 
>> We might want to fork this into 5 different discussion threads
>> and/or
>> design documents, but let's see how this goes...
>> 

[ some issues trimmed, including fan ]

>>> == Local firmware updates ==
>>> 
>>> This is right behind fan control in terms of cost and PR damage
>>> for us vs.
>>> competing platforms.  While OpenBMC's firmware update support is
>>> very well
>>> tuned for datacenter operations (we use a simple SSH + pflash
>>> method on our
>>> large clusters, for example) it's absolutely terrible for desktop
>>> and
>>> workstation applications where a second PC is not guaranteed to be
>>> available,
>>> and where wired Ethernet even exists DHCP is either non-existent
>>> or provided by
>>> a consumer cable box.  Some method of flashing -- and recovering
>>> -- the BMC and
>>> host firmware right from the local machine is badly needed,
>>> especially for the
>>> WiFi-only environments we're starting to see more of in the wild.
>>> Ideally this
>>> would be a command line tool / library such that we can integrate
>>> it with our
>>> bootloader or a GUI as desired.
>> 
>> This sounds to me pretty easily obtainable and what I have in mind
>> is
>> actually a valid data center use case for many of us.  When all
>> else
>> fails, you should be able to use a USB key to update the system
>> (assuming the image you're updating with is trusted for whatever
>> your
>> system determines is trust-worthy).  I'm pretty sure our OCP
>> systems can
>> be updated with a magic combination of a USB-key and an OCP debug
>> card(*).  I don't think that is currently implemented on
>> openbmc/openbmc,
>> but it is on our list of pending features.
>> 
>> For your specific users, the OCP debug card is probably not a good
>> requirement, but you could likely automate the update whenever a
>> USB-key
>> plus text file is added?  (I'm just brainstorming how you'd know to
>> kick
>> it off).  The current software update code probably isn't too far
>> off
>> from being able to facilitate this for you.
>> 
>> https://www.opencompute.org/documents/facebook-ocp-debug-card-with-lcd-spec_v1p0
>
>At first glance, that's another overly complex solution for a simple
>problem that would cause a degraded user experience vs. other
>platforms.
>

I have to agree -- both overly complex and probably not useful, in
that it's just a port interface for control.

>We have an 800Mhz Linux-based computer with 512MB of RAM, serial and
>video out support already integrated into every one of our products.
>It can receive data via PCIe and via USB from an active host.  Why
>isn't there a mechanism to send a signed container to it over one of
>these existing channels for self-update?
>
>A potential user story looks like this:
>
>=====
>
>I want to update the firmware on my Blackbird desktop to fix a
>problem I'm having with a new control widget I've plugged in.  To
>make things more interesting, I'm on an oil rig in the Gulf, and the
>desktop only connects via intermittent WiFi.  Spare parts are weeks
>away, and I have next to no electronic diagnostic equipment available
>to me.  There's one or two USB ports I can normally use because I
>have administrative privileges, but I was able to grab the upgrade
>file over WiFi instead, saving myself some time cleaning accumulated
>gunk out of the ports.
>
>I can update my <large vendor> standard PC firmware just by running a
>tool on Windows, but the Blackbird was selected because it controls a
>critical process that needed to be malware-resistant.
>
>Fortunately, OpenBMC implemented a quality firmware update process.
>I just need to launch a GUI tool with host administrative privileges,
>select the upgrade file, and queue an upgrade to happen when I reboot
>the machine.  I queue the update, start the reboot, and stick around
>to see the upgrade progress on the screen while it's booting back up.
> Because I can see the status on the screen, I know what is happening
>and don't pull the power plug due to only seeing a black screen and
>power LED for 10 minutes.  Finally, the machine loads the OS and I
>verify the new control widget is working properly.
>
>=====
>
>Is there a technical / architectural reason this can't be done, or
>some other reason it's a bad idea?
>

I ended up writing this twice or thrice.  Also, what I call
phosphor-initfs is actually the package obmc-phosphor-initfs.bb
found in meta-phosphor/recipes-phosphor/initrdscripts/.


There are two issues.  One is that there is no graphics
library or console code for the aspeed bmc (I understand a
text rendering library was added for boot monitoring).  But
if you are starting from the host up, then use the host to
drive the GUI and just establish a command session (network,
USB to host, or serial).

The biggest limitation is that we use squashfs as the file
system for space efficiency.  This is a read-only filesystem,
with references between its different pieces, that is loaded
and decompressed by the kernel on demand.  That means you
cannot be running from the copy in flash while trying to
update that same copy in the flash.

If you have space for two copies then you can update the
second copy while the primary is online.  This is supported
in the UBI and eMMC layouts upstream.

If you only have flash space for one copy then you have to
arrange for something more limited.  Either way you are
subject to bricking on an interrupted flash, unless you do
something exotic like repurpose the host chip as a backup
BMC during the process.  But if it's just the feedback you
need, the upstream code has help that isn't in the Redfish
flow.


====
Once

The "static" mtd layout with phosphor-initfs has support 
for both loading the static flash content into RAM, allowing 
the update to occur with full services running, and as  a 
backup on shutdown it will apply the update on bmc reboot 
by switching back to the initramfs and performing the flash 
from there.  The status of the later update is only visible 
on the console, which might be hidden on an internal serial 
cable by default.

Unfortunately the "prepare for update" method that was in 
the original update instructions and tells the BMC init 
"hey, load all this content into ram, so that you can write 
over the flash" got lost in the "we must be limited to what 
RedFish can support".  The code is still in the low level 
scripts but the fancy rest api is missing.  Also with the 
addition of code verification the actual flash progress 
was hidden.

The phosphor-initfs scripts also allow a new filesystem 
image to be downloaded over the network if you wish to test.
This doesn't have signature checking code, and it can be
disabled by build options.

All of the options to phosphor-initfs can be set by u-boot
environment variables (one of which is cleared by a systemd
unit each boot, and one that is not) and by the kernel
command line.
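
For example, something along these lines from the BMC shell
(variable and flag names from memory -- check obmc-init.sh in
your tree before relying on them):

    # one-shot options, cleared by a systemd unit on the next boot
    fw_setenv openbmconce "copy-files-to-ram copy-base-filesystem-to-ram"
    # persistent equivalent
    fw_setenv openbmcinit "copy-base-filesystem-to-ram"
    reboot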

Note: I highly suggest not to use image-bmc (for the whole
flash) as this erases the entire flash (although we try to
write back the u-boot environment), but instead use image-kernel, 
image-rofs, etc to allow the prior rwfs and u-boot to persist.
Some bad assertions may have migrated into the code-update 
rest endpoints and we should accept patches.

Bottom Line:

Put the BMC in maintenance mode and you can update the image
while the stack is running.  You can then use ssh to
display the flash progress.  If you need a fancy gui and
not the internal serial then use the host, or write the
rest of the graphics stack.

If you need the reliable backout then you need space for
a second image, even if it's smaller due to being emergency
services only.


PS:  There were some flashes we tried early that had
horrible erase times -- over 20 minutes for a full
erase.  Check the specs for the parts you provide vs.
others in the market; the better ones erase in a few
minutes.

PPS:  The reason we added UBI was its feature to use
the whole flash for wear leveling (minus the bootloader
that is outside the UBI partition).

=======================================
Twice: Going back to the scenario again

>I just need to launch a GUI tool with host administrative privileges,
>select the upgrade file, and queue an upgrade to happen when I reboot
>the machine.  I queue the update, start the reboot, and stick around
>to see the upgrade progress on the screen while it's booting back up.
> Because I can see the status on the screen, I know what is happening
>and don't pull the power plug due to only seeing a black screen and
>power LED for 10 minutes.  Finally, the machine loads the OS and I
>verify the new control widget is working properly.

If the gui is on the host, with today's stock phosphor-initfs, you need
1) a connection from the host to the bmc
   ethernet, serial, usb ethernet etc  
   (to copy files from host to BMC RAM and to monitor command output)

2) hardware ability to reboot bmc with host surviving
 - all userspace has to be replaced with those on the filesystem in RAM
 - can be shortened slightly by preloading the image in the BMC before
   shutting down services, if the current kernel is compatible.  This
   can be the old or new image.

 - or -
 
 Boot the host for GUI support with the BMC in an optimized
 update mode.

  This can be before or after the file is downloaded to the
  host.


3) Once the bmc is running from a squashfs in RAM (and, if you want
to clean the rwfs overlay, persist on clean reboot/shutdown mode),
the steps are (a rough shell sketch follows this list):

- copy the image to the bmc
- validate as required (preferably somewhere under /run)
- move image-rofs, kernel, etc. as needed to /run/initramfs
- /run/initramfs/update
    (which checks the fs is not obviously mounted, runs flashcp,
     which has status on stdout, moves files successfully written,
     and then writes selected overlay content back to rwfs)
- check the images were all written
- reboot
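
As a rough shell sketch of the above (image names and paths on the
host side are placeholders; /run/initramfs/update and flashcp are the
stock pieces):

    # on the host: get the new images onto the BMC over whatever
    # channel exists (network, USB ethernet, serial, ...)
    scp image-kernel image-rofs root@bmc:/run/

    # on the BMC, once it is running from the copy in RAM:
    cd /run
    # ... validate signatures / checksums here as required ...
    mv image-kernel image-rofs /run/initramfs/
    /run/initramfs/update      # runs flashcp; progress on stdout
    ls /run/initramfs/         # confirm the images were consumed
    reboot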

=================
Option Three:
This might be a better experience but needs some software work
to enable kexec on the 2500.   


Transfer the FS and kernel to the BMC RAM, and kexec the kernel
(note the patches on the list for the 2600 need testing, and maybe
a bit of coding for the 2500).  Optionally this can contain the virt
pnor image too.  After the BMC boots from the system in RAM, boot the
host from the vpnor image in RAM, then use the host to drive the GUI
to acknowledge and initiate the flash as desired.

The hooks are in phosphor-initfs to flash the image after the 
host is up, and to boot with the image in RAM.  

As an alternative to kexec, if the new file system supports the
old BMC kernel then the shutdown script can easily be edited to
restart the exec script with the images in /run.  Alternatively,
if the new kernel supports the old user space then it can be
flashed first, and on the next boot the prior case applies as
it is the updated kernel.  Note: I did this flow several times
in development but decided not to put code in the shutdown
script, because it's a script that is executed from /run/initramfs
and can easily be edited there when an alternative flow is required
(there are comments that show where to edit).
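
If kexec on the 2500 does get sorted out, option three is mechanically
just the following (standard kexec-tools syntax; the file names are
placeholders, and this is untested on the AST2500 per the caveats
above):

    # stage the new kernel/initrd in BMC RAM, then jump into them
    # without touching flash
    kexec -l /run/zImage --initrd=/run/obmc-initramfs.cpio.xz \
          --command-line="$(cat /proc/cmdline)"
    kexec -e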


>>> == BMC boot time ==
>>> 
>>> This is self explanatory.  Other vendors' solutions allow the host
>>> to be powered
>>> on within seconds of power application from the wall, and even our
>>> own Kestrel
>>> soft BMC allows the host to begin booting less than 10 seconds
>>> after power is
>>> applied.  Several *minutes* for OpenBMC to reach a point where it
>>> can even
>>> start to boot the host is a major issue outside of datacenter
>>> applications.
>> 
>> Some of this is, to me, an artifact of the Power architecture and
>> not an
>> artifact of OpenBMC explicitly.  On x86 systems we have a little
>> code in
>> u-boot that wiggles a GPIO and gets the Host power sequence going
>> while
>> the BMC is booting up.  This overlaps quite a bit of the memory
>> testing
>> of the Host with the BMC boot time.  The "well-known proprietary
>> BMC"
>> also does this same trick.
>
>I think we're talking about two different well know proprietary BMCs,
>but that's not important for this discussion other than no, the one I
>have in mind doesn't resort to such tricks.  What it does do is start
>up its core services rapidly enough where this isn't a problem, and
>lets the rest of the BMC stack start up at its own pace later on.
> 
>> Power requires the BMC to be up in order to serve out the virtual
>> PNOR,
>> from my recollection.  It seems like this could be solved in other
>> ways,
>> such as a SPI-mux on a physical SPI-NOR so that the BMC can take
>> the NOR
>> at specific times during update but otherwise it is given to the
>> host
>> CPUs.  This is exactly what we do on x86 systems.
>
>Ouch.  So on x86 boxen you might actually have two "BMCs" -- the
>proprietary one inside the CPU that starts in seconds and provides
>base services like SPI Flash mapping to CPU address space, and the
>external OpenBMC one that can run in parallel without interfering
>with host start.  Adding a mux is then a hack needed on top, since
>you can't really communicate with the proprietary stack in the
>required manner.
>

I'd say their cpu doesn't require the bmc to boot; it also means
they trust their system not to melt without bmc monitoring.

>For systems like POWER that lack the proprietary internal "BMC", I
>guess there are a few ways we could address the problem:
>
>1.) Speed up OpenBMC load -- this sounds like it would end up being
>completely supported by one or two vendors alone, and subject to
>breakage from the other vendors that simply don't have any concerns
>around OpenBMC start time since their platforms aren't visibly
>affected by it.  It's also unlikely to come into the desired sub-10s
>range.
>
>2.) Split the BMC into "essential" and "nice to have" services, much
>like the other platforms.  Painful, as it now requires even more
>parts on the mainboard.
>
>3.) Keep the single BMC device, but split it into two software
>stacks, one that can load nearly instantly and start providing
>essential services, and another than can load more slowly.  This
>would effectively require two separate CPUs inside the BMC, which we
>actually do have in the AST2500.  I haven't done any digging though
>to see if the second CPU is powerful enough to implement the HIOMAP
>protocol at speed.
>
>> Having said all of that, there is certainly some performance
>> improvements that can be done, but nobody has taken up the torch on
>> it.
>> A big low-hanging fruit in my mind is the file system compression
>> being
>> xz or gzip is very computationally intensive.  I did some work,
>> with
>> Nick Terrell, to switch to zstd on our systems for both the kernel
>> initramfs and UBI and saw significant boot time improvements.  The
>> upstream enablement for this appears to have landed as of v5.9 so
>> we
>> could certainly start enabling it here now.
>> 
>>
>> https://lore.kernel.org/linux-kbuild/20200730190841.2071656-7-nickrterrell@gmail.com/
>> 

In addition to compression options there are tradeoffs on how much is 
copied to ram vs how much is read from the flash possibly repeatedly.
If you add secure boot the time goes up.
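
On the kernel side, the zstd piece Patrick mentions above is just
Kconfig (the symbols below are standard upstream ones; the build
system plumbing in the meta layers is the part that needs real work):

    # enable zstd decompression for the initramfs and squashfs
    ./scripts/config --enable CONFIG_RD_ZSTD \
                     --enable CONFIG_SQUASHFS_ZSTD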

>>> == Host boot status indications ==
>>> 
>>> Any ODM that makes server products has had to deal with the
>>> psychological "dead
>>> server effect", where lack of visible progress during boot causes
>>> spurious
>>> callouts / RMAs.  It's even worse on desktop, especially if
>>> server-type
>>> hardware is used inside the machine.  We've worked around this a
>>> few times with
>>> our "IPL observer" services, and really do need this functionality
>>> in OpenBMC.
>>> The current version we have is both front panel lights and a
>>> progress bar on
>>> the BMC boot monitor (VGA/HDMI), and this is something we're
>>> willing to
>>> contribute upstream.
>> 
>> Great!  Let's get that merged!
>
>Sounds good!  The files aren't too complex:
>
>https://git.raptorcs.com/git/blackbird-skeleton/tree/pyiplobserver
>https://git.raptorcs.com/git/blackbird-skeleton/tree/pyiplledmonitor
>
>Is the skeleton repository the best place for a merge request?

hmm, as prototype code in python, maybe.   I don't think many current
systems ship python.  Also upstream Yocto removed all support for 
python 2.  

In addition I see a mix of "copy the data" and "transform the data"
in the same script, such as 

updateIPLLeds(self, initial_start, status_changed)

with 
            # Show major ISTEP on LED bank
            # On Talos we only have three LEDs plus a fourth indicator modification 
            # bit, but the major ISTEPs range from 2 to 21
            # Try to condense that down to something more readily displayable


[ After some thought, it's ok to be in the output code, as it's
formatting the data for the display. ]


The upstream post interface logs the post codes, and display is
a separate function.  The ipl_status_monitor seems to mix monitoring
the port 80 snoops with other logic to determine the system state,
e.g. is the host up?

Also, both scripts extensively use popen to handle device communication
and some communication to other services (kill to post code).


>
>> I do think some others have support for a 7-seg display with the
>> postcodes going to it already.  I think this is along those same
>> lines.
>> It might just be another back-end for our existing post code daemon
>> to
>> replicate them to the VGA and/or blink morse code on an LED.
>
>OK, so this is what we ran into before.  Where is this support
>in-tree, and do we need to reimplement our system to match what
>already exists (by extension, extending the other vendor code since
>our observer is more detailed in terms of status etc.), or would we
>be allowed to provide a competing solution to this other support,
>letting ODMs pick which one they wanted?
>

Our upstream code is at https://github.com/openbmc/phosphor-host-postd
for the snoop readers and the LED segment drivers, and the history 
and Dbus owner is https://github.com/openbmc/phosphor-post-code-manager.

To query the state of the host and bmc there is
https://github.com/openbmc/phosphor-state-manager/blob/master/obmcutil

In addition to phosphor-misc for "one file projects" there is 
openbmc-tools for handy tools which may be more developer focused.
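
If you just want a feel for what the snoop side produces before
wiring up a proper backend: on an ASPEED BMC the lpc-snoop driver
exposes the raw byte stream (assuming the snoop node is enabled in
your device tree), so something like this shows the host POST/ISTEP
codes as they arrive:

    # one hex byte per line as the host writes to port 0x80
    hexdump -v -e '1/1 "%02x\n"' /dev/aspeed-lpc-snoop0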

>>> == IPMI / BMC permissions ==
>>> 
>>> An item that's come up recently is that, at least on our older
>>> OpenBMC versions,
>>> there's a complete disconnect between the BMC's shell user
>>> database and the
>>> IPMI user database.  

Mostly true, in part because the IPMI password for RMCP+ must be
stored on the BMC (reversibly encrypted in our implementation).
Note that improper storage of this was the subject of one or more CVEs.

In addition it has a limit of 20 characters in a password and 8
users.

>>> Resetting the BMC root password isn't possible from IPMI
>>> on the host, and setting up IPMI doesn't seem possible from the
>>>>BMC shell.  If

In our current code we have pam hooks that save the password 
during a change, if the user is in the ipmi group and the 
password is short enough (or returns an error).
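
So on a current tree, the host-side recovery Timothy is asking for
should reduce to something like the following over the in-band path,
assuming those commands are enabled toward the host -- which is
exactly the policy question raised below:

    # from the host, over the in-band IPMI interface
    ipmitool user list 1
    ipmitool user set password 2 'NewPassw0rd'   # <= 20 characters
    ipmitool user enable 2
    ipmitool channel setaccess 1 2 ipmi=on privilege=4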

>>> IPMI support is something OpenBMC provides alongside Redfish, it
>>> needs to be
>>> better integrated -- we're dealing with multiple locked-out BMC
>>> issues at the
>>> moment at various customer sites, and the recovery method is
>>> painful at best
>>> when it should be as simple as an ipmitool command from the host
>>> terminal.
>> 
>> I suspect most of this is a matter of IPMI command support and/or
>> enabling
>> those commands to the host IPMI path.  Most of us are fairly
>> untrusting
>> of IPMI (and the Host itself), so there hasn't been work to do
>> anything
>> here.  As long as whatever you're proposing can be disabled for
>> models
>> where we distrust the Host, it seems like these would be accepted
>> as
>> well.


Our current Redfish has multiple users and can enable and 
disable users to have ipmi access and set their password.


Of course this just moves the goal posts to the Redfish
admin login.  But in addition to mTLS certificate-based
trust (which should be customized to the customer),
Redfish has the concept of host firmware and OS logins,
including a binding for EFI to specify the adapter path
and network, in addition to read-once magic EFI variables.
I know OpenPOWER boxes don't have EFI, but the information
could be exposed in a similar fashion.  As far as I know
we have not yet implemented these users in our Redfish
server.



Or designate a physical jumper to tell the BMC to install
a known password.  Where's that turbo button again? :-)

milton


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: OpenBMC on RCS platforms
  2021-04-26 21:42     ` Milton Miller II
@ 2021-04-28 20:21       ` Timothy Pearson
  2021-04-28 21:24         ` OpenBMC on RCS platforms - remote media Joseph Reynolds
  2021-04-29  7:54       ` OpenBMC on RCS platforms Milton Miller II
  1 sibling, 1 reply; 11+ messages in thread
From: Timothy Pearson @ 2021-04-28 20:21 UTC (permalink / raw)
  To: Milton Miller II; +Cc: openbmc



----- Original Message -----
> From: "Milton Miller II" <miltonm@us.ibm.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Patrick Williams" <patrick@stwcx.xyz>, "openbmc" <openbmc@lists.ozlabs.org>
> Sent: Monday, April 26, 2021 4:42:16 PM
> Subject: RE: OpenBMC on RCS platforms

[snip]

>>At first glance, that's another overly complex solution for a simple
>>problem that would cause a degraded user experience vs. other
>>platforms.
>>
> 
> I have to agree, both overly complex and probably not useful in that
> its just a port interface for control.
> 
>>We have an 800Mhz Linux-based computer with 512MB of RAM, serial and
>>video out support already integrated into every one of our products.
>>It can receive data via PCIe and via USB from an active host.  Why
>>isn't there a mechanism to send a signed container to it over one of
>>these existing channels for self-update?
>>
>>A potential user story looks like this:
>>
>>=====
>>
>>I want to update the firmware on my Blackbird desktop to fix a
>>problem I'm having with a new control widget I've plugged in.  To
>>make things more interesting, I'm on an oil rig in the Gulf, and the
>>desktop only connects via intermittent WiFi.  Spare parts are weeks
>>away, and I have next to no electronic diagnostic equipment available
>>to me.  There's one or two USB ports I can normally use because I
>>have administrative privileges, but I was able to grab the upgrade
>>file over WiFi instead, saving myself some time cleaning accumulated
>>gunk out of the ports.
>>
>>I can update my <large vendor> standard PC firmware just by running a
>>tool on Windows, but the Blackbird was selected because it controls a
>>critical process that needed to be malware-resistant.
>>
>>Fortunately, OpenBMC implemented a quality firmware update process.
>>I just need to launch a GUI tool with host administrative privileges,
>>select the upgrade file, and queue an upgrade to happen when I reboot
>>the machine.  I queue the update, start the reboot, and stick around
>>to see the upgrade progress on the screen while it's booting back up.
>> Because I can see the status on the screen, I know what is happening
>>and don't pull the power plug due to only seeing a black screen and
>>power LED for 10 minutes.  Finally, the machine loads the OS and I
>>verify the new control widget is working properly.
>>
>>=====
>>
>>Is there a technical / architectural reason this can't be done, or
>>some other reason it's a bad idea?
>>
> 
> I ended up writing this twice or thrice.  Also what I call
> phosphor-initfs is actually the package obmc-phosphor-initfs.bb
> found in meta-phosphor/recipies-phosphor/initrdscripts/.
> 
> 
> There are two issues.  One is that there is no graphics
> library or console code for the aspeed bmc.  I understand a
> text rendering library was added for boot monitoring). But
> if you are starting from the host up, then use the host to
> drive the GUI and just establish a command session (network,
> USB to host, or serial).
> 
> The biggest limitation is we use squashfs for file system
> for space efficency.  This is a read-only filesystem that
> contains references between different pieces that is loaded
> and decompressed by the kernel on demand.  That means you can
> not be running on the copy in flash while trying to update
> that copy in the flash.
> 
> If you have space for two copies then you can update the
> second copy while the primary is online.  This is supported
> in the UBI and eMMC layouts upstream.
> 
> If you only have flash space for one copy then you have to
> arrange for something more limited.  Either way you are
> subject to bricking on interrupted flash unless you do
> something exotic like repurpose the host chip as a backup
> BMC during the process.   But if its just the feedback
> then the upstream code has help that isn't in the Redfish
> flow.

Most of these systems also have a significant amount of RAM available, enough to hold both the update file and the existing BMC Flash contents while the system remains online.  Is there any way we could copy the existing Flash into RAM, then "pivot" the running system to use the copy in RAM as the backing store?

Bricking on power cut is, well, expected during a BMC update without a backup Flash chip.  Not cutting power during a low-level firmware update is, I think, still sufficiently ingrained in the average PC user's psyche not to be a significant issue, especially if several warnings are given before and during the update process about ensuring power is not cut.  Even if it is cut, the BMC Flash is socketed for a reason.

All that said, ideally, longer term, a recovery partition could be added to the Flash -- basically, a normal BMC update would only update the rofs partition, leaving u-boot, kernel, and the recovery partition alone.  The recovery partition would contain a very small userspace, just enough to accept some kind of network connection for e.g. TFTP upload of a new firmware (similar to how various embedded devices and even small PCs can be recovered).
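
For the network half of that recovery story, even something as plain as the standard u-boot flow would be a big improvement over what we can offer today (command names are stock u-boot; the addresses and file name here are made up for illustration):

    setenv ipaddr 192.168.0.10
    setenv serverip 192.168.0.1
    tftpboot 0x83000000 obmc-recovery.itb
    bootm 0x83000000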

> 
> ====
> Once
> 
> The "static" mtd layout with phosphor-initfs has support
> for both loading the static flash content into RAM, allowing
> the update to occur with full services running, and as  a
> backup on shutdown it will apply the update on bmc reboot
> by switching back to the initramfs and performing the flash
> from there.  The status of the later update is only visible
> on the console, which might be hidden on an internal serial
> cable by default.
> 
> Unfortunately the "prepare for update" method that was in
> the original update instructions and tells the BMC init
> "hey, load all this content into ram, so that you can write
> over the flash" got lost in the "we must be limited to what
> RedFish can support".  The code is still in the low level
> scripts but the fancy rest api is missing.  Also with the
> addition of code verification the actual flash progress
> was hidden.
> 
> The phosphor-initfs scripts also allow a new filesystem
> image to be downloaded over the network if you wish to test.
> This doesn't have signature checking code, and it can be
> disabled by build options.
> 
> All of the options to phosphor-initfs can be set by u-boot
> environment variables (one of which is cleared by a systemd
> unit each boot, on that is not) and by the kernel command
> line.
> 
> Note: I highly suggest not to use image-bmc (for the whole
> flash) as this erases the entire flash (although we try to
> write back the u-boot environment), but instead use image-kernel,
> image-rofs, etc to allow the prior rwfs and u-boot to persist.
> Some bad assertions may have migrated into the code-update
> rest endpoints and we should accept patches.
> 
> Bottom Line:
> 
> Put the BMC in maintence mode and you can update the image
> while the stack is running.  You can then use ssh to
> display the flash progress.  If you need a fancy gui and
> not the internal serial then use the host, or write the
> rest of the graphics stack.

That's all over the external network again, though.  The point is we want to do this from the host -- the host in general is unable to connect to the BMC when the BMC is piggybacking on a host network port (all of our products do this, and a lot of other vendors use the same design).

If we were assured of external BMC network access, updates become very simple.  In this kind of deployment though, there is no external network access to the BMC.

> If you need the reliable backout then you need space for
> a second image, even if its smaller due to being emergency
> servies only.
> 
> 
> PS:  There were some flashes we tried early that had
> horrible erase times -- over 20 minutes for a full
> erase.  Check the specs for the parts you provide vs
> others in the market, the better ones erase in a few
> minutes.

We use the better-specced ones for both BMC and PNOR.

> PPS:  The reason we added UBI was its feature to use
> the whole flash for wear leveling (minus the bootloader
> that is outside the UBI partition).
> 
> =======================================
> Twice: Going back to the scenerio again
> 
>>I just need to launch a GUI tool with host administrative privileges,
>>select the upgrade file, and queue an upgrade to happen when I reboot
>>the machine.  I queue the update, start the reboot, and stick around
>>to see the upgrade progress on the screen while it's booting back up.
>> Because I can see the status on the screen, I know what is happening
>>and don't pull the power plug due to only seeing a black screen and
>>power LED for 10 minutes.  Finally, the machine loads the OS and I
>>verify the new control widget is working properly.
> 
> If the gui is on the host, with todays stock phosphor-initfs, you need
> 1) a connection from the host to the bmc
>   ethernet, serial, usb ethernet etc
>   (to copy files from host to BMC RAM and to monitor command output)

Precisely.  USB would be an interesting control channel, but I don't think OpenBMC currently supports this kind of access?
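
To be clear about what I mean by USB as a channel: the ASPEED parts do have a device-mode USB controller, so at the kernel level this would just be a stock configfs gadget exposing e.g. a USB ethernet function to the host.  A sketch (generic Linux gadget API; whether the vhub/UDC is enabled in a given OpenBMC kernel and device tree is the open question):

    modprobe libcomposite
    cd /sys/kernel/config/usb_gadget
    mkdir bmc && cd bmc
    echo 0x1d6b > idVendor                 # Linux Foundation
    echo 0x0104 > idProduct                # multifunction composite gadget
    mkdir -p strings/0x409 configs/c.1 functions/ecm.usb0
    echo "OpenBMC" > strings/0x409/manufacturer
    ln -s functions/ecm.usb0 configs/c.1/
    ls /sys/class/udc                      # find the UDC name to bind
    echo "<udc-name-from-above>" > UDC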

> 2) hardware ability to reboot bmc with host surviving
> - all userspace has to be replaced with those on the filesystem in RAM
> - can be shortened slightly by preloading image in BMC before shuting
>   down services if the current kernel is compatible.  This can be the
>   old or new image.
> 
> - or -
> 
> Boot the host for GUI support with the BMC in an optimized
> update mode.
> 
>  This can be before or after the file is downloaded to the
>  host.
> 
> 
> 3) Once the bmc is running from a squashfs in RAM (and if you want
> to clean the rwfs overlay, persist on clean reboot/shutdown mode),
> 
> - copy the image to the bmc
> - validate as required (preferably somewhere under /run)
> - move imgage-rofs , kernel, etc as needed to /run/initramfs
> - /run/initramfs/update
>    (which checks the fs is not obviously mounted,
>     runs flashcp, which has status on stdout
>     moves files successfully written
>     and then writes selected overlay content back to rwfs
> - check the images were all written
> - reboot
> 
> =================
> Option Three:
> This might be a better experience but needs some software work
> to enable kexec on the 2500.
> 
> 
> Transfer the FS and kernel to the BMC RAM, and kexec the kernel
> (note patches on the list for 2600 need to test and maybe a bit of
> coding for the 2500).  Optionally this can contain the virt pnor
> image too.  After the BMC boots from the system in RAM boot the
> host from vpnor image in RAM then use the host to drive the GUI
> to acknoledge and initiate the flash as desired.
> 
> The hooks are in phosphor-initfs to flash the image after the
> host is up, and to boot with the image in RAM.
> 
> As an alternative to kexec, if the new file system supports the
> old BMC kernel then the shutdown script can easily be edited to
> restart the exec script with the images in /run.  Alternatively
> if the new kernel supports the old user space then it can be
> flashed first, then on the next boot the prior case applies as
> it is the updated kernel.  Note: I did this flow several times
> in development but decided not to put code in the shutdown
> script because it's a script that is executed from /run/initramfs
> and can easily be edited there when alternative flow is required.
> (there are comments that show where to edit).
> 
> 
>>>> == BMC boot time ==
>>>> 
>>>> This is self explanatory.  Other vendors' solutions allow the host
>>>> to be powered
>>>> on within seconds of power application from the wall, and even our
>>>> own Kestrel
>>>> soft BMC allows the host to begin booting less than 10 seconds
>>>> after power is
>>>> applied.  Several *minutes* for OpenBMC to reach a point where it
>>>> can even
>>>> start to boot the host is a major issue outside of datacenter
>>>> applications.
>>> 
>>> Some of this is, to me, an artifact of the Power architecture and
>>> not an
>>> artifact of OpenBMC explicitly.  On x86 systems we have a little
>>> code in
>>> u-boot that wiggles a GPIO and gets the Host power sequence going
>>> while
>>> the BMC is booting up.  This overlaps quite a bit of the memory
>>> testing
>>> of the Host with the BMC boot time.  The "well-known proprietary
>>> BMC"
>>> also does this same trick.
>>
>>I think we're talking about two different well-known proprietary BMCs,
>>but that's not important for this discussion other than no, the one I
>>have in mind doesn't resort to such tricks.  What it does do is start
>>up its core services rapidly enough where this isn't a problem, and
>>lets the rest of the BMC stack start up at its own pace later on.
>> 
>>> Power requires the BMC to be up in order to serve out the virtual
>>> PNOR,
>>> from my recollection.  It seems like this could be solved in other
>>> ways,
>>> such as a SPI-mux on a physical SPI-NOR so that the BMC can take
>>> the NOR
>>> at specific times during update but otherwise it is given to the
>>> host
>>> CPUs.  This is exactly what we do on x86 systems.
>>
>>Ouch.  So on x86 boxen you might actually have two "BMCs" -- the
>>proprietary one inside the CPU that starts in seconds and provides
>>base services like SPI Flash mapping to CPU address space, and the
>>external OpenBMC one that can run in parallel without interfering
>>with host start.  Adding a mux is then a hack needed on top, since
>>you can't really communicate with the proprietary stack in the
>>required manner.
>>
> 
> I'd say their CPU doesn't require the BMC to boot; it also means
> they trust their system not to melt without BMC monitoring.

I'd argue it's really a bit of semantics. :)  x86 systems have a sort of proto-BMC built right in to every single CPU, in the form of the ME/PSP and its associated firmware, that can provide various functions including (IIRC) thermal control.  On the ARM side, you're probably right, they're a bit more primitive in terms of just mapping Flash directly to the CPU address space on low end parts, though I think (?) the modern higher end parts are back to a sort of "security manager" BMC-analogue providing these basic services to the host CPU.

Regardless, POWER does stick out like a sore thumb for shoving these low level functions into the high level "full-stack" BMC.  Architecturally, it may not have been the best decision, but I do understand it sped time to market etc.   Fortunately, it's also something we can work to fix.

>>For systems like POWER that lack the proprietary internal "BMC", I
>>guess there are a few ways we could address the problem:
>>
>>1.) Speed up OpenBMC load -- this sounds like it would end up being
>>completely supported by one or two vendors alone, and subject to
>>breakage from the other vendors that simply don't have any concerns
>>around OpenBMC start time since their platforms aren't visibly
>>affected by it.  It's also unlikely to come into the desired sub-10s
>>range.
>>
>>2.) Split the BMC into "essential" and "nice to have" services, much
>>like the other platforms.  Painful, as it now requires even more
>>parts on the mainboard.
>>
>>3.) Keep the single BMC device, but split it into two software
>>stacks, one that can load nearly instantly and start providing
>>essential services, and another that can load more slowly.  This
>>would effectively require two separate CPUs inside the BMC, which we
>>actually do have in the AST2500.  I haven't done any digging though
>>to see if the second CPU is powerful enough to implement the HIOMAP
>>protocol at speed.
>>
>>> Having said all of that, there is certainly some performance
>>> improvements that can be done, but nobody has taken up the torch on
>>> it.
>>> A big low-hanging fruit in my mind is the file system compression
>>> being
>>> xz or gzip is very computationally intensive.  I did some work,
>>> with
>>> Nick Terrell, to switch to zstd on our systems for both the kernel
>>> initramfs and UBI and saw significant boot time improvements.  The
>>> upstream enablement for this appears to have landed as of v5.9 so
>>> we
>>> could certainly start enabling it here now.
>>> 
>>>
>>https://lore.kernel.org/linux-kbuild/20200730190841.2071656-7-nickrterrell@gmail.com/
>>> 
> 
> In addition to compression options there are tradeoffs on how much is
> copied to ram vs how much is read from the flash possibly repeatedly.
> If you add secure boot the time goes up.

Yeah, I'm really coming around to the idea that we need to embrace the split architecture every other system uses.  The LPC bridge and base power / fan controls really should be running independently on the ColdFire core, not on the main "full stack" BMC ARM core, and even for Kestrel we're exploring something similar (though in that case, it's mainly so that the host doesn't die if we accidentally crash the main CPU).

>>>> == Host boot status indications ==
>>>> 
>>>> Any ODM that makes server products has had to deal with the
>>>> psychological "dead
>>>> server effect", where lack of visible progress during boot causes
>>>> spurious
>>>> callouts / RMAs.  It's even worse on desktop, especially if
>>>> server-type
>>>> hardware is used inside the machine.  We've worked around this a
>>>> few times with
>>>> our "IPL observer" services, and really do need this functionality
>>>> in OpenBMC.
>>>> The current version we have is both front panel lights and a
>>>> progress bar on
>>>> the BMC boot monitor (VGA/HDMI), and this is something we're
>>>> willing to
>>>> contribute upstream.
>>> 
>>> Great!  Let's get that merged!
>>
>>Sounds good!  The files aren't too complex:
>>
>>[links mangled by a mail gateway; the referenced trees are
>> blackbird-skeleton/tree/pyiplobserver and
>> blackbird-skeleton/tree/pyiplledmonitor]
>>
>>Is the skeleton repository the best place for a merge request?
> 
> hmm, as prototype code in python, maybe.   I don't think many current
> systems ship python.  Also upstream Yocto removed all support for
> python 2.
> 
> In addition I see a mix of "copy the data" and "transform the data"
> in the same script, such as
> 
> updateIPLLeds(self, initial_start, status_changed)
> 
> with
>            # Show major ISTEP on LED bank
>            # On Talos we only have three LEDs plus a fourth indicator modification
>            # bit, but the major ISTEPs range from 2 to 21
>            # Try to condense that down to something more readily displayable
> 
> 
> [ After some thought, it's ok to be in the output code, as it's
> formatting the data for the display. ]
> 
> 
> The upstream post interface logs the post codes, and display is
> a separate function.  The ipl_status_monitor seems to mix monitoring
> the port 80 snoops with other logic to determine the system state,
> e.g. is the host up?
> 
> Also both scripts extensively use popen to handle device communication
> and some communication to other services (kill to post code).
> 
> 
>>
>>> I do think some others have support for a 7-seg display with the
>>> postcodes going to it already.  I think this is along those same
>>> lines.
>>> It might just be another back-end for our existing post code daemon
>>> to
>>> replicate them to the VGA and/or blink morse code on an LED.
>>
>>OK, so this is what we ran into before.  Where is this support
>>in-tree, and do we need to reimplement our system to match what
>>already exists (by extension, extending the other vendor code since
>>our observer is more detailed in terms of status etc.), or would we
>>be allowed to provide a competing solution to this other support,
>>letting ODMs pick which one they wanted?
>>
> 
> Our upstream code is at https://github.com/openbmc/phosphor-host-postd
> for the snoop readers and the LED segment drivers, and the history
> and Dbus owner is https://github.com/openbmc/phosphor-post-code-manager.
> 
> To catalog the source of the host and bmc there is
> https://github.com/openbmc/phosphor-state-manager/blob/master/obmcutil
> 
> In addition to phosphor-misc for "one file projects" there is
> openbmc-tools for handy tools which may be more developer focused.

So it sounds like we'd need to rewrite this as a set of patches for phosphor-post-code-manager?  Would they actually be merged or would we run into resistance to extending the functionality of that system for our use case?

>>>> == IPMI / BMC permissions ==
>>>> 
>>>> An item that's come up recently is that, at least on our older
>>>> OpenBMC versions,
>>>> there's a complete disconnect between the BMC's shell user
>>>> database and the
>>>> IPMI user database.
> 
> Mostly true, in part because the IPMI password for RMCP+ must be
> stored on the BMC (reversibly encrypted for our implementation).
> Note improper storage of this was an area of one or more CVEs.
> 
> In addition it has a limit of 20 characters in a password and 8
> users.
> 
>>>> Resetting the BMC root password isn't possible from IPMI
>>>> on the host, and setting up IPMI doesn't seem possible from the
>>>>>BMC shell.  If
> 
> In our current code we have pam hooks that save the password
> during a change, if the user is in the ipmi group and the
> password is short enough (or returns an error).
> 
>>>> IPMI support is something OpenBMC provides alongside Redfish, it
>>>> needs to be
>>>> better integrated -- we're dealing with multiple locked-out BMC
>>>> issues at the
>>>> moment at various customer sites, and the recovery method is
>>>> painful at best
>>>> when it should be as simple as an ipmitool command from the host
>>>> terminal.
>>> 
>>> I suspect most of this is a matter of IPMI command support and/or
>>> enabling
>>> those commands to the host IPMI path.  Most of us are fairly
>>> untrusting
>>> of IPMI (and the Host itself), so there hasn't been work to do
>>> anything
>>> here.  As long as whatever you're proposing can be disabled for
>>> models
>>> where we distrust the Host, it seems like these would be accepted
>>> as
>>> well.
> 
> 
> Our current Redfish has multiple users and can enable and
> disable users to have ipmi access and set their password.
> 
> 
> Of course this just moves the goal posts to the Redfish
> admin login, but in addition to mTLS certificate based
> trust (which should be customized to the customer),
> 
> Redfish has the concept of a host firmware and os logins
> including a binding for EFI to specify adapter path and
> network in addition to read-once magic efi variables.  I
> know OpenPOWER boxes don't have EFI but the information
> could be exposed in a similar fashion.  As far as I know
> we have not yet implemented these users in our Redfish
> server.

Honestly Redfish is something that we might just want to move to, and officially / formally drop network IPMI support.  Probably the biggest issue with that comes right back down to needing communication between the host and BMC, however -- ipmitool shortcuts the whole BMC/host network isolation problem (described above) by using the USB interface.  Is there a way to use Redfish over USB in a similar manner?

Thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: OpenBMC on RCS platforms - remote media
  2021-04-28 20:21       ` Timothy Pearson
@ 2021-04-28 21:24         ` Joseph Reynolds
  2021-06-03 12:29           ` Konstantin Klubnichkin
  0 siblings, 1 reply; 11+ messages in thread
From: Joseph Reynolds @ 2021-04-28 21:24 UTC (permalink / raw)
  To: Timothy Pearson, Milton Miller II; +Cc: openbmc

On 4/28/21 3:21 PM, Timothy Pearson wrote:
>
> ----- Original Message -----
>> From: "Milton Miller II" <miltonm@us.ibm.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "Patrick Williams" <patrick@stwcx.xyz>, "openbmc" <openbmc@lists.ozlabs.org>
>> Sent: Monday, April 26, 2021 4:42:16 PM
>> Subject: RE: OpenBMC on RCS platforms
> [snip]
>
>

...snip...

>>> I just need to launch a GUI tool with host administrative privileges,
>>> select the upgrade file, and queue an upgrade to happen when I reboot
>>> the machine.  I queue the update, start the reboot, and stick around
>>> to see the upgrade progress on the screen while it's booting back up.
>>> Because I can see the status on the screen, I know what is happening
>>> and don't pull the power plug due to only seeing a black screen and
>>> power LED for 10 minutes.  Finally, the machine loads the OS and I
>>> verify the new control widget is working properly.
>> If the gui is on the host, with today's stock phosphor-initfs, you need
>> 1) a connection from the host to the bmc
>>    ethernet, serial, usb ethernet etc
>>    (to copy files from host to BMC RAM and to monitor command output)
> Precisely.  USB would be an interesting control channel, but I don't think OpenBMC currently supports this kind of access?

If (if) I am following correctly, you want the OpenBMC virtual media 
(aka remote media) implementation?
https://github.com/openbmc/docs/blob/master/designs/virtual-media.md

Is there an implementation?  I didn't find one listed here:
https://github.com/openbmc/docs/blob/master/features.md

- Joseph


^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: OpenBMC on RCS platforms
  2021-04-26 21:42     ` Milton Miller II
  2021-04-28 20:21       ` Timothy Pearson
@ 2021-04-29  7:54       ` Milton Miller II
  1 sibling, 0 replies; 11+ messages in thread
From: Milton Miller II @ 2021-04-29  7:54 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: openbmc



-----Timothy Pearson <tpearson@raptorengineering.com> wrote: -----

>To: Milton Miller II <miltonm@us.ibm.com>
>From: Timothy Pearson <tpearson@raptorengineering.com>
>Date: 04/28/2021 03:22PM
>Cc: Patrick Williams <patrick@stwcx.xyz>, openbmc
><openbmc@lists.ozlabs.org>
>Subject: [EXTERNAL] Re: OpenBMC on RCS platforms
>
>
>----- Original Message -----
>> From: "Milton Miller II" <miltonm@us.ibm.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "Patrick Williams" <patrick@stwcx.xyz>, "openbmc"
><openbmc@lists.ozlabs.org>
>> Sent: Monday, April 26, 2021 4:42:16 PM
>> Subject: RE: OpenBMC on RCS platforms
>
>[snip]
>
>>>At first glance, that's another overly complex solution for a
>>>simple
>>>problem that would cause a degraded user experience vs. other
>>>platforms.
>>>
>> 
>> I have to agree, both overly complex and probably not useful in
>> that
>> it's just a port interface for control.
>> 
>>>We have an 800Mhz Linux-based computer with 512MB of RAM, serial
>>>and
>>>video out support already integrated into every one of our
>>>products.
>>>It can receive data via PCIe and via USB from an active host.  Why
>>>isn't there a mechanism to send a signed container to it over one
>>>of
>>>these existing channels for self-update?
>>>
>>>A potential user story looks like this:
>>>
>>>=====
>>>
>>>I want to update the firmware on my Blackbird desktop to fix a
>>>problem I'm having with a new control widget I've plugged in.  To
>>>make things more interesting, I'm on an oil rig in the Gulf, and
>>>the
>>>desktop only connects via intermittent WiFi.  Spare parts are weeks
>>>away, and I have next to no electronic diagnostic equipment
>>>available
>>>to me.  There's one or two USB ports I can normally use because I
>>>have administrative privileges, but I was able to grab the upgrade
>>>file over WiFi instead, saving myself some time cleaning
>>>accumulated
>>>gunk out of the ports.
>>>
>>>I can update my <large vendor> standard PC firmware just by running
>>>a
>>>tool on Windows, but the Blackbird was selected because it controls
>>>a
>>>critical process that needed to be malware-resistant.
>>>
>>>Fortunately, OpenBMC implemented a quality firmware update process.
>>>I just need to launch a GUI tool with host administrative
>>>privileges,
>>>select the upgrade file, and queue an upgrade to happen when I
>>>reboot
>>>the machine.  I queue the update, start the reboot, and stick
>>>around
>>>to see the upgrade progress on the screen while it's booting back
>>>up.
>>> Because I can see the status on the screen, I know what is
>>>happening
>>>and don't pull the power plug due to only seeing a black screen and
>>>power LED for 10 minutes.  Finally, the machine loads the OS and I
>>>verify the new control widget is working properly.
>>>
>>>=====
>>>
>>>Is there a technical / architectural reason this can't be done, or
>>>some other reason it's a bad idea?
>>>
>> 
>> I ended up writing this twice or thrice.  Also what I call
>> phosphor-initfs is actually the package obmc-phosphor-initfs.bb
>> found in meta-phosphor/recipes-phosphor/initrdscripts/.
>> 
>> 
>> There are two issues.  One is that there is no graphics
>> library or console code for the aspeed bmc.  (I understand a
>> text rendering library was added for boot monitoring.)  But
>> if you are starting from the host up, then use the host to
>> drive the GUI and just establish a command session (network,
>> USB to host, or serial).
>> 
>> The biggest limitation is we use squashfs as the file system
>> for space efficiency.  This is a read-only filesystem that
>> contains references between different pieces that is loaded
>> and decompressed by the kernel on demand.  That means you can
>> not be running on the copy in flash while trying to update
>> that copy in the flash.
>> 
>> If you have space for two copies then you can update the
>> second copy while the primary is online.  This is supported
>> in the UBI and eMMC layouts upstream.
>> 
>> If you only have flash space for one copy then you have to
>> arrange for something more limited.  Either way you are
>> subject to bricking on interrupted flash unless you do
>> something exotic like repurpose the host chip as a backup
>> BMC during the process.   But if it's just the feedback
>> then the upstream code has help that isn't in the Redfish
>> flow.
>
>Most of these systems also have a significant amount of RAM
>available, enough to hold both the update file and the existing BMC
>Flash contents while the system remains online.  Is there any way we
>could copy the existing Flash into RAM, then "pivot" the running
>system to use the copy in RAM as the backing store?

[See also the Thrice description ...]

There is no version of filesystem that I am aware of that 
says "instead of using layer x, start using layer y that 
will have the same content".

The existing init script has a config option to copy the 
contents from the flash to RAM then loop mount the file.  
Of course this will likely increase the boot time because 
all content has to be copied from the flash before starting 
any userspace from the volume.   Also the copy uses all
space allocated to the rofs layer; it is not smart enough to
only copy the length of the squashfs contents even though 
that is in the filesystem header.

Thinking a bit this evening, squashfs uses a block device 
for storage so one could use DM to create a 1-member 
degraded raid1 on the mtdblock device, and add a ramdisk 
block drive (rd) as the mirror.  The ramdisk can be added 
as a degraded volume after boot to avoid having the kernel
spending time copying the data instead of starting the real
userspace.  After the rd copy is synced, one could remove 
the mtdblock volume from the raid1.

This requires access to dm-tools to set up the raid unless 
the in-kernel raid metadata would work on an mtdblock 
volume.   The md layer probably wants to update the 
superblock of the good volume or something.
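
Roughly, and untested, that flow might look like the following
(assuming mdadm rather than raw dm-tools is available on the BMC,
that the rofs sits on /dev/mtdblock4, and that the brd ramdisk is
sized to match -- all names and sizes here are examples, not the
actual layout):

  # RAM-backed block device large enough to hold the rofs copy
  modprobe brd rd_nr=1 rd_size=65536   # 64 MiB, size to the partition
  # superblock-less raid1 with the flash copy as the only live member
  mdadm --build /dev/md0 --level=1 --raid-devices=2 /dev/mtdblock4 missing
  mkdir -p /mnt/rofs
  mount -t squashfs -o ro /dev/md0 /mnt/rofs   # in practice, the root mount
  # later, attach the ramdisk and let md mirror the data in the background
  mdadm /dev/md0 --add /dev/ram0
  # once resync completes, drop the flash member so it can be erased/written
  mdadm /dev/md0 --fail /dev/mtdblock4
  mdadm /dev/md0 --remove /dev/mtdblock4

--build makes a legacy array with no on-disk metadata, which
sidesteps the superblock question above, at the cost of the
kernel not remembering the array across reboots.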

The above got the rofs, but didn't address the rwfs.  We
use jffs2 today.  While there are other options, the mtd
writable filesystems understand the large erase blocks but
the general block device file systems do not.

The existing init does have an option to copy designated 
files from the host to a tmpfs, and from the tmpfs back 
to the rwfs.  It also can erase the rwfs partition.  This
can be used for freeing the rwfs space during the firmware
update but on an abnormal shutdown the updates to the
rwfs are lost, be they logs or configuration updates.


>
>Bricking on power cut is, well, expected during a BMC update without
>a backup Flash chip.  Not cutting power during a low level firmware
>update is I think still ingrained sufficiently in the average PC
>user's psyche not to be a significant issue, especially if several
>warnings are given before and during the update process regarding
>ensuring power is not cut.  Even if it is cut, the BMC Flash is
>socketed for a reason.
>
>All that said, ideally, longer term, a recovery partition could be
>added to the Flash -- basically, a normal BMC update would only
>update the rofs partition, leaving u-boot, kernel, and the recovery
>partition alone.  The recovery partition would contain a very small
>userspace, just enough to accept some kind of network connection for
>e.g. TFTP upload of a new firmware (similar to how various embedded
>devices and even small PCs can be recovered).
>
>> 
>> ====
>> Once
>> 
>> The "static" mtd layout with phosphor-initfs has support
>> for both loading the static flash content into RAM, allowing
>> the update to occur with full services running, and as  a
>> backup on shutdown it will apply the update on bmc reboot
>> by switching back to the initramfs and performing the flash
>> from there.  The status of the later update is only visible
>> on the console, which might be hidden on an internal serial
>> cable by default.
>> 
>> Unfortunately the "prepare for update" method that was in
>> the original update instructions and tells the BMC init
>> "hey, load all this content into ram, so that you can write
>> over the flash" got lost in the "we must be limited to what
>> RedFish can support".  The code is still in the low level
>> scripts but the fancy rest api is missing.  Also with the
>> addition of code verification the actual flash progress
>> was hidden.
>> 
>> The phosphor-initfs scripts also allow a new filesystem
>> image to be downloaded over the network if you wish to test.
>> This doesn't have signature checking code, and it can be
>> disabled by build options.
>> 
>> All of the options to phosphor-initfs can be set by u-boot
>> environment variables (one of which is cleared by a systemd
>> unit each boot, and one that is not) and by the kernel command
>> line.
>> 
>> Note: I highly suggest not to use image-bmc (for the whole
>> flash) as this erases the entire flash (although we try to
>> write back the u-boot environment), but instead use image-kernel,
>> image-rofs, etc to allow the prior rwfs and u-boot to persist.
>> Some bad assertions may have migrated into the code-update
>> rest endpoints and we should accept patches.
>> 
>> Bottom Line:
>> 
>> Put the BMC in maintenance mode and you can update the image
>> while the stack is running.  You can then use ssh to
>> display the flash progress.  If you need a fancy gui and
>> not the internal serial then use the host, or write the
>> rest of the graphics stack.
>
>That's all over external network again, though.  Point is we want to
>do this from the host -- the host in general is unable to connect to
>the BMC when the BMC is piggybacking on a host network port (all of
>our products do this, and a lot of other vendors use the same
>design).

Well, Intel i210 has a bmc controlled mode to control if the host
can see the network, the bmc, or both.   However, it also allows 
the bmc to redirect any traffic to itself, so that is another can
of worms.

Point is, can your customized firmware add BMC to Host networking?

>
>If we were assured of external BMC network access, updates become
>very simple.  In this kind of deployment though, there is no external
>network access to the BMC.
>
>> If you need the reliable backout then you need space for
>> a second image, even if it's smaller due to being emergency
>> services only.
>> 
>> 
>> PS:  There were some flashes we tried early that had
>> horrible erase times -- over 20 minutes for a full
>> erase.  Check the specs for the parts you provide vs
>> others in the market, the better ones erase in a few
>> minutes.
>
>We use the better-specced ones for both BMC and PNOR.
>
>> PPS:  The reason we added UBI was its feature to use
>> the whole flash for wear leveling (minus the bootloader
>> that is outside the UBI partition).
>> 
>> =======================================
>> Twice: Going back to the scenario again
>> 
>>>I just need to launch a GUI tool with host administrative
>privileges,
>>>select the upgrade file, and queue an upgrade to happen when I
>reboot
>>>the machine.  I queue the update, start the reboot, and stick
>around
>>>to see the upgrade progress on the screen while it's booting back
>up.
>>> Because I can see the status on the screen, I know what is
>happening
>>>and don't pull the power plug due to only seeing a black screen and
>>>power LED for 10 minutes.  Finally, the machine loads the OS and I
>>>verify the new control widget is working properly.
>> 
>> If the gui is on the host, with today's stock phosphor-initfs, you
>need
>> 1) a connection from the host to the bmc
>>   ethernet, serial, usb ethernet etc
>>   (to copy files from host to BMC RAM and to monitor command
>output)
>
>Precisely.  USB would be an interesting control channel, but I don't
>think OpenBMC currently supports this kind of access?
>

Actually the current usb-ctrl script has an option to configure the
ECM gadget, and there are patches to update the script to use 
defined MAC addresses.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/gadget/Kconfig?h=v5.12#n281
https://gerrit.openbmc-project.xyz/c/openbmc/phosphor-misc/+/42280/
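
For reference, the configfs steps such a script performs boil down
to something like this (gadget name, VID/PID and MAC addresses below
are placeholders, not what usb-ctrl actually ships):

  mount -t configfs none /sys/kernel/config 2>/dev/null
  cd /sys/kernel/config/usb_gadget
  mkdir -p bmc-ecm && cd bmc-ecm
  echo 0x1d6b > idVendor                 # example vendor ID
  echo 0x0104 > idProduct                # example product ID
  mkdir -p strings/0x409 functions/ecm.usb0 configs/c.1
  echo OpenBMC > strings/0x409/manufacturer
  echo 02:00:00:00:00:01 > functions/ecm.usb0/dev_addr   # BMC-side MAC
  echo 02:00:00:00:00:02 > functions/ecm.usb0/host_addr  # host-side MAC
  ln -s functions/ecm.usb0 configs/c.1/
  ls /sys/class/udc | head -n1 > UDC     # bind to the first available UDC

Once bound, the BMC sees a usb0 network interface and the host sees
a CDC-ECM NIC, so the same scp/REST/Redfish traffic discussed above
can run over that link without touching the shared NIC.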

>> 2) hardware ability to reboot bmc with host surviving
>> - all userspace has to be replaced with those on the filesystem in
>> RAM
>> - can be shortened slightly by preloading image in BMC before
>> shutting
>>   down services if the current kernel is compatible.  This can be
>> the
>>   old or new image.
>> 
>> - or -
>> 
>> Boot the host for GUI support with the BMC in an optimized
>> update mode.
>> 
>>  This can be before or after the file is downloaded to the
>>  host.
>> 
>> 
>> 3) Once the bmc is running from a squashfs in RAM (and if you want
>> to clean the rwfs overlay, persist on clean reboot/shutdown mode),
>> 
>> - copy the image to the bmc
>> - validate as required (preferably somewhere under /run)
>> - move image-rofs, kernel, etc. as needed to /run/initramfs
>> - /run/initramfs/update
>>    (which checks the fs is not obviously mounted,
>>     runs flashcp, which has status on stdout,
>>     moves files successfully written,
>>     and then writes selected overlay content back to rwfs)
>> - check the images were all written
>> - reboot
>> 
>> =================
>> Option Three:
>> This might be a better experience but needs some software work
>> to enable kexec on the 2500.
>> 
>> 
>> Transfer the FS and kernel to the BMC RAM, and kexec the kernel
>> (note patches on the list for 2600 need to test and maybe a bit of
>> coding for the 2500).  Optionally this can contain the virt pnor
>> image too.  After the BMC boots from the system in RAM boot the
>> host from vpnor image in RAM then use the host to drive the GUI
>> to acknowledge and initiate the flash as desired.
>> 
>> The hooks are in phosphor-initfs to flash the image after the
>> host is up, and to boot with the image in RAM.
>> 
>> As an alternative to kexec, if the new file system supports the
>> old BMC kernel then the shutdown script can easily be edited to
>> restart the exec script with the images in /run.  Alternatively
>> if the new kernel supports the old user space then it can be
>> flashed first, then on the next boot the prior case applies as
>> it is the updated kernel.  Note: I did this flow several times
>> in development but decided not to put code in the shutdown
>> script because it's a script that is executed from /run/initramfs
>> and can easily be edited there when alternative flow is required.
>> (there are comments that show where to edit).
>> 
>> 
>>>>> == BMC boot time ==
>>>>> 
>>>>> This is self explanatory.  Other vendors' solutions allow the
>host
>>>>> to be powered
>>>>> on within seconds of power application from the wall, and even
>>>>> our
>>>>> own Kestrel
>>>>> soft BMC allows the host to begin booting less than 10 seconds
>>>>> after power is
>>>>> applied.  Several *minutes* for OpenBMC to reach a point where
>>>>> it
>>>>> can even
>>>>> start to boot the host is a major issue outside of datacenter
>>>>> applications.
>>>> 
>>>> Some of this is, to me, an artifact of the Power architecture and
>>>> not an
>>>> artifact of OpenBMC explicitly.  On x86 systems we have a little
>>>> code in
>>>> u-boot that wiggles a GPIO and gets the Host power sequence going
>>>> while
>>>> the BMC is booting up.  This overlaps quite a bit of the memory
>>>> testing
>>>> of the Host with the BMC boot time.  The "well-known proprietary
>>>> BMC"
>>>> also does this same trick.
>>>
>>>I think we're talking about two different well-known proprietary
>>>BMCs,
>>>but that's not important for this discussion other than no, the one
>>>I
>>>have in mind doesn't resort to such tricks.  What it does do is
>>>start
>>>up its core services rapidly enough where this isn't a problem, and
>>>lets the rest of the BMC stack start up at its own pace later on.
>>> 
>>>> Power requires the BMC to be up in order to serve out the virtual
>>>> PNOR,
>>>> from my recollection.  It seems like this could be solved in
>>>> other
>>>> ways,
>>>> such as a SPI-mux on a physical SPI-NOR so that the BMC can take
>>>> the NOR
>>>> at specific times during update but otherwise it is given to the
>>>> host
>>>> CPUs.  This is exactly what we do on x86 systems.
>>>
>>>Ouch.  So on x86 boxen you might actually have two "BMCs" -- the
>>>proprietary one inside the CPU that starts in seconds and provides
>>>base services like SPI Flash mapping to CPU address space, and the
>>>external OpenBMC one that can run in parallel without interfering
>>>with host start.  Adding a mux is then a hack needed on top, since
>>>you can't really communicate with the proprietary stack in the
>>>required manner.
>>>
>> 
>> I'd say their CPU doesn't require the BMC to boot; it also means
>> they trust their system not to melt without BMC monitoring.
>
>I'd argue it's really a bit of semantics. :)  x86 systems have a sort
>of proto-BMC built right in to every single CPU, in the form of the
>ME/PSP and its associated firmware, that can provide various
>functions including (IIRC) thermal control.  On the ARM side, you're
>probably right, they're a bit more primitive in terms of just mapping
>Flash directly to the CPU address space on low end parts, though I
>think (?) the modern higher end parts are back to a sort of "security
>manager" BMC-analogue providing these basic services to the host CPU.
>
>Regardless, POWER does stick out like a sore thumb for shoving these
>low level functions into the high level "full-stack" BMC.
>Architecturally, it may not have been the best decision, but I do
>understand it sped time to market etc.   Fortunately, it's also
>something we can work to fix.

Hostboot can probably boot a decent way up with just a readonly
mapping of the flash.  Either copy the image to RAM, or just pass the
ioctl through to the flash chip if the PNOR flash holds the full image.


>
>>>For systems like POWER that lack the proprietary internal "BMC", I
>>>guess there are a few ways we could address the problem:
>>>
>>>1.) Speed up OpenBMC load -- this sounds like it would end up being
>>>completely supported by one or two vendors alone, and subject to
>>>breakage from the other vendors that simply don't have any concerns
>>>around OpenBMC start time since their platforms aren't visibly
>>>affected by it.  It's also unlikely to come into the desired
>>>>sub-10s
>>>range.
>>>
>>>2.) Split the BMC into "essential" and "nice to have" services,
>>>>much
>>>like the other platforms.  Painful, as it now requires even more
>>>parts on the mainboard.
>>>
>>>3.) Keep the single BMC device, but split it into two software
>>>stacks, one that can load nearly instantly and start providing
>>>essential services, and another that can load more slowly.  This
>>>would effectively require two separate CPUs inside the BMC, which
>>>we
>>>actually do have in the AST2500.  I haven't done any digging though
>>>to see if the second CPU is powerful enough to implement the HIOMAP
>>>protocol at speed.
>>>
>>>> Having said all of that, there is certainly some performance
>>>> improvements that can be done, but nobody has taken up the torch
>>>> on
>>>> it.
>>>> A big low-hanging fruit in my mind is the file system compression
>>>> being
>>>> xz or gzip is very computationally intensive.  I did some work,
>>>> with
>>>> Nick Terrell, to switch to zstd on our systems for both the
>>>> kernel
>>>> initramfs and UBI and saw significant boot time improvements.
>>>> The
>>>> upstream enablement for this appears to have landed as of v5.9 so
>>>> we
>>>> could certainly start enabling it here now.
>>>> 
>>>>
>>>https://lore.kernel.org/linux-kbuild/20200730190841.2071656-7-nickrterrell@gmail.com/
>>>> 
>> 
>> In addition to compression options there are tradeoffs on how much
>> is
>> copied to ram vs how much is read from the flash possibly
>> repeatedly.
>> If you add secure boot the time goes up.
>
>Yeah, I'm really coming around to the idea that we need to embrace
>the split architecture every other system uses.  The LPC bridge and
>base power / fan controls really should be running independently on
>the ColdFire core, not on the main "full stack" BMC ARM core, and
>even for Kestrel we're exploring something similar (though in that
>case, it's mainly so that the host doesn't die if we accidentally
>crash the main CPU).

Have you looked at starting hiomap early, and telling hostboot to
assume the whole image is there until you need to write?

Can you get by with a fixed fan speed through memory post, where 
hostboot is running on a single core?
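
(If a fixed speed is enough, that can be as crude as poking the PWM
outputs from the BMC shell until the real fan daemon takes over --
the hwmon index and duty cycle here are examples, not a recommendation
for any specific board:)

  for pwm in /sys/class/hwmon/hwmon0/pwm[1-8]; do
      [ -w "$pwm" ] && echo 153 > "$pwm"   # 153/255, roughly 60% duty
  done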

>
>>>>> == Host boot status indications ==
>>>>> 
>>>>> Any ODM that makes server products has had to deal with the
>>>>> psychological "dead
>>>>> server effect", where lack of visible progress during boot
>>>>> causes
>>>>> spurious
>>>>> callouts / RMAs.  It's even worse on desktop, especially if
>>>>> server-type
>>>>> hardware is used inside the machine.  We've worked around this a
>>>>> few times with
>>>>> our "IPL observer" services, and really do need this
>>>>> functionality
>>>>> in OpenBMC.
>>>>> The current version we have is both front panel lights and a
>>>>> progress bar on
>>>>> the BMC boot monitor (VGA/HDMI), and this is something we're
>>>>> willing to
>>>>> contribute upstream.
>>>> 
>>>> Great!  Let's get that merged!
>>>
>>>Sounds good!  The files aren't too complex:
>>>
>>>[links mangled by a mail gateway; the referenced trees are
>>> blackbird-skeleton/tree/pyiplobserver and
>>> blackbird-skeleton/tree/pyiplledmonitor]
>>>
>>>Is the skeleton repository the best place for a merge request?
>> 
>> hmm, as prototype code in python, maybe.   I don't think many
>> current
>> systems ship python.  Also upstream Yocto removed all support for
>> python 2.
>> 
>> In addition I see a mix of "copy the data" and "transform the data"
>> in the same script, such as
>> 
>> updateIPLLeds(self, initial_start, status_changed)
>> 
>> with
>>            # Show major ISTEP on LED bank
>>            # On Talos we only have three LEDs plus a fourth
>> indicator modification
>>            # bit, but the major ISTEPs range from 2 to 21
>>            # Try to condense that down to something more readily
>> displayable
>> 
>> 
>> [ After some thought, it's ok to be in the output code, as it's
>> formatting the data for the display. ]
>> 
>> 
>> The upstream post interface logs the post codes, and display is
>> a separate function.  The ipl_status_monitor seems to mix
>> monitoring
>> the port 80 snoops with other logic to determine the system state,
>> e.g. is the host up?
>> 
>> Also both scripts extensively use popen to handle device
>> communication
>> and some communication to other services (kill to post code).
>> 
>> 
>>>
>>>> I do think some others have support for a 7-seg display with the
>>>> postcodes going to it already.  I think this is along those same
>>>> lines.
>>>> It might just be another back-end for our existing post code
>>>> daemon
>>>> to
>>>> replicate them to the VGA and/or blink morse code on an LED.
>>>
>>>OK, so this is what we ran into before.  Where is this support
>>>in-tree, and do we need to reimplement our system to match what
>>>already exists (by extension, extending the other vendor code since
>>>our observer is more detailed in terms of status etc.), or would we
>>>be allowed to provide a competing solution to this other support,
>>>letting ODMs pick which one they wanted?
>>>
>> 
>> Our upstream code is at https://github.com/openbmc/phosphor-host-postd
>> for the snoop readers and the LED segment drivers, and the history
>> and Dbus owner is https://github.com/openbmc/phosphor-post-code-manager.
>> 
>> To catalog the source of the host and bmc there is
>> https://github.com/openbmc/phosphor-state-manager/blob/master/obmcutil
>> 
>> In addition to phosphor-misc for "one file projects" there is
>> openbmc-tools for handy tools which may be more developer focused.
>
>So it sounds like we'd need to rewrite this as a set of patches for
>phosphor-post-code-manager?  Would they actually be merged or would
>we run into resistance to extending the functionality of that system
>for our use case?

Actually I think the manager would stay, and you might be adding
an application similar to the 7-segment LED driver in the
phosphor-host-postd repository, to take the data snooped from
port 80 and format it for your display.
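
A first prototype of that back-end can be as small as streaming the
snoop device and reformatting the bytes (the device node name below
assumes the aspeed-lpc-snoop driver is bound; adjust to your
devicetree):

  # print each snooped port-80 progress code as it arrives
  hexdump -v -e '1/1 "host progress code: 0x%02x\n"' /dev/aspeed-lpc-snoop0

The real daemon would instead consume the value phosphor-host-postd
publishes on D-Bus, but the raw device is handy for bring-up.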

>
>>>>> == IPMI / BMC permissions ==
>>>>> 
>>>>> An item that's come up recently is that, at least on our older
>>>>> OpenBMC versions,
>>>>> there's a complete disconnect between the BMC's shell user
>>>>> database and the
>>>>> IPMI user database.
>> 
>> Mostly true, in part because the IPMI password for RMCP+ must be
>> stored on the BMC (reversibly encrypted for our implementation).
>> Note improper storage of this was an area of one or more CVEs.
>> 
>> In addition it has a limit of 20 characters in a password and 8
>> users.
>> 
>>>>> Resetting the BMC root password isn't possible from IPMI
>>>>> on the host, and setting up IPMI doesn't seem possible from the
>>>>>>BMC shell.  If
>> 
>> In our current code we have pam hooks that save the password
>> during a change, if the user is in the ipmi group and the
>> password is short enough (or returns an error).
>> 
>>>>> IPMI support is something OpenBMC provides alongside Redfish, it
>>>>> needs to be
>>>>> better integrated -- we're dealing with multiple locked-out BMC
>>>>> issues at the
>>>>> moment at various customer sites, and the recovery method is
>>>>> painful at best
>>>>> when it should be as simple as an ipmitool command from the host
>>>>> terminal.
>>>> 
>>>> I suspect most of this is a matter of IPMI command support and/or
>>>> enabling
>>>> those commands to the host IPMI path.  Most of us are fairly
>>>> untrusting
>>>> of IPMI (and the Host itself), so there hasn't been work to do
>>>> anything
>>>> here.  As long as whatever you're proposing can be disabled for
>>>> models
>>>> where we distrust the Host, it seems like these would be accepted
>>>> as
>>>> well.
>> 
>> 
>> Our current Redfish has multiple users and can enable and
>> disable users to have ipmi access and set their password.
>> 
>> 
>> Of course this just moves the goal posts to the Redfish
>> admin login, but in addition to mTLS certificate based
>> trust (which should be customized to the customer),
>> 
>> Redfish has the concept of a host firmware and os logins
>> including a binding for EFI to specify adapter path and
>> network in addition to read-once magic efi variables.  I
>> know OpenPOWER boxes don't have EFI but the information
>> could be exposed in a similar fashion.  As far as I know
>> we have not yet implemented these users in our Redfish
>> server.
>
>Honestly Redfish is something that we might just want to move to, and
>officially / formally drop network IPMI support.  Probably the
>biggest issue with that comes right back down to needing
>communication between the host and BMC, however -- ipmitool shortcuts
>the whole BMC/host network isolation problem (described above) by
>using the USB interface.  Is there a way to use Redfish over USB in a
>similar manner?

DEPRECATED ===== skip this for below

As I mentioned, the Redfish specification explicitly 
talks about having a login for the firmware and the 
booted OS, and requirements for the admin to allow 
or disallow the IDs.  In addition it talks about how 
the information is presented to an EFI boot.  The model 
generates a unique password for each boot using special 
EFI variables, including designation of the network path 
(the concept of a USB network or PCI slot and 
function, IP information, etc.).  It uses special 
read-once EFI variables to protect the password from 
casual snooping.

I don't think we (OpenBMC) have implemented this 
magic user, but I would anticipate that it would be 
accepted.

Also, for OpenPOWER we would likely want to define 
an OF binding.   Thinking about this, due to the 
desire to clear the value after fetch, something like 
the SYSPARMS API that can optionally request a value 
from the service processor might be appropriate, 
even though that is currently an FSP-only interface. 

https://github.com/open-power/skiboot/blob/master/doc/device-tree/ibm%2Copal/sysparams.rst
https://github.com/open-power/skiboot/blob/master/doc/opal-api/opal-param-89-90.rst

Another alternative would be the secvar interface 
if that could be common with userspace expecting 
the efi variables, but that would have to be 
multiplexed with the current secvar backend for 
secure boot management. 

DEPRECATED === END ===

The Redfish spec was updated to have an IPMI 
command to create a Bootstrapping credential 
that can then be used until disabled and will 
be invalidated by a Host Reset or Service 
Reset.   The expectation is this temporary 
role will be used to create a permanent 
account.  This service is only available on a 
designated interface and can be disabled or 
enabled from the Redfish HostInterface 
representation.

I believe this too would have to be 
implemented, but exposing the information
to an Open Firmware client is much easier 
as it could be a few properties in the 
device tree path.  USB networks are 
identified by vendor, device, and serial, 
and the ECM device serial is generated
using the BMC machine-id.
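
As a concrete sketch of the in-band flow (raw bytes here follow 
my reading of the Redfish Host Interface spec, DSP0270 -- verify 
against the spec before relying on them):

  # From the host OS, over the in-band IPMI path:
  # NetFn 0x2C (group extension), cmd 0x02 = Get Bootstrap Account
  # Credentials, group ID 0x52 (Redfish), 0xA5 = keep bootstrapping enabled
  ipmitool raw 0x2c 0x02 0x52 0xa5

The response carries a one-time user name and password which the 
host then uses against the Redfish HostInterface to create its 
permanent account.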

=====

For in-box communication, IPMI is being replaced in 
future stacks by PLDM and MCTP, as Redfish expects a 
reliable transport and, being string-based, is quite 
verbose.


https://github.com/openbmc/docs/blob/master/designs/pldm-stack.md
https://github.com/openbmc/docs/blob/master/designs/mctp/

>
>Thanks!
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: OpenBMC on RCS platforms - remote media
  2021-04-28 21:24         ` OpenBMC on RCS platforms - remote media Joseph Reynolds
@ 2021-06-03 12:29           ` Konstantin Klubnichkin
  0 siblings, 0 replies; 11+ messages in thread
From: Konstantin Klubnichkin @ 2021-06-03 12:29 UTC (permalink / raw)
  To: Joseph Reynolds, Timothy Pearson, Milton Miller II; +Cc: openbmc

[-- Attachment #1: Type: text/html, Size: 2479 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-06-03 12:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-23 14:30 OpenBMC on RCS platforms Timothy Pearson
2021-04-23 17:11 ` Patrick Williams
2021-04-23 18:46   ` Timothy Pearson
2021-04-26 21:42     ` Milton Miller II
2021-04-28 20:21       ` Timothy Pearson
2021-04-28 21:24         ` OpenBMC on RCS platforms - remote media Joseph Reynolds
2021-06-03 12:29           ` Konstantin Klubnichkin
2021-04-29  7:54       ` OpenBMC on RCS platforms Milton Miller II
2021-04-23 17:23 ` Ed Tanous
2021-04-23 19:00   ` Timothy Pearson
2021-04-23 19:23     ` Ed Tanous
