Friday May 08, 2009
Here's how it works. When one or more of the cores on the processor is in the P0 (max perf) state, those cores may enter Turbo Boost mode, allowing them to run at a faster clock frequency. "How much faster" depends on how much power and thermal headroom is available, but in general, the more cores there are on the socket that are idle and power managed (via C-states), the faster the remaining running core(s) will go. With the introduction of OpenSolaris support for Deep C-states, which integrated in build 110, we're certainly seeing the effects...since the system is now readily taking advantage of the deeper C-states, turbo boost happens all the time. :)
But how does one observe this? It's certainly useful to know when Turbo Boost is happening, and how much of an "overclock" the processor is achieving. Fortunately, in build 110 Rafael Vanoni pushed some changes to PowerTOP that provide this observability:
This is a screenshot of PowerTOP running on a Xeon 5500 (Nehalem) based system. Notice in the P-states (Frequencies) column that the highest clock speed has (turbo) next to it. As turbo mode is entered, PowerTOP will track the average frequency of the system's processors over the sampling interval. You can actually watch that top end frequency fluctuate as utilization across the system changes.
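To make "average frequency over the sampling interval" a bit more concrete, here's a minimal sketch of one common way such an average can be derived: take two counters, one that ticks at the fixed base frequency and one that ticks at the actual running frequency, and scale the base frequency by the ratio of their deltas. This is an illustration only, not necessarily how PowerTOP computes it; the function name, the parameter names and the sample numbers are all made up for the example.

```python
def average_frequency_mhz(base_mhz, actual_delta, reference_delta):
    """Average effective clock over one sampling interval.

    reference_delta: ticks counted at the fixed base frequency (hypothetical input)
    actual_delta:    ticks counted at the actual running frequency (hypothetical input)
    """
    if reference_delta <= 0:
        raise ValueError("no non-idle time in the sampling interval")
    # If the core ran faster than base (turbo), actual_delta exceeds
    # reference_delta and the result lands above base_mhz.
    return base_mhz * actual_delta / reference_delta


# Hypothetical sample: a 2667 MHz part that averaged 10% above base.
print(round(average_frequency_mhz(2667, 1_100_000, 1_000_000)))
```

A value above the base frequency means the core spent at least part of the interval in turbo mode, which is the fluctuation you'd see at the top of the frequency column.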
Clearly, this observability is important for understanding system performance (but more importantly, for performance determinism). Very cool stuff...and yes, this will be present in the upcoming OpenSolaris release. Here's a video we did in which Rafael talks more about this...
Wednesday Apr 01, 2009
Monday Mar 30, 2009
Indeed, the job of thread placement on modern systems has become quite interesting. Just about every modern processor on the market these days is (at least) multi-core, with many also presenting multiple hardware "threads", "strands", or "Hyper Threads" sharing instruction or floating point pipelines...and then there's shared caches, crypto accelerators, memory controllers... So there's a lot to consider when deciding where (on which logical CPUs) a given handful of threads should execute. Where possible we've tried to avoid having threads fight over shared system resources. If the load is light enough, and enough system resources exist that each thread can have its own pipeline, cache (or even socket)...that's a pretty good strategy for mitigating potential resource contention.
All this good stuff is made possible by the kernel's Processor Group based CMT scheduling subsystem, which (at boot) enumerates all the "interesting" relationships that exist between the system's logical CPUs...which in turn allows the dispatcher to be smart about how it utilizes those CPUs to deliver great performance.
We (or at least I) didn't realize it at the time, but all this work we were doing to make the dispatcher smarter about how it uses the CPUs also turns out to be really useful for being smart about how you're *not* using the CPUs. This means that in addition to optimizing for performance, this same dispatcher awareness can be used to optimize for power efficiency.

As part of the Power Aware Dispatcher project, we extended the kernel's CMT scheduling subsystem to enumerate groups of logical CPUs representing active and idle CPU Power Management Domains. On x86 systems, these domains are enumerated through ACPI. Being aware of these domains allows the dispatcher to place threads in ways that not only optimize performance for shared system resources, but also maximize opportunities to power manage CPUs. For example, the dispatcher may try to coalesce light utilization on the system onto a smaller number of power domains (e.g. sockets), thus freeing up other CPU resources in the system to be power managed more deeply. On Intel Xeon 5500 processor series based systems, this enables us to take better advantage of the processor's deep idle power management features, including deep C-states.
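Here's a toy sketch of the two placement strategies side by side: "spread" gives each thread its own socket where possible (good for avoiding contention), while "coalesce" packs light load onto as few sockets as possible (good for leaving whole sockets idle). It is a simplified model and not the dispatcher's actual algorithm; the function names, the policy names and the idea of counting threads per socket are assumptions made for the illustration.

```python
def place(num_threads, sockets, cores_per_socket, policy):
    """Return the number of running threads per socket under a toy policy.

    policy is "spread" or "coalesce" (hypothetical names for this sketch).
    """
    load = [0] * sockets
    for _ in range(num_threads):
        # Only sockets with a free core can take another thread.
        candidates = [s for s in range(sockets) if load[s] < cores_per_socket]
        if not candidates:
            raise ValueError("more threads than cores")
        if policy == "spread":
            # Least loaded socket first: minimizes sharing of socket resources.
            target = min(candidates, key=lambda s: load[s])
        elif policy == "coalesce":
            # Most loaded socket that still has room: keeps other sockets idle.
            target = max(candidates, key=lambda s: load[s])
        else:
            raise ValueError("unknown policy: " + policy)
        load[target] += 1
    return load


def idle_sockets(load):
    """Sockets with nothing running are candidates for deep power management."""
    return sum(1 for n in load if n == 0)


# Two threads on a two-socket, four-cores-per-socket system.
print(place(2, 2, 4, "spread"), idle_sockets(place(2, 2, 4, "spread")))
print(place(2, 2, 4, "coalesce"), idle_sockets(place(2, 2, 4, "coalesce")))
```

With two threads, spreading keeps both sockets busy, while coalescing leaves one socket completely idle. A real dispatcher has to weigh the two against each other, since packing threads together brings back some of the resource sharing that spreading avoids.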
Also, consistent with our goals around the Tickless Kernel Architecture project, the Power Aware Dispatcher is an "event based" CPU power management architecture, which means that all CPU power state changes are driven entirely by utilization events triggered by the dispatcher as threads come and go from the CPUs. One clear benefit of this is that when the system is idle, there is no need to periodically wake up to check CPU utilization (which in itself is inefficient and wasteful). It also means that the kernel can be aggressive about adjusting resource power states (in near real-time) with respect to changes in utilization.
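The "event based" idea can be sketched in a few lines: a power domain changes state only when a thread arrives on, or departs from, one of its CPUs, so nothing ever has to poll. This is a toy model, not kernel code; the class, the method names and the two state names are hypothetical.

```python
class PowerDomain:
    """Toy power domain whose state changes only on utilization events."""

    def __init__(self, name):
        self.name = name
        self.running = 0            # threads currently running in this domain
        self.state = "low-power"
        self.transitions = []       # record of every state change

    def thread_arrived(self):
        self.running += 1
        # Only the idle-to-busy edge needs a power state change.
        if self.running == 1:
            self._set_state("full-power")

    def thread_departed(self):
        if self.running == 0:
            raise RuntimeError("no running thread to remove")
        self.running -= 1
        # Only the busy-to-idle edge needs a power state change.
        if self.running == 0:
            self._set_state("low-power")

    def _set_state(self, state):
        self.state = state
        self.transitions.append(state)


domain = PowerDomain("socket0")
domain.thread_arrived()
domain.thread_arrived()
domain.thread_departed()
domain.thread_departed()
print(domain.transitions)
```

Four dispatcher events produce just two state changes, and while the domain sits idle no code runs on its behalf at all, which is the property a periodic utilization check can't offer.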
We really like thinking about Power Management as just another piece of Resource Management. By designing efficient resource utilization into the kernel subsystems that deal with power manageable hardware resources...we can be smart about how we utilize the system (for improved performance), and how we *don't* use the system (to leverage power management features). The power efficiency results we're seeing with PAD are impressive, and we're really looking forward to building on the PAD work we integrated into build 110 in the months ahead.
Tuesday Aug 21, 2007
Although performance remains key, at what cost should that performance be delivered? We *must* engineer systems to deliver the performance that Sun / Solaris customers have come to expect while using no more resources than are necessary to do so. Beyond performance, we must deliver efficiency. Therein lies the challenge of Project Tesla.
Thursday Jul 26, 2007
Tuesday May 29, 2007
Monday May 28, 2007
The "Making Solaris a better Linux than Linux" quote referenced in the Slashdot post seems to have elicited a wide range of responses from folks in the community. Some folks have expressed that they don't want to see Solaris "become a better Linux", out of concern that Solaris would lose some of its differentiating strengths (backward compatibility / stability being a frequently raised example). Others on the thread have pointed out examples of things in the Solaris environment that they feel represent barriers for adoption...which in turn has elicited more debate as to whether those barriers are really barriers, and then more debate still as to how best to deal with them. :)
At the SVOSUG meeting, Ian gave some background describing where he's coming from, why he decided to join Sun to advocate for OpenSolaris, and his vision for Project Indiana. The devil is in the details, and it's pretty clear there are many of them, but the motivation and idea behind Project Indiana (or at least my take on it) seems fairly simple: provide OpenSolaris with the features it needs to appeal to, and be welcoming of, Linux enthusiasts and/or folks who would otherwise reach for a Linux solution.
At the meeting, I said I felt that the goal shouldn't necessarily be to make Solaris a better Linux than Linux...but to make Solaris a better Solaris, such that it appeals to Linux enthusiasts more than Linux itself does. The difference is where you set your sights. I don't believe there's any shortage of opportunity. While OpenSolaris is superior in many ways, I believe it's deficient in others. I find myself carrying around a short mental list of things that (for me) are missing, or deficient in OpenSolaris that I suspect could represent an adoption "show stopper" for someone else. My short list represents the feature gap that exists between where OpenSolaris is, and where (as a developer) I wish it would be.
I suspect that such a list would vary depending on who you ask. For Project Indiana, I would imagine that characterizing what this list would look like from the perspective of a Linux enthusiast, as well as someone who tried (and gave up on) OpenSolaris would be a useful start.
Monday May 21, 2007
As part of the Tesla Project, we're going to be looking at providing a "scheduled" clock implementation. The clock cyclic currently fires 100 times a second somewhere in the system. From a power management perspective, it would be nice if the clock fired only when necessary (something is scheduled to time out, scheduled accounting is due, etc). This would allow the CPU on which the clock cyclic fires to potentially remain quiescent much longer (on average), which in turn would mean that the CPU could remain longer in (or go deeper into) a power saving state.
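To get a feel for the difference, here's a small sketch that counts wakeups over one second for a clock that fires on every 10 ms tick versus one that fires only on ticks where something is actually due. It is a toy model and not the cyclic subsystem; the function names and the sample deadlines are made up, and it assumes deadlines are still serviced on 10 ms tick boundaries.

```python
import heapq

TICK_MS = 10  # 100 firings per second, as with the current clock cyclic


def periodic_wakeups(duration_ms):
    """Number of times a fixed-rate clock fires over the interval."""
    return duration_ms // TICK_MS


def scheduled_wakeups(deadlines_ms, duration_ms):
    """Tick times at which a 'scheduled' clock would need to fire.

    deadlines_ms holds pending timeouts, in ms from now (hypothetical input).
    """
    pending = list(deadlines_ms)
    heapq.heapify(pending)
    wake_ticks = []
    while pending and pending[0] <= duration_ms:
        # Round the earliest deadline up to the next tick boundary.
        tick = -(-pending[0] // TICK_MS) * TICK_MS
        # Everything due by that tick is handled in the same wakeup.
        while pending and pending[0] <= tick:
            heapq.heappop(pending)
        wake_ticks.append(tick)
    return wake_ticks


deadlines = [25, 28, 400, 990]  # hypothetical pending timeouts
print(periodic_wakeups(1000))              # fixed-rate clock
print(scheduled_wakeups(deadlines, 1000))  # fires only when needed
```

In this made-up case the fixed-rate clock wakes the CPU 100 times, while the scheduled clock wakes it 3 times, leaving long quiescent stretches in between.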
It might be that the scaling issue becomes less so if the clock doesn't always have to fire. Then again, this may be one of those "elephant in the living room" type issues...you can pretend that it isn't there only so long... :)