Runtime PM support for Intel Linux Graphics – Part 3: system-wide power management

This is the third part of the documentation for the Runtime Power Management (RPM) subsystem of the Intel Linux Kernel Graphics Driver, also known as i915.ko. In this part I will explain the power management interactions between the graphics device and the rest of the system. Part 1 gave an overview of the feature, and Part 2 explained the implementation details.

I should also point out that although I am a Graphics developer and I believe I have a good knowledge of the Graphics part of the problem, I am not an expert on general power management, so I may say something wrong. If you notice that, please correct me!

Disclaimer

I work for Intel and I am a member of its Linux Kernel Graphics team. In this text I will talk about things that are related to my work at Intel, but this text reflects only my own opinion, not necessarily Intel’s opinion. Everything I talk about here is already public information.

Basic concepts

The very first thing you need to do is learn the basic concepts related to power management: Core C states – or just C states – and Package C states – or PC states. Since descriptions of these concepts were already written by people who know more about them than I do, I will just provide you with some pointers:

The small detail that makes a lot of power management comparisons wrong

Now that you know what the PC states are, you understand that the current PC state directly affects the system’s total power consumption. You also understand that if just one little piece of the whole system is not allowing a certain PC state, then the whole system will not enter that PC state. And this is the big problem.

Let’s say you have a system with a disk that is only allowing up to the PC3 state, while your graphics device is allowing up to the PC6 state. That means you are stuck at PC3: you will not reach PC6. Then you enable a certain graphics power management feature – let’s say, for example, Framebuffer Compression (FBC) – and the graphics device starts allowing PC7 instead of PC6. Since the disk is still there, preventing anything deeper than PC3, your system will stay at PC3 or shallower. Then you measure the power consumption both before and after enabling FBC to see how much it changes. You will conclude that the difference is zero! Why? Because the disk is keeping your machine at PC3. Does that mean the FBC feature does not save any power? No. What it means is that you have to fix your disk device power management configuration first.

Then let’s say you fix the disk power management policies so the disk starts allowing PC7. Now, when you disable FBC your machine will reach only up to the PC6 state, but when you enable FBC you will reach up to PC7. And now, if you measure the power consumption, you will notice a difference. But even this case can lead you to wrong conclusions: if you’re trying to check how much you can save by properly configuring your disk power management policies, you will reach different conclusions depending on whether FBC is enabled or not! Power management comparisons are really not easy, and you usually cannot look at just one feature: you have to consider the whole platform, its states and your residencies. You always have to know what exactly is preventing you from reaching the deeper states before concluding anything.

Another common misconception

Another common misconception that I have to mention here concerns the relationship between power management and performance. A lot of people assume that enabling power management features sacrifices performance. While this may be true in some cases, it is false in others! Power management features help your system not only draw less power when idle, but also stay at cooler temperatures. This means that when you really need performance, you will be less limited by the thermal management system: since it will take more time to reach the high temperatures, you will be able to run at higher clocks for longer.

I am not an expert in this area, so I don’t really want to write too much and risk saying something wrong. But I suggest you read the Thermal Management chapter of the datasheet I referenced earlier, and pay close attention to the Thermal Considerations and Turbo Boost sections!

The tools

After reading the last sections you may be wondering how to find out which device is preventing the system from reaching the deeper package C states. Unfortunately, as far as I know, there is really no way to discover that directly today. But we do have a tool that shows how much time your system spends in each state and that also gives you some hints on changes you could try in order to reach deeper PC states: PowerTOP.

Since you already read PowerTOP’s user manual – referenced earlier in this text – you already know how to use it. I suggest you play with the tunables tab and try to see how each parameter affects the system. Do you see a change in the PC state residencies after toggling a certain parameter? Is the difference for one parameter only noticeable after you toggle another parameter? Is there a group of parameters that need to be toggled together to allow deeper PC states? Also, since you’re basically changing the default parameters of the devices, you may see undesired side effects, such as a mouse that refuses to come back to life after you move it. Unfortunately, some power saving features are disabled by default exactly because they can cause bad side effects: that’s why you have to change the defaults. On the good side, not everybody is affected by these side effects, so your system may be just fine.

Another nice tool is Turbostat, which lives inside the Kernel source tree, under tools/power/x86/turbostat. When I am trying to discover whether a certain change makes a difference in the PC state residencies, I find it much easier to look at Turbostat’s output than at PowerTOP’s, simply because Turbostat just prints all the values on new lines, allowing you to quickly scroll up to see the past. I usually have one terminal on PowerTOP’s tunables tab and another terminal with Turbostat running, so I just keep toggling the tunables and watching whether the residencies change.
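
Here is a minimal sketch of that workflow. I am assuming Turbostat is installed and run as root; the exact flags and the package C state column names vary between Turbostat versions, so check its help output first:

# Terminal 1: toggle parameters in PowerTOP's tunables tab.
sudo powertop

# Terminal 2: print the C state residencies periodically.
# Older Turbostat versions use "-i 5" instead of "--interval 5".
sudo turbostat --interval 5

# For before/after comparisons, take a fixed-length idle sample with
# each configuration and compare the package C state columns.
sudo turbostat sleep 30

The last form runs the given command – here just sleep – and prints a summary for that window, which makes the before/after numbers easy to compare.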

Back to the graphics device

Now that you understand PC states, I can tell you that there are many graphics features that affect the deepest possible PC state. I won’t list every possible feature here, but the main rule is: the more pixels you process per second, the more power you will draw. In the datasheet provided at the beginning of this text, please read Section 4.2.6: Package C-States and Display Resolutions.

Did you read it? Ok, we can proceed. I know you just read it, but I need to emphasize: the screen resolution is not the only thing that can limit the deepest possible PC state. For example, if you have a lot of rendering, or if some feature such as FBC is disabled, your deepest possible PC state may be limited.

Screen on vs screen off

If you really read that datasheet, you may have noticed that, on those specific processors, the deepest PC state you can reach with the screen on is PC7, while the deepest PC state you can reach with the screen off is PC10. When there are no pixels to process, everything gets easier.

From the graphics driver point of view, it is much easier to make sure we allow the deepest possible PC state while the screen is off, and you can usually expect i915.ko to be in really good shape in this respect. You can use this information to find out whether the graphics device is the limiting factor for your PC states. First, close all the applications that could be doing off-screen rendering. Then launch Turbostat and check which PC state you can reach. Then disable the screen – see Part 1 for possible commands –, wait about 30 seconds for things to settle down and for Turbostat to print its output, and check whether you can now reach deeper PC states. If the answer is yes, then the graphics device is the limiting factor on your system. If not, then something else is probably limiting how deep you can go. Of course, i915.ko runtime PM needs to be properly configured and enabled while you’re doing this.
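
Condensed into commands, the experiment looks roughly like this – a sketch assuming an X session with DPMS support and Turbostat installed. Run it from a script or as a single line, so that typing doesn’t wake the screen back up:

sudo turbostat sleep 10   # baseline: deepest PC state with the screen on
xset dpms force off       # turn the screen off (see Part 1)
sleep 30                  # wait for things to settle down
sudo turbostat sleep 10   # do deeper PC states show up now?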

If you really want to guarantee that no badly behaved applications are interfering with your test, I suggest you close your display manager and all applications, and run tests/pm_rpm --stay from intel-gpu-tools. Then you can run the experiment with Turbostat.
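
For example – assuming a built intel-gpu-tools checkout and no display server running:

sudo ./tests/pm_rpm --stay &   # keeps the device in the runtime suspended state
sudo turbostat sleep 30        # check which PC states you reach now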

Common bottlenecks I observed

If you go back to Part 1 you will notice I already mentioned the power management bottlenecks that I usually observe on my own systems: the disk and audio power management policies. I usually find that if I change only these policies I can already reach the deepest possible PC states on my development machines. I won’t get great residencies without tuning everything else, but I will at least be able to reach those deepest PC states.

But remember: every system is different, and I have no idea what is attached to your machine, so I can’t really guarantee that the recipe that worked for me will also work for you. I don’t know which wireless card you have, I don’t know which devices are plugged into your system, I don’t know how fast your memory is, so I can’t really provide you with a universal tool for power management. The most I can do is suggest you run PowerTOP’s auto tune feature.

The good news is that there are people trying to solve both the disk and audio problems. Want to help on the disk side? See this blog post from Matthew Garrett.

Back to graphics Runtime PM

So what does the graphics runtime PM implementation have to do with all these things I just explained? It’s very simple: when we runtime suspend i915.ko, we completely disable the graphics device. So, in theory, it should start allowing the deepest possible PC states. But, as I explained, even though this allows power savings, it does not really guarantee that you will save power, because the other devices might be preventing you from reaching the deepest PC states. But at least now we – the graphics team – probably won’t be the reason why you’re not saving more power, so you will have to blame other people.

Conclusion

My main goal with these posts was to teach you the little things I know about power management. I really hope you use this acquired knowledge to make our world a greener place. I also have to ask you: please help us improve the state of power management in the Linux ecosystem! Please help test all these features. Please find the main bottlenecks of your system. Please engage with the upstream developers of the many subsystems and help them change the default parameters and policies of the devices so they can save more power. Please start power management discussions with the distributions, so they can review their tools and default values too.

And before I forget, here’s a link to a comic that will give you even more motivation to help make the world consume less power. You can be like Superman! http://www.smbc-comics.com/?id=2305

Runtime PM support for Intel Linux Graphics – Part 2: the internal parts

Welcome to the second part of the Runtime PM (RPM) documentation. In the first part I gave an overview of the Intel Linux Kernel Graphics Runtime PM infrastructure. In this part I will explain a little bit about the feature design, debugging, and our test suite. If you’re someone who has to maintain i915.ko for your users, you should definitely read this.

Disclaimer

I work for Intel and I am a member of its Linux Kernel Graphics team. In this text I will talk about things that are related to my work at Intel, but this text reflects only my own opinion, not necessarily Intel’s opinion. Everything I talk about here is already public information.

Feature design

The design of the i915 RPM code is based on reference counting: whenever some code needs to use the graphics device, it calls intel_runtime_pm_get(), and whenever it is done using the hardware, it calls intel_runtime_pm_put(). Whenever the refcount reaches zero we can runtime suspend the device. This is the most basic concept, but we have some things on top.

An interesting detail of some of our hardware generations is the concept of power wells: specific pieces of the hardware that can be powered off independently of the others, saving some power. For example, on Haswell, if you’re just using the eDP panel on pipe A, you can turn off the power well responsible for pipes B and C, saving some power in a use case that is very common for laptop owners. Check the diagram on page 139 of the public Haswell documentation to see which pieces of the hardware are affected by the power well.

As you can see in drivers/gpu/drm/i915/intel_runtime_pm.c, different platforms have different sets of power wells, so we created the power domain concept in order to map our code abstractions to the different power wells on the different platforms. For example, the display code has to grab POWER_DOMAIN_PIPE_B when it uses pipe B, but the real power wells that actually get grabbed when this power domain is grabbed depend on the platform. Just like the runtime PM subsystem, the power domains framework is based on reference counting. It is also worth mentioning that whenever any power domain is grabbed, we also call intel_runtime_pm_get().

Besides the power domains, there are other pieces of our code that grab RPM references, such as the forcewake code – which prevents RC6 from being enabled – and a few others. To discover them, just grep our code for intel_runtime_pm_get() calls.
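
For example, from the top of a kernel tree:

grep -rn "intel_runtime_pm_get" drivers/gpu/drm/i915/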

The difficulties

Developers: a new paradigm

Before all this was implemented, the state of power management in our driver was as simple as it could be: the hardware was always there for the driver, so any time the driver needed to interact with it – such as when reading or writing registers – it would succeed. After RPM and the power domains were implemented, the situation changed: the hardware might decide to go away for a nap, so if the driver does not explicitly prevent it from sleeping or explicitly wake it up, all the communication with the hardware will be ignored.

It is also important to remember that when the hardware gets runtime suspended – or when a specific power well is powered down – it may lose some of its state. So if you write a register, runtime suspend, resume, and then read the register, the value you read may not be the value you wrote. We therefore need to make sure all the relevant hardware state is reprogrammed whenever we resume – or that we just don’t runtime suspend while there is state we need to keep.

And since the developers were all used to the previous model, where they never needed to think about grabbing reference counts before doing things, we had a long period in which regression after regression was added. This was a big pain: developers and reviewers kept forgetting to grab the refcounts and forgetting that the hardware might just go away and lose some of its state. Today the situation is much better, since the power management concepts are now usually remembered during code writing and reviewing, but we’re still not regression-free, and we’ll probably never be – unless the developers get replaced by robots or something else.

Driver entry points

Another major problem contributing to the difficulty of RPM is that the graphics driver has way too many entry points. We interact massively with drm.ko: it calls many of our functions and we call many of its functions. We have a huge number of IOCTLs, some defined by our own driver and some inherited from drm.ko. We also have files in different places, such as sysfs and debugfs. We have the command submission infrastructure, which has IOCTLs as its entry points but requires the hardware to be awake even after the IOCTL is over. We allow user space to map graphics memory. We have many workqueues and delayed work functions. All this and more. Most of these interfaces require the hardware to be awake at some point – or to stay awake even after the interface is used – so it is really hard to guarantee that all the possible holes are covered by our reference count get() and put() calls.

Debugging

Given all the difficulties listed above, it is easy to see that we can’t really be sure we covered all the holes, nor that they will stay covered forever, which leads to high anxiety levels for the developers and a lot of work for the bug triagers. So in order to reduce our anxiety levels we decided to add code to help us catch these possible problems.

While the upper layers and the driver entry points are many and difficult to check, we have a few specific functions at the end of our call stack that are responsible for actually touching the hardware, so we added some assertions to these functions. One of these assertions is a function called assert_device_not_suspended(), which is called whenever we read or write a register, and also whenever we release forcewake. Another assertion we have is the unclaimed register checking code: on certain pieces of our hardware, if we try to read from or write to a register that does not exist, a specific bit in a specific register will change, so we can tell that we did an operation that was ignored by the hardware. We also have a few other assertions that I can’t remember right now, but the ones explained above are the most important. We are also probably missing assertions like these at other points of our code, and patches are welcome. Bug reports too.
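
When one of these assertions fires, the complaint ends up in dmesg, so after exercising the driver it is worth looking for messages like the ones below – the exact wording varies between kernel versions:

dmesg | grep -iE "(unclaimed|suspended)"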

As part of the Kernel, we also have some debugfs files that print information about the current state of some features, such as the reference counts for the power wells, among other things.
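
For example – assuming debugfs is mounted at /sys/kernel/debug and the graphics device is DRM device 0; the exact file names vary between kernel versions, so list the directory first:

sudo ls /sys/kernel/debug/dri/0/
sudo cat /sys/kernel/debug/dri/0/i915_power_domain_info   # power domain refcounts, if your kernel has this file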

In addition to the Kernel code, we also use intel-gpu-tools (IGT) to try to catch RPM bugs. First of all, all the existing IGT tests can potentially trigger the assertions above, so every IGT test somehow helps test RPM. Second, we added RPM subtests to some of the tests that already existed before RPM was implemented; usually these subtests have the word suspend as part of their names. Third, we added a test called pm_rpm whose goal is to explicitly test the areas not covered by the other tests.

Most of the subtests of pm_rpm follow the same script: (i) put the device in a state where it can runtime suspend; (ii) assert it is runtime suspended; (iii) use one of the driver interfaces; (iv) make sure everything is fine and the operation just done had the desired effect; and finally (v) make sure the device runtime suspends after everything is finished. We do have variations and other special things, but this is the main idea. If you’re interested in finding out more about the many interfaces between i915.ko and the rest of the system, this test is a very good place to look.

So if you’re a distribution maintainer or someone else interested in making use of i915.ko runtime suspend, please run intel-gpu-tools in order to check the driver’s sanity. If you can’t run it all, please make sure you at least pass all the subtests of pm_rpm, and also make sure none of these tests produce Kernel WARNs, BUGs or any other messages at the error level (hint: dmesg | egrep "(WARN|ERR|BUG)").
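
A minimal run could look like the sketch below – assuming a built intel-gpu-tools checkout; IGT binaries accept --list-subtests and --run-subtest, although the available subtest names vary between IGT versions:

./tests/pm_rpm --list-subtests
sudo ./tests/pm_rpm
dmesg | egrep "(WARN|ERR|BUG)"   # ideally prints nothing new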

I found a problem, your driver sucks!

Well, congratulations! You’re one step away from becoming a Linux Kernel Contributor! That is going to look good on your résumé, isn’t it? You’re welcome.

If you read everything up to this point, you can already imagine how complex everything is, so I hope you’re not mad at us. The very first thing you need to do is report the problem. While we do look at bugs reported through email, the best way to guarantee that your problem won’t be forgotten is to file a bug at freedesktop.org.

Now, if you’re feeling adventurous, you can try to discover how to reproduce the bug. Once you do that, you can try to write a test case for intel-gpu-tools – possibly a subtest for pm_rpm.c. And if you really care, you can try to implement new assertions in our Kernel driver to prevent similar bugs. And now your résumé is going to look really good!

Runtime PM support for Intel Linux Graphics – Part 1: for the users

For more than a year now the Intel Linux Graphics team has been working on the Runtime Power Management (RPM) implementation for our Kernel driver, i915.ko. Since a feature is useless if nobody knows about it, I figured it’s time to document things a little bit in order to spread the knowledge.

Disclaimer

I work for Intel and I am a member of its Linux Kernel Graphics team. In this text I will talk about things that are related to my work at Intel, but this text reflects only my own opinion, not necessarily Intel’s opinion. Everything I talk about here is already publicly documented somewhere else.

What is it?

The basic goal of the i915.ko RPM implementation is to try to reduce the power consumption when all screens are disabled and the graphics device is idle. We basically shut down the whole graphics device and put it in the PCI D3 state. This implementation uses the Kernel’s standard runtime PM framework, so the way you control i915.ko RPM should be the same way you control RPM for the other devices.

There are quite a few documents and presentations explaining Kernel RPM, and you can learn a lot from them. Use your favorite search engine!

Which platforms support it?

For now, the platforms that support runtime suspend are SNB and newer, except for IVB. The power saving features make more sense on HSW/VLV and newer platforms, but SNB support was added as a proof of concept that the RPM code could support multiple platforms. Also notice that we don’t yet have support for IVB (which is between SNB and HSW), but adding it would be very simple and would probably share most (all?) of the SNB code. To be 100% sure of what is supported, grab the Kernel source code and look at the definition of the HAS_RUNTIME_PM macro inside drivers/gpu/drm/i915/i915_drv.h.
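
For example, from a kernel source checkout:

grep -n "HAS_RUNTIME_PM" drivers/gpu/drm/i915/i915_drv.h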

When do we runtime suspend?

First of all, you need a very recent Kernel, because the older ones don’t support RPM for our graphics device. Then you have to enable RPM for the graphics device. Then, all screens have to be disabled and no applications can be using the graphics device. After all these conditions are met, we start counting time. If no screen is re-enabled and no user-space application uses the GPU for some time, we runtime suspend. If something uses the GPU before the timeout expires, the timeout is restarted after the application finishes using the GPU. In other words: we use the autosuspend delay from the Kernel RPM infrastructure. The current default autosuspend delay is 10 seconds, but it can be changed at runtime.

When I say “all screens have to be disabled”, I mean that, on Kernels older than 3.16, all CRTCs have to be completely unset, without attached framebuffers. On Kernel 3.16 and newer, RPM can also happen if the screens are in the DPMS off state, which is a much more common case – just leave your machine idle and it will probably enter DPMS off at some point.

Why ten seconds?

Currently, the default autosuspend delay is set to 10 seconds, which means we will only runtime suspend if nothing uses the GPU for 10 seconds. This is a very conservative value: you won’t get to the runtime suspended state unless your applications are really quiet. Some applications keep waking up the GPU consistently every second, so in these environments you won’t really see RPM.

Even though I arbitrarily picked 10s as the default value, I admit it is too conservative. But there are a few reasons why I chose it:

  • With a big timeout, RPM is less likely to start happening on everybody’s machines at the same time, which minimizes the damage caused by possible bugs. When I proposed the 10s timeout, RPM was a new feature, and new features usually bring new bugs.
  • If the timeout value is too low, there’s a possibility that we may do way too many runtime suspend + resume cycles per second, and although I never measured this, it is pretty safe to assume that doing all these suspend + resume cycles wastes more power than not doing them at all. Besides, if you’ve been idle for a whole ten seconds, it’s likely that you’ll remain idle for at least a few more seconds.
  • I was expecting someone – maybe even me – would do some very thorough experiments and conclude some ideal value: one that makes sure we don’t suspend + resume so frequently that we waste power doing it, but that also makes sure we take all the opportunities we can to save power.
  • A conservative value would encourage people to analyze their user-space apps and tune them so they would wake up the GPU – and maybe even the CPU – only when absolutely necessary, which would benefit not only the cases where we have RPM enabled, but also the cases where we are just using the applications.
  • The value can be changed at runtime! Do you want 1ms? You can have it!

Of course, due to all the complexities involved, I imagine that the ideal runtime suspend delay depends on the set of user-space applications that is running, so in theory the value should be different for every single use case.

Update: after I wrote this text, but before I published it publicly, Daniel Vetter sent a patch proposing a new timeout of 100ms.

How do I play with it?

To be able to change most of these parameters you will need to be root. Since our graphics device is always PCI device 00:02.0, the files will always have the same locations (a combined example follows the list):

  • /sys/devices/pci0000:00/0000:00:02.0/power/control: this is the file you use to enable and disable graphics RPM. The values are counterintuitive: “auto” means “RPM enabled”, while “on” means “RPM disabled”. Write to this file using “echo” to change the setting, read it to check the current state.
  • /sys/devices/pci0000:00/0000:00:02.0/power/autosuspend_delay_ms: this is the file that controls how long we wait, after the last use of the device, before runtime suspending. Remember that the number is in milliseconds, so “10” means 10ms, not 10s. Write to this file to change the delay, read it to check the current value.
  • /sys/devices/pci0000:00/0000:00:02.0/power/runtime_status: this is the file you use to check the RPM status. It can print “suspended”, “suspending”, “resuming” and “active”.
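
Putting it all together – assuming the 00:02.0 location above and a root shell:

echo auto > /sys/devices/pci0000:00/0000:00:02.0/power/control                 # enable RPM
echo 10000 > /sys/devices/pci0000:00/0000:00:02.0/power/autosuspend_delay_ms   # 10 seconds
cat /sys/devices/pci0000:00/0000:00:02.0/power/runtime_status                  # prints "suspended" once we suspend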

Due to some funny interactions between the graphics and audio devices – needed for things like HDMI audio – you will also need to enable audio runtime PM. I won’t go into the details of how audio RPM works because I really don’t know much about it, but I know that to enable it you run:

echo 1 > /sys/module/snd_hda_intel/parameters/power_save
echo auto > /sys/bus/pci/devices/0000:00:03.0/power/control

Or you can just blacklist the snd_hda_intel module, but that means you will lose its features.

As an alternative to all the instructions above, you can also try to just run:

sudo powertop --auto-tune

This command will not only enable graphics and audio RPM, but will also try to tune your whole system to use less power. While this is good, it can also have some bad side effects, such as disabling your USB mouse. If something annoys you, you can then run sudo powertop, go to the tunables tab and undo what you want.

After all this, to force runtime PM you can do the following (a combined check is sketched right after this list):

  • If your Kernel supports runtime PM for DPMS, you can just run:
    xset dpms force off
    In this case, if you want to enable the screens again, you just need to move the mouse.
  • If your Kernel is older, you can run:
    xrandr
    Then, for each connected output, you run:
    xrandr --output $output --off
    In this case, if you want to enable the screens again, you will have to run, for each output:
    xrandr --output $output --auto
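
And here is a rough end-to-end check, assuming a Kernel with DPMS runtime PM support and the default 10-second autosuspend delay:

xset dpms force off
sleep 15   # a bit longer than the autosuspend delay; don't touch the mouse!
cat /sys/devices/pci0000:00/0000:00:02.0/power/runtime_status   # should print "suspended"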

If you boot with drm.debug=0xe, you can look for “Device suspended” and “Device resumed” messages in the output of dmesg. This can be done with:

dmesg | grep intel_runtime_

What is the difference between runtime suspend and the usual suspend?

The “suspend” that everybody is used to is something that happens to the whole platform. All devices and applications stop, the machine enters the S3 state, and you just can’t use it while it remains in that state. The runtime suspend feature allows only the devices that are currently unused to be suspended, so all the other devices still work, and user space also keeps working. Then, when something – such as user space – needs access to the device, it gets resumed.

What’s next?

In the next part of our documentation I will explain the development side of the RPM feature, including the feature design, its problems and how to debug them.

Credits

I have to mention that even though I did a lot of work on the RPM subsystem of our driver, a big number of other developers also did huge amounts of work here too, so they have to take the credit! Since pretty much everybody on our team helped in one way or another – writing features, enabling specific platforms, giving design feedback, reviewing patches, reporting bugs, adding regressions, etc. – I am not going to say specific names: everybody deserves the credit. Also, a lot of people that are not on the Intel Linux Kernel Graphics team contributed to this work too.

Hello World

So now I finally have a blog!

The plan is to write about my adventures with the Intel Linux graphics drivers. I hope I will be able to share a little bit of my knowledge, helping people understand our challenges and maybe also influencing them to contribute.

If there’s something you would like me to write about, please ask in the comments. I can’t guarantee I will write about it, but maybe you’ll see something.

Cheers!