From: mgorman <mgorman@suse.com>
Date: Fri, 25 Sep 2020 09:32:32 +0100
Subject: [PATCH] cpuidle: Poll for a minimum of 30us and poll for a tick if
 lower c-states are disabled
References: bnc#1176588
Patch-mainline: Not yet, needs to be posted but will likely be rejected for favoring power over performance

A bug was reported against a distribution kernel about a regression
related to an application that has very large numbers of threads operating
on large amounts of memory with a mix of page faults and address space
modifications. The threads enter/exit idle states extremely rapidly and
perf indicated that a large amount of time was spent on native_safe_halt.
The application requires that cpuidle states be limited to C1 to reduce
latencies on wakeup.

The problem is that the application indirectly relied on behaviour similar
to commit 36fcb4292473 ("cpuidle: use first valid target residency as
poll time"), where CPUs would poll for the shallowest enabled C-state's
target residency before giving up. With low c-states disabled, the
application more directly relies on commit a37b969a61c1 ("cpuidle:
poll_state: Add time limit to poll_idle()") to poll a CPU until a
rescheduling event occurs.
Reverting that behaviour "works" but is extreme. Instead, this patch sets
a baseline polling time that is close to the C2 exit latency, which
anecdotally is a common wakeup-latency target. It guesses whether lower
C-states have been disabled and, if so, polls until a rescheduling event
occurs or a tick has passed. It is unlikely that a tick will pass, but
this avoids the corner case that commit a37b969a61c1 ("cpuidle:
poll_state: Add time limit to poll_idle()") was intended to address.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
drivers/cpuidle/cpuidle.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 29d2d7a21bd7..b903016e653b 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -361,6 +361,8 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
cpuidle_curr_governor->reflect(dev, index);
}
+#define MIN_POLL_TIME (30 * NSEC_PER_USEC)
+
/**
* cpuidle_poll_time - return amount of time to poll for,
* governors can override dev->poll_limit_ns if necessary
@@ -373,21 +375,39 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
int i;
- u64 limit_ns;
+ u64 limit_ns, max_limit;
if (dev->poll_limit_ns)
return dev->poll_limit_ns;
limit_ns = TICK_NSEC;
+ max_limit = 0;
for (i = 1; i < drv->state_count; i++) {
+ u64 state_limit;
+
if (dev->states_usage[i].disable)
continue;
- limit_ns = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
- break;
+ state_limit = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
+ if (limit_ns == TICK_NSEC)
+ limit_ns = state_limit;
+ max_limit = state_limit;
+ }
+
+ dev->poll_limit_ns = max_t(u64, MIN_POLL_TIME, limit_ns);
+
+ /*
+ * If the deepest state is below the minimum, assume that c-states
+ * are limited by the driver or kernel command line and that latency
+ * is a concern. In this case, poll for longer periods.
+ */
+ if (max_limit < MIN_POLL_TIME) {
+ pr_info("cpuidle deepest latency of %llu below min %llu, idling based on tick\n",
+ max_limit, (u64)MIN_POLL_TIME);
+ dev->poll_limit_ns = TICK_NSEC;
}
- dev->poll_limit_ns = limit_ns;
+ pr_info("cpuidle polling time = %llu ns\n", dev->poll_limit_ns);
return dev->poll_limit_ns;
}