Blob Blame History Raw
From: mgorman <mgorman@suse.com>
Date: Fri, 25 Sep 2020 09:32:32 +0100
Subject: [PATCH] cpuidle: Poll for a minimum of 30ns and poll for a tick if
 lower c-states are disabled

References: bnc#1176588
Patch-mainline: Not yet, needs to be posted but will likely be rejected for favoring power over performance

A bug was reported against a distribution kernel about a regression
related to an application that has very large numbers of threads operating
on large amounts of memory with a mix of page faults and address space
modifications. The threads enter/exit idle states extremely rapidly and
perf indicated that a large amount of time was spent on native_safe_halt.
The application requires that cpuidle states be limited to C1 to reduce
latencies on wakeup.

The problem is that the application indirectly relied on similar behaviour
to commit 36fcb4292473 ("cpuidle: use first valid target residency as
poll time") where CPUs would poll to the lowest C-state exit latency
before exiting. As low c-states, the application more directly relies
on a37b969a61c1 ("cpuidle: poll_state: Add time limit to poll_idle()")
to poll a CPU until a rescheduling event occurred.

Rewinding this back "works" but is extreme. Instead this patch sets a
baseline polling time that is close to the C2 exit latency and anecdotally
is a common target as a wakeup latency. It guesses if lower C-states have
been disabled and if so, it polls until the rescheduling event or a tick
has passed. It's unlikely a tick will pass but it avoids the corner case
commit a37b969a61c1 ("cpuidle: poll_state: Add time limit to poll_idle()")
intended to avoid.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 drivers/cpuidle/cpuidle.c |   28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -359,6 +359,8 @@ void cpuidle_reflect(struct cpuidle_devi
 		cpuidle_curr_governor->reflect(dev, index);
 }
 
+#define MIN_POLL_TIME (30 * NSEC_PER_USEC)
+
 /**
  * cpuidle_poll_time - return amount of time to poll for,
  * governors can override dev->poll_limit_ns if necessary
@@ -371,21 +373,39 @@ u64 cpuidle_poll_time(struct cpuidle_dri
 		      struct cpuidle_device *dev)
 {
 	int i;
-	u64 limit_ns;
+	u64 limit_ns, max_limit;
 
 	if (dev->poll_limit_ns)
 		return dev->poll_limit_ns;
 
 	limit_ns = TICK_NSEC;
+	max_limit = 0;
 	for (i = 1; i < drv->state_count; i++) {
+		u64 state_limit;
+
 		if (dev->states_usage[i].disable)
 			continue;
 
-		limit_ns = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
-		break;
+		state_limit = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
+		if (limit_ns == TICK_NSEC)
+			limit_ns = state_limit;
+		max_limit = state_limit;
+	}
+
+	dev->poll_limit_ns = max_t(u64, MIN_POLL_TIME, limit_ns);
+
+	/*
+	 * If the deepest state is below the minimum, assume that c-states
+	 * are limited by the driver or kernel command line and that latency
+	 * is a concern. In this case, poll for longer periods;
+	 */
+	if (max_limit < MIN_POLL_TIME) {
+		pr_info("cpuidle deepest latency of %llu below min %llu, idling based on tick\n",
+			max_limit, (u64)MIN_POLL_TIME);
+		dev->poll_limit_ns = TICK_NSEC;
 	}
 
-	dev->poll_limit_ns = limit_ns;
+	pr_info("cpuidle polling time = %llu ns\n", dev->poll_limit_ns);
 
 	return dev->poll_limit_ns;
 }